I think we might achieve some compromise - the snapshots would be cached,
but when asked for the existence of a snapshot, it would go to disk and
check whether manifest.json exists; if it does not, the snapshot is
effectively deleted.
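
Just to illustrate the idea (a rough sketch only - the class and field
names below are made up, this is not the actual SnapshotManager code):

import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedSnapshotIndex
{
    // tag -> manifest.json of the cached snapshot (hypothetical structure)
    private final Map<String, File> manifestsByTag = new ConcurrentHashMap<>();

    public void add(String tag, File manifest)
    {
        manifestsByTag.put(tag, manifest);
    }

    // listsnapshots_cached_checkexists: serve the listing from memory but
    // verify that manifest.json is still on disk; a missing manifest means
    // the snapshot was removed manually and is effectively deleted.
    public boolean exists(String tag)
    {
        File manifest = manifestsByTag.get(tag);
        if (manifest == null)
            return false;
        if (!manifest.exists())
        {
            manifestsByTag.remove(tag); // drop the stale entry from the cache
            return false;
        }
        return true;
    }
}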

Paulo measured the times in his comment there (1):

listsnapshots_disk: 37 seconds
listsnapshots_cached: 36 milliseconds
listsnapshots_cached_checkexists: 4 seconds

We would go with listsnapshots_cached_checkexists instead of
listsnapshots_disk. Sure, we get the greatest speedup with
listsnapshots_cached, but the cost of that is an unsynchronized view when it
is not refreshed. On the other hand, listsnapshots_cached_checkexists, which
holds the metadata in memory but just checks for the existence of a
manifest, is still an order of magnitude better than what we have now. That
is still a great improvement, and I think we could just go with it while
everything would behave the same.

I think the greatest slowdown is caused by repeatedly traversing the whole
directory tree; this is not necessary after we load it at the beginning, as
the mere check of manifest existence is enough. The biggest performance
killer, though, seems to be the computation of the true disk size*, which
needs to get the size of each file and sum them up. My idea was to also
cache the sizes, but I am not sure about that; it seems like overkill to
hold it all in memory just for this purpose.

* true disk size is the size of the files which are in the snapshot but no
longer in the live data dir. When we take a snapshot, its true disk size is
0 because at the beginning every snapshot file is a hard link into the live
data dir. Over time, as SSTables in the live data dir get compacted, files
in the snapshot dir no longer have their counterparts in the live data dir
and the true disk size becomes non-zero. We can not "cache" this as it is
variable.
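
To make that concrete, this is roughly what such a computation looks like
(a simplified sketch under the assumption that a snapshot file's live
counterpart shares the same file name; not the actual Cassandra code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public final class TrueDiskSize
{
    // Sum the sizes of snapshot files that no longer have a counterpart in
    // the live data directory. Right after the snapshot is taken every file
    // is a hard link into the live dir, so the result is 0; it grows as
    // compaction removes the original SSTables.
    public static long compute(Path snapshotDir, Path liveDataDir) throws IOException
    {
        try (Stream<Path> files = Files.walk(snapshotDir))
        {
            return files.filter(Files::isRegularFile)
                        .filter(f -> !Files.exists(liveDataDir.resolve(f.getFileName())))
                        .mapToLong(f -> f.toFile().length())
                        .sum();
        }
    }
}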


(1)
https://issues.apache.org/jira/browse/CASSANDRA-18111?focusedCommentId=17870085&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17870085

On Mon, Aug 12, 2024 at 1:11 AM Josh McKenzie <jmcken...@apache.org> wrote:

> I lean towards the documentation approach vs complicating the
> implementation.
>
> +1.
>
> My question around the need for the optimization was purely driven by the
> motivation of exploring whether the complexity of caching this data was
> warranted. From talking with Jeremiah a bit offline, sounds like there's
> been some really egregious degradation cases and the minimal memory
> footprint and status quo complexity of implementation is a good enough
> solution.
>
> On Fri, Aug 9, 2024, at 1:38 PM, Jordan West wrote:
>
> I lean towards the documentation approach vs complicating the
> implementation.
>
> For me personally: I regularly use shell commands to operate on snapshots.
> That includes listing them. I probably should use nodetool for it all
> instead though.
>
> Jordan
>
> On Fri, Aug 9, 2024 at 08:09 Štefan Miklošovič <smikloso...@apache.org>
> wrote:
>
> I understand and agree. It is just that it would be cool if we avoided the
> situation where a figurative ABC company has these "bash scripts removing
> snapshots from cron by rm -rf every second Sunday at 3:00 am" because "that
> was their workflow for ages".
>
> I am particularly sensitive to this as Cassandra is very cautious when it
> comes to not disrupting the workflows already out there.
>
> I do not know how frequent this would be or whether somebody would start to
> complain. I mean ... they could still remove it by hand, right? It is just
> that listsnapshots would not be accurate anymore without refreshing. I think
> that might be acceptable. It would be something else if we flat out made
> manual deletion forbidden, which it is not.
>
> On Fri, Aug 9, 2024 at 4:50 PM Bowen Song via dev <
> dev@cassandra.apache.org> wrote:
>
>
> If we have the documentation in place, we can then consider the cache to
> be the master copy of metadata, and rely on it to be always accurate and up
> to date. If someone deletes the snapshot files from the filesystem, they
> can't complain about Cassandra not working correctly - which is the same as
> if they had manually deleted some SSTable files (they shouldn't).
> On 09/08/2024 11:16, Štefan Miklošovič wrote:
>
> We could indeed do that. Does your suggestion mean that there should not
> be a problem with caching it all once explicitly stated like that?
>
> On Fri, Aug 9, 2024 at 12:01 PM Bowen Song via dev <
> dev@cassandra.apache.org> wrote:
>
> Has anyone considered simply updating the documentation saying this?
>
> "Removing the snapshot files directly from the filesystem may break
> things. Always use the `nodetool` command or JMX to remove snapshots."
> On 09/08/2024 09:18, Štefan Miklošovič wrote:
>
> If we consider caching it all to be too much, we might make caching an
> option an admin would need to opt in to. There might be a flag in
> cassandra.yaml; once enabled, snapshots would be held in memory, otherwise
> they would be loaded from disk as before, so people can decide whether
> caching is enough for them or whether they want to keep the previous
> behavior (which would be the default). This puts additional complexity into
> SnapshotManager but it should be in general doable.
>
> Let me know what you think, I would really like to have this resolved,
> 18111 brings a lot of code cleanup and simplifies stuff a lot.
>
> On Wed, Aug 7, 2024 at 11:30 PM Josh McKenzie <jmcken...@apache.org>
> wrote:
>
> If you have a lot of snapshots and have for example a metric monitoring
> them and their sizes, if you don’t cache it, creating the metric can cause
> performance degradation. We added the cache because we saw this happen to
> databases more than once.
>
> I mean, I believe you, I'm just surprised querying out metadata for files
> and basic computation is leading to hundreds of ms pause times even on
> systems with a lot of files. Aren't most / all of these values cached at
> the filesystem layer so we're basically just tomato / tomahto caching
> systems, either one we maintain or one the OS maintains?
>
> Or is there really just a count of files well outside what I'm thinking
> here?
>
> Anyway, not trying to cause a ruckus and make needless noise, trying to
> learn. ;)
>
>
> On Wed, Aug 7, 2024, at 3:20 PM, Štefan Miklošovič wrote:
>
>
>
> On Wed, Aug 7, 2024 at 6:39 PM Yifan Cai <yc25c...@gmail.com> wrote:
>
> With WatchService, when events are missed (which is to be expected), you
> will still need to list the files. It seems to me that WatchService
> doesn't offer significant benefits in this case.
>
>
> Yeah, I think we will leave it out eventually.
>
>
> Regarding listing directory with a refresh flag, my concern is the
> potential for abuse. End-users might/could always refresh before listing,
> which could undermine the purpose of caching. Perhaps Jeremiah can provide
> more insight on this.
>
>
> Well, by default, it would not be refreshed every single time. You would
> need to opt in to that. If there is a shop which has users with direct
> access to the disks of Cassandra nodes and they are removing data manually,
> I do not know what to say; what are nodetool clearsnapshot and the JMX
> methods good for then? I do not think we can prevent people from shooting
> themselves in the foot if they are absolutely willing to do that.
>
> If they want to refresh that every time, that would be equal to the
> current behavior. It would be at most as "bad" as it is now.
>
>
> IMO, caching is best handled internally. I have a few UX-related questions:
> - Is it valid or acceptable to return stale data? If so, end-users have to
> do some form of validation before consuming each snapshot to account for
> potential deletions.
>
>
> answer below
>
> - Even if listsnapshot returns the most recent data, is it possible that
> some of the directories get deleted when end-users are accessing them? I
> think it is true. It, then, enforces end-users to do some validation first,
> similar to handling stale data.
>
>
> I think that what you were trying to say is that when somebody lists
> snapshots at time T0 and somebody removes a snapshot manually at T1, then
> the list of snapshots is no longer accurate? Sure. That is a thing. This is
> how it currently works.
>
> Now, we want to cache them, so if you clear a snapshot which is not
> physically there because somebody removed it manually, that should be a
> no-op; it will just be removed from the internal tracker. So, if it is on
> disk and in the cache and you clear it, then all is fine. It is fine too if
> it is not on disk anymore and you clear it; it is just removed internally.
> It would fail only in case you want to remove a snapshot which is not
> cached, regardless of whether it is on disk or not. The deletion of a
> non-existent snapshot ends up with a failure; nothing should be changed in
> that regard, as this is the current behavior too.
>
> I want to say that I did not write it completely correctly at the very
> beginning of this thread. Currently, we are caching only _expiring_
> snapshots, because we need to know their time of removal so we can act on
> it later. _normal_ snapshots are _not_ cached _yet_. I spent so much time
> with 18111 that I live in a reality where it is already in; I forgot this
> is not actually in place yet, but we are very close to that.
>
> OK thank you all for your insights, I will NOT use inotify.
>
>
> Just my 2 cents.
>
> - Yifan
>
> On Wed, Aug 7, 2024 at 6:03 AM Štefan Miklošovič <smikloso...@apache.org>
> wrote:
>
> Yes, for example as reported here
>
> https://issues.apache.org/jira/browse/CASSANDRA-13338
>
> People who are charting this in monitoring dashboards might also hit this.
>
> On Wed, Aug 7, 2024 at 2:59 PM J. D. Jordan <jeremiah.jor...@gmail.com>
> wrote:
>
> If you have a lot of snapshots and have for example a metric monitoring
> them and their sizes, if you don’t cache it, creating the metric can cause
> performance degradation. We added the cache because we saw this happen to
> databases more than once.
>
> > On Aug 7, 2024, at 7:54 AM, Josh McKenzie <jmcken...@apache.org> wrote:
> >
> > 
> >>
> >> Snapshot metadata are currently stored in memory / they are cached so
> we do not need to go to disk every single time we want to list them, the
> more snapshots we have, the worse it is.
> > Are we enumerating our snapshots somewhere on the hot path, or is this
> performance concern misplaced?
> >
> >> On Wed, Aug 7, 2024, at 7:44 AM, Štefan Miklošovič wrote:
> >> Snapshot metadata are currently stored in memory / they are cached so
> we do not need to go to disk every single time we want to list them, the
> more snapshots we have, the worse it is.
> >>
> >> When a snapshot is _manually_ removed from disk, not from nodetool
> clearsnapshot, just by rm -rf on a respective snapshot directory, then such
> snapshot will be still visible in nodetool listsnapshots. Manual removal of
> a snapshot might be done e.g. by accident or by some "impatient" operator
> who just goes to disk and removes it there instead of using nodetool or
> respective JMX method.
> >>
> >> To improve UX here, what I came up with is that we might use Java's
> WatchService where each snapshot dir would be registered. WatchService is
> part of Java, it uses inotify subsystem which is what Linux kernel offers.
> The result of doing it is that once a snapshot dir is registered to be
> watched and when it is removed then we are notified about that via inotify
> / WatchService so we can react on it and remove the in-memory
> representation of that so it will not be visible in the output anymore.
> >>
> >> While this works, there are some questions / concerns
> >>
> >> 1) What do people think about inotify in general? I tested this on 10k
> >> snapshots and it seems to work just fine; nevertheless, there is in
> >> general no strong guarantee that every single event will come through, and
> >> there is also a family of kernel parameters around this where more tuning
> >> can be done etc. It is also questionable how this will behave on systems
> >> other than Linux (Mac etc). While the JRE running on different platforms
> >> also implements this, I am not completely sure those implementations are
> >> quality-wise the same as the Linux one. There is a history of
> >> not-so-good implementations for other systems (events not coming through
> >> on Macs etc) and while I think we are safe on Linux, I am not sure we want
> >> to go with this elsewhere.
> >>
> >> 2) inotify brings more entropy into the codebase, it is another thing
> we need to take care of etc (however, it is all concentrated in one class
> and pretty much "isolated" from everything else)
> >>
> >> I made this feature optional and it is turned off by default so people
> >> need to explicitly opt in to this so we are not forcing it on anybody.
> >>
> >> If we do not want to go with inotify, another option would be to have a
> background thread which would periodically check if a manifest exists on a
> disk, if it does not, then a snapshot does not either. While this works,
> what I do not like about this is that the primary reason we moved it to
> memory was to bypass IO as much as possible yet here we would introduce
> another check which would go to disk, and this would be done periodically,
> >> which defeats the whole purpose. If an operator lists snapshots once a week
> and there is a background check running every 10 minutes (for example),
> >> then the cumulative number of IO operations might be bigger than us just
> doing nothing at all. For this reason, if we do not want to go with
> inotify, I would also not implement any automatic background check. Instead
> of that, there would be SnapshotManagerMbean#refresh() method which would
> just forcibly reload all snapshots from scratch. I think that manual
> deletion of snapshots is not something a user would do regularly, snapshots
> are meant to be removed via nodetool or JMX. If manual removal ever happens
> then in order to make it synchronized again, the refreshing of these
> snapshots would be required. There might be an additional flag in nodetool
> listsnapshots, once specified, refreshing would be done, otherwise not.
> >>
> >> How does this all sound to people?
> >>
> >> Regards
> >
>
>
>
>
