I understand and agree. It is just that it would be cool if we avoided the situation where a figurative ABC company has these "bash scripts removing snapshots from cron by rm -rf every second Sunday at 3:00 am" because "that was their workflow for ages".
I am particularly sensitive to this as Cassandra is very cautious when it comes to not disrupting the workflows already out there. I do not know how frequent this would be and whether somebody would start to complain. I mean ... they could still remove it by hand, right? It is just that listsnapshots would no longer be accurate without refreshing it. I think that might be acceptable. It would be something else if we flat out made manual deletion forbidden, which it is not.

On Fri, Aug 9, 2024 at 4:50 PM Bowen Song via dev <dev@cassandra.apache.org> wrote:

> If we have the documentation in place, we can then consider the cache to be the master copy of metadata, and rely on it to be always accurate and up to date. If someone deletes the snapshot files from the filesystem, they can't complain about Cassandra no longer working correctly - which is the same as if they had manually deleted some SSTable files (they shouldn't).

> On 09/08/2024 11:16, Štefan Miklošovič wrote:

> > We could indeed do that. Does your suggestion mean that there should not be a problem with caching it all once it is explicitly stated like that?

> > On Fri, Aug 9, 2024 at 12:01 PM Bowen Song via dev <dev@cassandra.apache.org> wrote:

>> Has anyone considered simply updating the documentation to say this?

>> "Removing the snapshot files directly from the filesystem may break things. Always use the `nodetool` command or JMX to remove snapshots."

>> On 09/08/2024 09:18, Štefan Miklošovič wrote:

>> If we consider caching it all to be too much, we might make caching an option an admin would need to opt in to. There might be a flag in cassandra.yaml; once enabled, the metadata would be held in memory, otherwise it would just be loaded from disk as before, so people can decide whether caching is enough for them or they want the old behaviour (which would be the default). This puts additional complexity into SnapshotManager but it should in general be doable.

>> Let me know what you think, I would really like to have this resolved, 18111 brings a lot of code cleanup and simplifies stuff a lot.

>> On Wed, Aug 7, 2024 at 11:30 PM Josh McKenzie <jmcken...@apache.org> wrote:

>>> If you have a lot of snapshots and have for example a metric monitoring them and their sizes, if you don’t cache it, creating the metric can cause performance degradation. We added the cache because we saw this happen to databases more than once.

>>> I mean, I believe you, I'm just surprised querying out metadata for files and basic computation is leading to hundreds of ms pause times even on systems with a lot of files. Aren't most / all of these values cached at the filesystem layer, so we're basically just tomato / tomahto caching systems, either one we maintain or one the OS maintains?

>>> Or is there really just a count of files well outside what I'm thinking here?

>>> Anyway, not trying to cause a ruckus and make needless noise, trying to learn. ;)

>>> On Wed, Aug 7, 2024, at 3:20 PM, Štefan Miklošovič wrote:

>>> On Wed, Aug 7, 2024 at 6:39 PM Yifan Cai <yc25c...@gmail.com> wrote:

>>> With WatcherService, when events are missed (which is to be expected), you will still need to list the files. It seems to me that WatcherService doesn't offer significant benefits in this case.

>>> Yeah, I think we will leave it out eventually.
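Purely to illustrate the opt-in caching flag Štefan floats further up in this message, here is a minimal Java sketch of what such a switch could look like; the flag, class and field names are invented for this example and are not the 18111 code:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // Hypothetical sketch: a single yaml-backed switch deciding whether
    // listsnapshots is answered from an in-memory map or by listing the disk.
    public class SnapshotLister
    {
        private final boolean cacheSnapshots; // would map to an invented cassandra.yaml flag, e.g. snapshot_metadata_cache_enabled
        private final Path snapshotsRoot;     // directory holding one subdirectory per snapshot tag
        private final Map<String, Path> cache = new ConcurrentHashMap<>();

        public SnapshotLister(boolean cacheSnapshots, Path snapshotsRoot)
        {
            this.cacheSnapshots = cacheSnapshots;
            this.snapshotsRoot = snapshotsRoot;
        }

        public Set<String> listSnapshots() throws IOException
        {
            if (cacheSnapshots && !cache.isEmpty())
                return cache.keySet(); // cached path: no disk IO on every listing

            try (Stream<Path> dirs = Files.list(snapshotsRoot)) // old behaviour: hit the disk each time
            {
                Set<String> tags = dirs.filter(Files::isDirectory)
                                       .map(p -> p.getFileName().toString())
                                       .collect(Collectors.toSet());
                if (cacheSnapshots)
                    tags.forEach(tag -> cache.put(tag, snapshotsRoot.resolve(tag)));
                return tags;
            }
        }
    }

With the flag off, every listing goes to disk exactly as today; with it on, the listing is served from memory and only repopulated on the first call or an explicit refresh.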
>>> Regarding listing the directory with a refresh flag, my concern is the potential for abuse. End-users might always refresh before listing, which could undermine the purpose of caching. Perhaps Jeremiah can provide more insight on this.

>>> Well, by default, it would not be refreshed every single time. You would need to opt in to that. If there is a shop which has users with direct access to the disks of Cassandra nodes and they are removing data manually, I do not know what to say; what are nodetool clearsnapshot and the JMX methods good for then? I do not think we can prevent people from shooting themselves in the foot if they are absolutely willing to do that.

>>> If they want to refresh it every time, that would be equal to the current behavior. It would be at most as "bad" as it is now.

>>> IMO, caching is best handled internally. I have a few UX-related questions:
>>> - Is it valid or acceptable to return stale data? If so, end-users have to do some form of validation before consuming each snapshot to account for potential deletions.

>>> answer below

>>> - Even if listsnapshots returns the most recent data, is it possible that some of the directories get deleted while end-users are accessing them? I think it is. It then forces end-users to do some validation first, similar to handling stale data.

>>> I think that what you were trying to say is that when at time T0 somebody lists snapshots and at T1 somebody removes a snapshot manually, then the list of snapshots is no longer accurate? Sure. That is a thing. This is how it currently works.

>>> Now, we want to cache them, so if you clear a snapshot which is not physically there because somebody removed it manually, that should be a no-op, it will just be removed from the internal tracker. So, if it is on disk and in the cache and you clear it, then all is fine. It is fine too if it is not on disk anymore and you clear it; then it is just removed internally. It would fail only in the case where you want to remove a snapshot which is not cached, regardless of whether it is on disk or not. The deletion of a non-existent snapshot ends up as a failure; nothing should be changed in that regard, this is the current behavior too.

>>> I want to say that I did not write it completely correctly at the very beginning of this thread. Currently, we are caching only _expiring_ snapshots, because we need to know their time of removal so we can act on it later. _Normal_ snapshots are _not_ cached _yet_. I spent so much time with 18111 that I live in a reality where it is already in; I forgot this is not actually in place yet, we are very close to that.

>>> OK thank you all for your insights, I will NOT use inotify.

>>> Just my 2 cents.

>>> - Yifan

>>> On Wed, Aug 7, 2024 at 6:03 AM Štefan Miklošovič <smikloso...@apache.org> wrote:

>>> Yes, for example as reported here

>>> https://issues.apache.org/jira/browse/CASSANDRA-13338

>>> People who are charting this in monitoring dashboards might also hit this.

>>> On Wed, Aug 7, 2024 at 2:59 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:

>>> If you have a lot of snapshots and have for example a metric monitoring them and their sizes, if you don’t cache it, creating the metric can cause performance degradation. We added the cache because we saw this happen to databases more than once.
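A minimal sketch of the clearing semantics Štefan describes above (clearing a cached snapshot whose files were already removed manually is a no-op on disk, clearing an unknown one fails); the class and helper names are invented for illustration and this is not the CASSANDRA-18111 code:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Comparator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;

    // Hypothetical in-memory tracker illustrating the described clearsnapshot behaviour.
    public class SnapshotTracker
    {
        private final Map<String, Path> snapshots = new ConcurrentHashMap<>(); // tag -> snapshot directory

        public void clearSnapshot(String tag)
        {
            Path dir = snapshots.remove(tag);
            if (dir == null)
                throw new IllegalArgumentException("Snapshot " + tag + " does not exist"); // not cached: fails, as it does today

            if (Files.exists(dir))
                deleteRecursively(dir); // cached and still on disk: delete the files

            // cached but already removed manually: nothing left to delete,
            // the entry is simply forgotten, which makes the call a no-op on disk
        }

        private static void deleteRecursively(Path dir)
        {
            try (Stream<Path> paths = Files.walk(dir))
            {
                paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
            }
            catch (IOException e)
            {
                throw new UncheckedIOException(e);
            }
        }
    }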
>>> > On Aug 7, 2024, at 7:54 AM, Josh McKenzie <jmcken...@apache.org> wrote:

>>> >> Snapshot metadata are currently stored in memory / they are cached so we do not need to go to disk every single time we want to list them; the more snapshots we have, the worse it is.

>>> > Are we enumerating our snapshots somewhere on the hot path, or is this performance concern misplaced?

>>> >> On Wed, Aug 7, 2024, at 7:44 AM, Štefan Miklošovič wrote:

>>> >> Snapshot metadata are currently stored in memory / they are cached so we do not need to go to disk every single time we want to list them; the more snapshots we have, the worse it is.

>>> >> When a snapshot is _manually_ removed from disk, not via nodetool clearsnapshot, just by rm -rf on the respective snapshot directory, then such a snapshot will still be visible in nodetool listsnapshots. Manual removal of a snapshot might be done e.g. by accident or by some "impatient" operator who just goes to disk and removes it there instead of using nodetool or the respective JMX method.

>>> >> To improve UX here, what I came up with is that we might use Java's WatchService, where each snapshot dir would be registered. WatchService is part of Java; it uses the inotify subsystem the Linux kernel offers. The result is that once a snapshot dir is registered to be watched and it is later removed, we are notified about that via inotify / WatchService, so we can react to it and remove the in-memory representation, and the snapshot will no longer be visible in the output.

>>> >> While this works, there are some questions / concerns:

>>> >> 1) What do people think about inotify in general? I tested this on 10k snapshots and it seems to work just fine; nevertheless, there is in general no strong guarantee that every single event will come through, and there is also a family of kernel parameters around this where more tuning can be done etc. It is also questionable how this will behave on systems other than Linux (Mac etc). While the JRE on other platforms also implements this, I am not completely sure those implementations are quality-wise the same as the Linux one. There is a history of not-so-great implementations for other systems (events not coming through on Macs etc), and while I think we are safe on Linux, I am not sure we want to go with this elsewhere.

>>> >> 2) inotify brings more entropy into the codebase; it is another thing we need to take care of etc. (however, it is all concentrated in one class and pretty much "isolated" from everything else).

>>> >> I made this feature optional and it is turned off by default, so people need to explicitly opt in to it and we are not forcing it on anybody.
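For illustration, a minimal sketch of the WatchService idea described above: snapshot directories are tracked in memory and the cached entry is dropped when a directory disappears behind Cassandra's back. Class and field names are invented for this example, it is not the actual patch; note that the parent directory is the one registered, so that deletion of the snapshot directory itself produces an ENTRY_DELETE event:

    import java.io.IOException;
    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.WatchEvent;
    import java.nio.file.WatchKey;
    import java.nio.file.WatchService;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
    import static java.nio.file.StandardWatchEventKinds.OVERFLOW;

    // Hypothetical sketch: forget cached snapshots whose directories were removed manually.
    public class SnapshotWatcher implements Runnable
    {
        private final WatchService watcher;
        private final Map<Path, String> cachedSnapshots = new ConcurrentHashMap<>(); // snapshot dir -> tag

        public SnapshotWatcher() throws IOException
        {
            this.watcher = FileSystems.getDefault().newWatchService();
        }

        // Register the parent directory so we see ENTRY_DELETE for the snapshot dir itself.
        public void track(Path snapshotDir, String tag) throws IOException
        {
            cachedSnapshots.put(snapshotDir, tag);
            snapshotDir.getParent().register(watcher, ENTRY_DELETE);
        }

        @Override
        public void run()
        {
            while (!Thread.currentThread().isInterrupted())
            {
                WatchKey key;
                try
                {
                    key = watcher.take(); // blocks until the kernel reports events
                }
                catch (InterruptedException e)
                {
                    return;
                }

                Path watchedDir = (Path) key.watchable();
                for (WatchEvent<?> event : key.pollEvents())
                {
                    if (event.kind() == OVERFLOW)
                        continue; // events were lost; a full re-listing would be needed here

                    Path deleted = watchedDir.resolve((Path) event.context());
                    cachedSnapshots.remove(deleted); // drop the in-memory representation
                }
                key.reset();
            }
        }
    }

The OVERFLOW branch is where Yifan's point about missed events would bite: once events are dropped, a full directory listing is the only way to resynchronise the cache.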
>>> >> If we do not want to go with inotify, another option would be to have a background thread which periodically checks whether a manifest exists on disk; if it does not, then the snapshot does not either. While this works, what I do not like about it is that the primary reason we moved this to memory was to bypass IO as much as possible, yet here we would introduce another check which goes to disk, and periodically at that, which beats the whole purpose. If an operator lists snapshots once a week and there is a background check running every 10 minutes (for example), then the cumulative number of IO operations might be bigger than us just doing nothing at all.

>>> >> For this reason, if we do not want to go with inotify, I would also not implement any automatic background check. Instead, there would be a SnapshotManagerMbean#refresh() method which would just forcibly reload all snapshots from scratch. I think that manual deletion of snapshots is not something a user would do regularly; snapshots are meant to be removed via nodetool or JMX. If manual removal ever happens, then in order to make things synchronized again, refreshing of these snapshots would be required. There might be an additional flag in nodetool listsnapshots; when specified, refreshing would be done first, otherwise not.

>>> >> How does this all sound to people?

>>> >> Regards
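To make the refresh idea above concrete, a small sketch of what a forced reload could look like, again with invented names (the real MBean method, if it materialises, would differ); it simply rebuilds the in-memory view from whatever snapshot directories are actually on disk:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;

    // Hypothetical sketch of a forced refresh that re-reads snapshot directories from disk.
    public class SnapshotRefresher
    {
        private final Path snapshotsRoot; // e.g. a table's snapshots directory, one subdirectory per tag
        private final Map<String, Path> cache = new ConcurrentHashMap<>();

        public SnapshotRefresher(Path snapshotsRoot)
        {
            this.snapshotsRoot = snapshotsRoot;
        }

        // Would be exposed through JMX (and possibly a nodetool flag) and called
        // only after snapshots were removed manually, not on every listing.
        public void refresh()
        {
            Map<String, Path> fresh = new ConcurrentHashMap<>();
            try (Stream<Path> dirs = Files.list(snapshotsRoot))
            {
                dirs.filter(Files::isDirectory)
                    .forEach(dir -> fresh.put(dir.getFileName().toString(), dir));
            }
            catch (IOException e)
            {
                throw new UncheckedIOException(e);
            }
            cache.clear();
            cache.putAll(fresh);
        }
    }

Because the reload only runs when explicitly requested, the steady-state IO cost stays at zero, which is the property the periodic background check would have given up.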