Re: [DISCUSS] inotify for detection of manually removed snapshots

Josh McKenzie Sun, 11 Aug 2024 16:10:55 -0700

> I lean towards the documentation approach vs complicating the implementation. 
+1.


My question around the need for the optimization was purely driven by the 
motivation of exploring whether the complexity of caching this data was 
warranted. From talking with Jeremiah a bit offline, sounds like there's been 
some really egregious degradation cases and the minimal memory footprint and 
status quo complexity of implementation is a good enough solution.

On Fri, Aug 9, 2024, at 1:38 PM, Jordan West wrote:
> I lean towards the documentation approach vs complicating the implementation. 
> 
> For me personally: I regularly use shell commands to operate on snapshots. 
> That includes listing them. I probably should use nodetool for it all instead 
> though. 
> 
> Jordan 
> 
> On Fri, Aug 9, 2024 at 08:09 Štefan Miklošovič <[email protected]> wrote:
>> I understand and agree. It is just that it would be cool if we avoided the 
>> situation when there is a figurative ABC company which has these "bash 
>> scripts removing snapshots from cron by rm -rf every second Sunday at 3:00 
>> am" because "that was their workflow for ages".
>> 
>> I am particularly sensitive to this as Cassandra is very cautious when it 
>> comes to not disrupting the workflows already out there.
>> 
>> I do not know how frequent this would be and if somebody started to 
>> complain. I mean ... they could still remove it by hand, right? It is just 
>> listsnapshots would not be relevant anymore without refreshing it. I think 
>> that might be acceptable. It would be something else if we flat out made 
>> manual deletion forbidden, which it is not.
>> 
>> On Fri, Aug 9, 2024 at 4:50 PM Bowen Song via dev <[email protected]> 
>> wrote:
>>> __
>>> If we have the documentation in place, we can then consider the cache to be 
>>> the master copy of metadata, and rely on it to be always accurate and up to 
>>> date. If someone deletes the snapshot files from filesystem, they can't 
>>> complain about Cassandra stopped working correctly - which is the same if 
>>> they had manually deleted some SSTable files (they shouldn't).
>>> 
>>> On 09/08/2024 11:16, Štefan Miklošovič wrote:
>>>> We could indeed do that. Does your suggestion mean that there should not 
>>>> be a problem with caching it all once explicitly stated like that?
>>>> 
>>>> On Fri, Aug 9, 2024 at 12:01 PM Bowen Song via dev 
>>>> <[email protected]> wrote:
>>>>> Has anyone considered simply updating the documentation saying this?
>>>>> 
>>>>> "Removing the snapshot files directly from the filesystem may break 
>>>>> things. Always use the `nodetool` command or JMX to remove snapshots."
>>>>> 
>>>>> On 09/08/2024 09:18, Štefan Miklošovič wrote:
>>>>>> If we consider caching it all to be too much, we might probably make 
>>>>>> caching an option an admin would need to opt-in into? There might be a 
>>>>>> flag in cassandra.yaml, once enabled, it would be in memory, otherwise 
>>>>>> it would just load it as it was so people can decide if caching is 
>>>>>> enough for them or they want to have it as it was before (would be by 
>>>>>> default set to as it was). This puts additional complexity into 
>>>>>> SnapshotManager but it should be in general doable. 
>>>>>> 
>>>>>> Let me know what you think, I would really like to have this resolved, 
>>>>>> 18111 brings a lot of code cleanup and simplifies stuff a lot.
>>>>>> 
>>>>>> On Wed, Aug 7, 2024 at 11:30 PM Josh McKenzie <[email protected]> 
>>>>>> wrote:
>>>>>>>> If you have a lot of snapshots and have for example a metric 
>>>>>>>> monitoring them and their sizes, if you don’t cache it, creating the 
>>>>>>>> metric can cause performance degradation. We added the cache because 
>>>>>>>> we saw this happen to databases more than once.
>>>>>>> I mean, I believe you, I'm just surprised querying out metadata for 
>>>>>>> files and basic computation is leading to hundreds of ms pause times 
>>>>>>> even on systems with a lot of files. Aren't most / all of these values 
>>>>>>> cached at the filesystem layer so we're basically just tomato / tomahto 
>>>>>>> caching systems, either one we maintain or one the OS maintains?
>>>>>>> 
>>>>>>> Or is there really just a count of files well outside what I'm thinking 
>>>>>>> here?
>>>>>>> 
>>>>>>> Anyway, not trying to cause a ruckus and make needless noise, trying to 
>>>>>>> learn. ;)
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Aug 7, 2024, at 3:20 PM, Štefan Miklošovič wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Aug 7, 2024 at 6:39 PM Yifan Cai <[email protected]> wrote:
>>>>>>>>> With WatcherService, when events are missed (which is to be 
>>>>>>>>> expected), you will still need to list the files. It seems to me that 
>>>>>>>>> WatcherService doesn't offer significant benefits in this case.
>>>>>>>>  
>>>>>>>> Yeah I think we leave it out eventually.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Regarding listing directory with a refresh flag, my concern is the 
>>>>>>>>> potential for abuse. End-users might/could always refresh before 
>>>>>>>>> listing, which could undermine the purpose of caching. Perhaps 
>>>>>>>>> Jeremiah can provide more insight on this.
>>>>>>>> 
>>>>>>>> Well, by default, it would not be refreshed every single time. You 
>>>>>>>> would need to opt-in into that. If there is a shop which has users 
>>>>>>>> with a direct access to the disk of Cassandra nodes and they are 
>>>>>>>> removing data manually, I do not know what to say, what is nodetool 
>>>>>>>> clearsnapshot and jmx methods good for then? I do not think we can 
>>>>>>>> prevent people from shooting into their feet if they are absolutely 
>>>>>>>> willing to do that.
>>>>>>>> 
>>>>>>>> If they want to refresh that every time, that would be equal to the 
>>>>>>>> current behavior. It would be at most as "bad" as it is now.
>>>>>>>>  
>>>>>>>>> IMO, caching is best handled internally. I have a few UX-related 
>>>>>>>>> questions:
>>>>>>>>> - Is it valid or acceptable to return stale data? If so, end-users 
>>>>>>>>> have to do some form of validation before consuming each snapshot to 
>>>>>>>>> account for potential deletions. 
>>>>>>>> 
>>>>>>>> answer below
>>>>>>>> 
>>>>>>>>> - Even if listsnapshot returns the most recent data, is it possible 
>>>>>>>>> that some of the directories get deleted when end-users are accessing 
>>>>>>>>> them? I think it is true. It, then, enforces end-users to do some 
>>>>>>>>> validation first, similar to handling stale data. 
>>>>>>>> 
>>>>>>>> I think that what you were trying to say is that when at time T0 
>>>>>>>> somebody lists snapshots and at T1 somebody removes a snapshot 
>>>>>>>> manually then the list of snapshots is not actual anymore? Sure. That 
>>>>>>>> is a thing. This is how it currently works. 
>>>>>>>> 
>>>>>>>> Now, we want to cache them, so if you clear a snapshot which is not 
>>>>>>>> physically there because somebody removed it manually, that should be 
>>>>>>>> a no-op, it will be just removed from the internal tracker. So, if it 
>>>>>>>> is at disk and in cache and you clear it, then all is fine. It is fine 
>>>>>>>> too if it is not on disk anymore and you clear it, then it is just 
>>>>>>>> removed internally. It would fail only in case you want to remove a 
>>>>>>>> snapshot which is not cached, regardless whether it is on disk or not. 
>>>>>>>> The deletion of non-existing snapshot ends up with a failure, nothing 
>>>>>>>> should be changed in that regard, this is the current behavior too.
>>>>>>>> 
>>>>>>>> I want to say that I did not write it completely correctly at the very 
>>>>>>>> beginning of this thread. Currently, we are caching only _expiring_ 
>>>>>>>> snapshots, because we need to know what is their time of removal so we 
>>>>>>>> act on it later. _normal_ snapshots are _not_ cached _yet_. I spent so 
>>>>>>>> much time with 18111 that I live in a reality where it is already in, 
>>>>>>>> I forgot this is not actually in place yet, we are very close to that.
>>>>>>>>  
>>>>>>>> OK thank you all for your insights, I will NOT use inotify.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Just my 2 cents. 
>>>>>>>>> 
>>>>>>>>> - Yifan
>>>>>>>>> 
>>>>>>>>> On Wed, Aug 7, 2024 at 6:03 AM Štefan Miklošovič 
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> Yes, for example as reported here
>>>>>>>>>> 
>>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13338
>>>>>>>>>> 
>>>>>>>>>> People who are charting this in monitoring dashboards might also hit 
>>>>>>>>>> this.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Aug 7, 2024 at 2:59 PM J. D. Jordan 
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> If you have a lot of snapshots and have for example a metric 
>>>>>>>>>>> monitoring them and their sizes, if you don’t cache it, creating 
>>>>>>>>>>> the metric can cause performance degradation. We added the cache 
>>>>>>>>>>> because we saw this happen to databases more than once.
>>>>>>>>>>> 
>>>>>>>>>>> > On Aug 7, 2024, at 7:54 AM, Josh McKenzie <[email protected]> 
>>>>>>>>>>> > wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > 
>>>>>>>>>>> >>
>>>>>>>>>>> >> Snapshot metadata are currently stored in memory / they are 
>>>>>>>>>>> >> cached so we do not need to go to disk every single time we want 
>>>>>>>>>>> >> to list them, the more snapshots we have, the worse it is.
>>>>>>>>>>> > Are we enumerating our snapshots somewhere on the hot path, or is 
>>>>>>>>>>> > this performance concern misplaced?
>>>>>>>>>>> >
>>>>>>>>>>> >> On Wed, Aug 7, 2024, at 7:44 AM, Štefan Miklošovič wrote:
>>>>>>>>>>> >> Snapshot metadata are currently stored in memory / they are 
>>>>>>>>>>> >> cached so we do not need to go to disk every single time we want 
>>>>>>>>>>> >> to list them, the more snapshots we have, the worse it is.
>>>>>>>>>>> >>
>>>>>>>>>>> >> When a snapshot is _manually_ removed from disk, not from 
>>>>>>>>>>> >> nodetool clearsnapshot, just by rm -rf on a respective snapshot 
>>>>>>>>>>> >> directory, then such snapshot will be still visible in nodetool 
>>>>>>>>>>> >> listsnapshots. Manual removal of a snapshot might be done e.g. 
>>>>>>>>>>> >> by accident or by some "impatient" operator who just goes to 
>>>>>>>>>>> >> disk and removes it there instead of using nodetool or 
>>>>>>>>>>> >> respective JMX method.
>>>>>>>>>>> >>
>>>>>>>>>>> >> To improve UX here, what I came up with is that we might use 
>>>>>>>>>>> >> Java's WatchService where each snapshot dir would be registered. 
>>>>>>>>>>> >> WatchService is part of Java, it uses inotify subsystem which is 
>>>>>>>>>>> >> what Linux kernel offers. The result of doing it is that once a 
>>>>>>>>>>> >> snapshot dir is registered to be watched and when it is removed 
>>>>>>>>>>> >> then we are notified about that via inotify / WatchService so we 
>>>>>>>>>>> >> can react on it and remove the in-memory representation of that 
>>>>>>>>>>> >> so it will not be visible in the output anymore.
>>>>>>>>>>> >>
>>>>>>>>>>> >> While this works, there are some questions / concerns
>>>>>>>>>>> >>
>>>>>>>>>>> >> 1) What do people think about inotify in general? I tested this 
>>>>>>>>>>> >> on 10k snapshots and it seems to work just fine, nevertheless 
>>>>>>>>>>> >> there is in general no strong guarantee that every single event 
>>>>>>>>>>> >> will come through, there is also a family of kernel parameters 
>>>>>>>>>>> >> around this where more tuning can be done etc. It is also 
>>>>>>>>>>> >> questionable how this will behave on other systems from Linux 
>>>>>>>>>>> >> (Mac etc). While JRE running on different platforms also 
>>>>>>>>>>> >> implements this, I am not completely sure these implementations 
>>>>>>>>>>> >> are quality-wise the same as for Linux etc. There is a history 
>>>>>>>>>>> >> of not-so-quality implementations for other systems (events not 
>>>>>>>>>>> >> coming through on Macs etc) and while I think we are safe on 
>>>>>>>>>>> >> Linux, I am not sure we want to go with this elsewhere.
>>>>>>>>>>> >>
>>>>>>>>>>> >> 2) inotify brings more entropy into the codebase, it is another 
>>>>>>>>>>> >> thing we need to take care of etc (however, it is all 
>>>>>>>>>>> >> concentrated in one class and pretty much "isolated" from 
>>>>>>>>>>> >> everything else)
>>>>>>>>>>> >>
>>>>>>>>>>> >> I made this feature optional and it is turned off by default so 
>>>>>>>>>>> >> people need to explicitly opt-in into this so we are not forcing 
>>>>>>>>>>> >> it on anybody.
>>>>>>>>>>> >>
>>>>>>>>>>> >> If we do not want to go with inotify, another option would be to 
>>>>>>>>>>> >> have a background thread which would periodically check if a 
>>>>>>>>>>> >> manifest exists on a disk, if it does not, then a snapshot does 
>>>>>>>>>>> >> not either. While this works, what I do not like about this is 
>>>>>>>>>>> >> that the primary reason we moved it to memory was to bypass IO 
>>>>>>>>>>> >> as much as possible yet here we would introduce another check 
>>>>>>>>>>> >> which would go to disk, and this would be done periodically, 
>>>>>>>>>>> >> which beats the whole purpose. If an operator lists snapshots 
>>>>>>>>>>> >> once a week and there is a background check running every 10 
>>>>>>>>>>> >> minutes (for example), then the cummulative number of IO 
>>>>>>>>>>> >> operations migth be bigger than us just doing nothing at all. 
>>>>>>>>>>> >> For this reason, if we do not want to go with inotify, I would 
>>>>>>>>>>> >> also not implement any automatic background check. Instead of 
>>>>>>>>>>> >> that, there would be SnapshotManagerMbean#refresh() method which 
>>>>>>>>>>> >> would just forcibly reload all snapshots from scratch. I think 
>>>>>>>>>>> >> that manual deletion of snapshots is not something a user would 
>>>>>>>>>>> >> do regularly, snapshots are meant to be removed via nodetool or 
>>>>>>>>>>> >> JMX. If manual removal ever happens then in order to make it 
>>>>>>>>>>> >> synchronized again, the refreshing of these snapshots would be 
>>>>>>>>>>> >> required. There might be an additional flag in nodetool 
>>>>>>>>>>> >> listsnapshots, once specified, refreshing would be done, 
>>>>>>>>>>> >> otherwise not.
>>>>>>>>>>> >>
>>>>>>>>>>> >> How does this all sound to people?
>>>>>>>>>>> >>
>>>>>>>>>>> >> Regards
>>>>>>>>>>> >
>>>>>>>

Re: [DISCUSS] inotify for detection of manually removed snapshots

Reply via email to