> I lean towards the documentation approach vs complicating the implementation. +1.
My question around the need for the optimization was purely driven by the motivation of exploring whether the complexity of caching this data was warranted. From talking with Jeremiah a bit offline, sounds like there's been some really egregious degradation cases and the minimal memory footprint and status quo complexity of implementation is a good enough solution. On Fri, Aug 9, 2024, at 1:38 PM, Jordan West wrote: > I lean towards the documentation approach vs complicating the implementation. > > For me personally: I regularly use shell commands to operate on snapshots. > That includes listing them. I probably should use nodetool for it all instead > though. > > Jordan > > On Fri, Aug 9, 2024 at 08:09 Štefan Miklošovič <smikloso...@apache.org> wrote: >> I understand and agree. It is just that it would be cool if we avoided the >> situation when there is a figurative ABC company which has these "bash >> scripts removing snapshots from cron by rm -rf every second Sunday at 3:00 >> am" because "that was their workflow for ages". >> >> I am particularly sensitive to this as Cassandra is very cautious when it >> comes to not disrupting the workflows already out there. >> >> I do not know how frequent this would be and if somebody started to >> complain. I mean ... they could still remove it by hand, right? It is just >> listsnapshots would not be relevant anymore without refreshing it. I think >> that might be acceptable. It would be something else if we flat out made >> manual deletion forbidden, which it is not. >> >> On Fri, Aug 9, 2024 at 4:50 PM Bowen Song via dev <dev@cassandra.apache.org> >> wrote: >>> __ >>> If we have the documentation in place, we can then consider the cache to be >>> the master copy of metadata, and rely on it to be always accurate and up to >>> date. If someone deletes the snapshot files from filesystem, they can't >>> complain about Cassandra stopped working correctly - which is the same if >>> they had manually deleted some SSTable files (they shouldn't). >>> >>> On 09/08/2024 11:16, Štefan Miklošovič wrote: >>>> We could indeed do that. Does your suggestion mean that there should not >>>> be a problem with caching it all once explicitly stated like that? >>>> >>>> On Fri, Aug 9, 2024 at 12:01 PM Bowen Song via dev >>>> <dev@cassandra.apache.org> wrote: >>>>> Has anyone considered simply updating the documentation saying this? >>>>> >>>>> "Removing the snapshot files directly from the filesystem may break >>>>> things. Always use the `nodetool` command or JMX to remove snapshots." >>>>> >>>>> On 09/08/2024 09:18, Štefan Miklošovič wrote: >>>>>> If we consider caching it all to be too much, we might probably make >>>>>> caching an option an admin would need to opt-in into? There might be a >>>>>> flag in cassandra.yaml, once enabled, it would be in memory, otherwise >>>>>> it would just load it as it was so people can decide if caching is >>>>>> enough for them or they want to have it as it was before (would be by >>>>>> default set to as it was). This puts additional complexity into >>>>>> SnapshotManager but it should be in general doable. >>>>>> >>>>>> Let me know what you think, I would really like to have this resolved, >>>>>> 18111 brings a lot of code cleanup and simplifies stuff a lot. >>>>>> >>>>>> On Wed, Aug 7, 2024 at 11:30 PM Josh McKenzie <jmcken...@apache.org> >>>>>> wrote: >>>>>>>> If you have a lot of snapshots and have for example a metric >>>>>>>> monitoring them and their sizes, if you don’t cache it, creating the >>>>>>>> metric can cause performance degradation. We added the cache because >>>>>>>> we saw this happen to databases more than once. >>>>>>> I mean, I believe you, I'm just surprised querying out metadata for >>>>>>> files and basic computation is leading to hundreds of ms pause times >>>>>>> even on systems with a lot of files. Aren't most / all of these values >>>>>>> cached at the filesystem layer so we're basically just tomato / tomahto >>>>>>> caching systems, either one we maintain or one the OS maintains? >>>>>>> >>>>>>> Or is there really just a count of files well outside what I'm thinking >>>>>>> here? >>>>>>> >>>>>>> Anyway, not trying to cause a ruckus and make needless noise, trying to >>>>>>> learn. ;) >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 7, 2024, at 3:20 PM, Štefan Miklošovič wrote: >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Aug 7, 2024 at 6:39 PM Yifan Cai <yc25c...@gmail.com> wrote: >>>>>>>>> With WatcherService, when events are missed (which is to be >>>>>>>>> expected), you will still need to list the files. It seems to me that >>>>>>>>> WatcherService doesn't offer significant benefits in this case. >>>>>>>> >>>>>>>> Yeah I think we leave it out eventually. >>>>>>>> >>>>>>>>> >>>>>>>>> Regarding listing directory with a refresh flag, my concern is the >>>>>>>>> potential for abuse. End-users might/could always refresh before >>>>>>>>> listing, which could undermine the purpose of caching. Perhaps >>>>>>>>> Jeremiah can provide more insight on this. >>>>>>>> >>>>>>>> Well, by default, it would not be refreshed every single time. You >>>>>>>> would need to opt-in into that. If there is a shop which has users >>>>>>>> with a direct access to the disk of Cassandra nodes and they are >>>>>>>> removing data manually, I do not know what to say, what is nodetool >>>>>>>> clearsnapshot and jmx methods good for then? I do not think we can >>>>>>>> prevent people from shooting into their feet if they are absolutely >>>>>>>> willing to do that. >>>>>>>> >>>>>>>> If they want to refresh that every time, that would be equal to the >>>>>>>> current behavior. It would be at most as "bad" as it is now. >>>>>>>> >>>>>>>>> IMO, caching is best handled internally. I have a few UX-related >>>>>>>>> questions: >>>>>>>>> - Is it valid or acceptable to return stale data? If so, end-users >>>>>>>>> have to do some form of validation before consuming each snapshot to >>>>>>>>> account for potential deletions. >>>>>>>> >>>>>>>> answer below >>>>>>>> >>>>>>>>> - Even if listsnapshot returns the most recent data, is it possible >>>>>>>>> that some of the directories get deleted when end-users are accessing >>>>>>>>> them? I think it is true. It, then, enforces end-users to do some >>>>>>>>> validation first, similar to handling stale data. >>>>>>>> >>>>>>>> I think that what you were trying to say is that when at time T0 >>>>>>>> somebody lists snapshots and at T1 somebody removes a snapshot >>>>>>>> manually then the list of snapshots is not actual anymore? Sure. That >>>>>>>> is a thing. This is how it currently works. >>>>>>>> >>>>>>>> Now, we want to cache them, so if you clear a snapshot which is not >>>>>>>> physically there because somebody removed it manually, that should be >>>>>>>> a no-op, it will be just removed from the internal tracker. So, if it >>>>>>>> is at disk and in cache and you clear it, then all is fine. It is fine >>>>>>>> too if it is not on disk anymore and you clear it, then it is just >>>>>>>> removed internally. It would fail only in case you want to remove a >>>>>>>> snapshot which is not cached, regardless whether it is on disk or not. >>>>>>>> The deletion of non-existing snapshot ends up with a failure, nothing >>>>>>>> should be changed in that regard, this is the current behavior too. >>>>>>>> >>>>>>>> I want to say that I did not write it completely correctly at the very >>>>>>>> beginning of this thread. Currently, we are caching only _expiring_ >>>>>>>> snapshots, because we need to know what is their time of removal so we >>>>>>>> act on it later. _normal_ snapshots are _not_ cached _yet_. I spent so >>>>>>>> much time with 18111 that I live in a reality where it is already in, >>>>>>>> I forgot this is not actually in place yet, we are very close to that. >>>>>>>> >>>>>>>> OK thank you all for your insights, I will NOT use inotify. >>>>>>>> >>>>>>>>> >>>>>>>>> Just my 2 cents. >>>>>>>>> >>>>>>>>> - Yifan >>>>>>>>> >>>>>>>>> On Wed, Aug 7, 2024 at 6:03 AM Štefan Miklošovič >>>>>>>>> <smikloso...@apache.org> wrote: >>>>>>>>>> Yes, for example as reported here >>>>>>>>>> >>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-13338 >>>>>>>>>> >>>>>>>>>> People who are charting this in monitoring dashboards might also hit >>>>>>>>>> this. >>>>>>>>>> >>>>>>>>>> On Wed, Aug 7, 2024 at 2:59 PM J. D. Jordan >>>>>>>>>> <jeremiah.jor...@gmail.com> wrote: >>>>>>>>>>> If you have a lot of snapshots and have for example a metric >>>>>>>>>>> monitoring them and their sizes, if you don’t cache it, creating >>>>>>>>>>> the metric can cause performance degradation. We added the cache >>>>>>>>>>> because we saw this happen to databases more than once. >>>>>>>>>>> >>>>>>>>>>> > On Aug 7, 2024, at 7:54 AM, Josh McKenzie <jmcken...@apache.org> >>>>>>>>>>> > wrote: >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> >> >>>>>>>>>>> >> Snapshot metadata are currently stored in memory / they are >>>>>>>>>>> >> cached so we do not need to go to disk every single time we want >>>>>>>>>>> >> to list them, the more snapshots we have, the worse it is. >>>>>>>>>>> > Are we enumerating our snapshots somewhere on the hot path, or is >>>>>>>>>>> > this performance concern misplaced? >>>>>>>>>>> > >>>>>>>>>>> >> On Wed, Aug 7, 2024, at 7:44 AM, Štefan Miklošovič wrote: >>>>>>>>>>> >> Snapshot metadata are currently stored in memory / they are >>>>>>>>>>> >> cached so we do not need to go to disk every single time we want >>>>>>>>>>> >> to list them, the more snapshots we have, the worse it is. >>>>>>>>>>> >> >>>>>>>>>>> >> When a snapshot is _manually_ removed from disk, not from >>>>>>>>>>> >> nodetool clearsnapshot, just by rm -rf on a respective snapshot >>>>>>>>>>> >> directory, then such snapshot will be still visible in nodetool >>>>>>>>>>> >> listsnapshots. Manual removal of a snapshot might be done e.g. >>>>>>>>>>> >> by accident or by some "impatient" operator who just goes to >>>>>>>>>>> >> disk and removes it there instead of using nodetool or >>>>>>>>>>> >> respective JMX method. >>>>>>>>>>> >> >>>>>>>>>>> >> To improve UX here, what I came up with is that we might use >>>>>>>>>>> >> Java's WatchService where each snapshot dir would be registered. >>>>>>>>>>> >> WatchService is part of Java, it uses inotify subsystem which is >>>>>>>>>>> >> what Linux kernel offers. The result of doing it is that once a >>>>>>>>>>> >> snapshot dir is registered to be watched and when it is removed >>>>>>>>>>> >> then we are notified about that via inotify / WatchService so we >>>>>>>>>>> >> can react on it and remove the in-memory representation of that >>>>>>>>>>> >> so it will not be visible in the output anymore. >>>>>>>>>>> >> >>>>>>>>>>> >> While this works, there are some questions / concerns >>>>>>>>>>> >> >>>>>>>>>>> >> 1) What do people think about inotify in general? I tested this >>>>>>>>>>> >> on 10k snapshots and it seems to work just fine, nevertheless >>>>>>>>>>> >> there is in general no strong guarantee that every single event >>>>>>>>>>> >> will come through, there is also a family of kernel parameters >>>>>>>>>>> >> around this where more tuning can be done etc. It is also >>>>>>>>>>> >> questionable how this will behave on other systems from Linux >>>>>>>>>>> >> (Mac etc). While JRE running on different platforms also >>>>>>>>>>> >> implements this, I am not completely sure these implementations >>>>>>>>>>> >> are quality-wise the same as for Linux etc. There is a history >>>>>>>>>>> >> of not-so-quality implementations for other systems (events not >>>>>>>>>>> >> coming through on Macs etc) and while I think we are safe on >>>>>>>>>>> >> Linux, I am not sure we want to go with this elsewhere. >>>>>>>>>>> >> >>>>>>>>>>> >> 2) inotify brings more entropy into the codebase, it is another >>>>>>>>>>> >> thing we need to take care of etc (however, it is all >>>>>>>>>>> >> concentrated in one class and pretty much "isolated" from >>>>>>>>>>> >> everything else) >>>>>>>>>>> >> >>>>>>>>>>> >> I made this feature optional and it is turned off by default so >>>>>>>>>>> >> people need to explicitly opt-in into this so we are not forcing >>>>>>>>>>> >> it on anybody. >>>>>>>>>>> >> >>>>>>>>>>> >> If we do not want to go with inotify, another option would be to >>>>>>>>>>> >> have a background thread which would periodically check if a >>>>>>>>>>> >> manifest exists on a disk, if it does not, then a snapshot does >>>>>>>>>>> >> not either. While this works, what I do not like about this is >>>>>>>>>>> >> that the primary reason we moved it to memory was to bypass IO >>>>>>>>>>> >> as much as possible yet here we would introduce another check >>>>>>>>>>> >> which would go to disk, and this would be done periodically, >>>>>>>>>>> >> which beats the whole purpose. If an operator lists snapshots >>>>>>>>>>> >> once a week and there is a background check running every 10 >>>>>>>>>>> >> minutes (for example), then the cummulative number of IO >>>>>>>>>>> >> operations migth be bigger than us just doing nothing at all. >>>>>>>>>>> >> For this reason, if we do not want to go with inotify, I would >>>>>>>>>>> >> also not implement any automatic background check. Instead of >>>>>>>>>>> >> that, there would be SnapshotManagerMbean#refresh() method which >>>>>>>>>>> >> would just forcibly reload all snapshots from scratch. I think >>>>>>>>>>> >> that manual deletion of snapshots is not something a user would >>>>>>>>>>> >> do regularly, snapshots are meant to be removed via nodetool or >>>>>>>>>>> >> JMX. If manual removal ever happens then in order to make it >>>>>>>>>>> >> synchronized again, the refreshing of these snapshots would be >>>>>>>>>>> >> required. There might be an additional flag in nodetool >>>>>>>>>>> >> listsnapshots, once specified, refreshing would be done, >>>>>>>>>>> >> otherwise not. >>>>>>>>>>> >> >>>>>>>>>>> >> How does this all sound to people? >>>>>>>>>>> >> >>>>>>>>>>> >> Regards >>>>>>>>>>> > >>>>>>>