Has anyone considered simply updating the documentation to say this?

"Removing the snapshot files directly from the filesystem may break things. Always use the `nodetool` command or JMX to remove snapshots."

On 09/08/2024 09:18, Štefan Miklošovič wrote:
If we consider caching it all to be too much, we could make caching an option an admin would need to opt in to. There might be a flag in cassandra.yaml; once enabled, snapshot metadata would be kept in memory, otherwise it would be loaded from disk as it is now, so people can decide whether caching is enough for them or they want the previous behavior (the default would be the previous behavior). This puts additional complexity into SnapshotManager, but it should in general be doable.
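
To illustrate, a rough sketch of what I have in mind (the flag name, accessor and loading method are illustrative, not from the actual patch):

    // hypothetical cassandra.yaml entry, defaulting to the current behavior:
    //   snapshot_metadata_cache_enabled: false

    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    public class SnapshotManager
    {
        // illustrative accessor; real plumbing would go through DatabaseDescriptor
        private final boolean cacheEnabled = DatabaseDescriptor.getSnapshotMetadataCacheEnabled();
        private final Map<String, TableSnapshot> cache = new ConcurrentHashMap<>();

        public Collection<TableSnapshot> listSnapshots()
        {
            // opt-in: serve from memory; otherwise walk the disk as we do today
            return cacheEnabled ? cache.values() : loadSnapshotsFromDisk();
        }
    }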

Let me know what you think, I would really like to have this resolved, 18111 brings a lot of code cleanup and simplifies stuff a lot.

On Wed, Aug 7, 2024 at 11:30 PM Josh McKenzie <jmcken...@apache.org> wrote:

    If you have a lot of snapshots and have for example a metric
    monitoring them and their sizes, if you don’t cache it, creating
    the metric can cause performance degradation. We added the cache
    because we saw this happen to databases more than once.
    I mean, I believe you, I'm just surprised that querying file
    metadata and doing basic computation leads to hundreds-of-ms
    pause times even on systems with a lot of files. Aren't most /
    all of these values cached at the filesystem layer, so we're
    basically just choosing tomato / tomahto between caching
    systems, either one we maintain or one the OS maintains?

    Or is there really just a count of files well outside what I'm
    thinking here?

    Anyway, not trying to cause a ruckus and make needless noise,
    trying to learn. ;)


    On Wed, Aug 7, 2024, at 3:20 PM, Štefan Miklošovič wrote:


    On Wed, Aug 7, 2024 at 6:39 PM Yifan Cai <yc25c...@gmail.com> wrote:

        With WatchService, when events are missed (which is to be
        expected), you will still need to list the files. It seems to
        me that WatchService doesn't offer significant benefits in
        this case.


    Yeah, I think we will leave it out eventually.


        Regarding listing directories with a refresh flag, my concern
        is the potential for abuse. End-users might just always
        refresh before listing, which could undermine the purpose of
        caching. Perhaps Jeremiah can provide more insight on this.


    Well, by default, it would not be refreshed every single time;
    you would need to opt in to that. If there is a shop whose users
    have direct access to the disks of Cassandra nodes and they are
    removing data manually, I do not know what to say. What are
    nodetool clearsnapshot and the JMX methods good for then? I do
    not think we can prevent people from shooting themselves in the
    foot if they are absolutely determined to do that.

    If they want to refresh that every time, that would be equal to
    the current behavior. It would be at most as "bad" as it is now.
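
    In code, the idea is roughly this (a sketch only; the flag
    plumbing and method names are illustrative):

        // backs nodetool listsnapshots, with a hypothetical opt-in refresh flag
        public List<TableSnapshot> listSnapshots(boolean refresh)
        {
            if (refresh)
                reloadFromDisk(); // same IO cost as the current behavior
            return List.copyOf(cache.values()); // otherwise served from memory
        }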

        IMO, caching is best handled internally. I have a few
        UX-related questions:
        - Is it valid or acceptable to return stale data? If so,
        end-users have to do some form of validation before consuming
        each snapshot to account for potential deletions.


    answer below

        - Even if listsnapshots returns the most recent data, is it
        possible that some of the directories get deleted while
        end-users are accessing them? I think it is. That then
        forces end-users to do some validation first, similar to
        handling stale data.


    I think what you were trying to say is that when somebody lists
    snapshots at time T0 and somebody removes a snapshot manually at
    T1, then the list of snapshots is no longer accurate? Sure, that
    is a thing. This is how it currently works.

    Now, we want to cache them. So if you clear a snapshot which is
    not physically there because somebody removed it manually, that
    should be a no-op; it will just be removed from the internal
    tracker. If it is on disk and in the cache and you clear it,
    all is fine. It is also fine if it is no longer on disk and you
    clear it; it is then just removed internally. It would fail only
    if you tried to remove a snapshot which is not cached, regardless
    of whether it is on disk or not. The deletion of a non-existent
    snapshot ends in failure; nothing should change in that regard,
    as this is the current behavior too.
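
    A sketch of these semantics (method names are illustrative):

        public void clearSnapshot(String tag)
        {
            TableSnapshot snapshot = cache.remove(tag);
            if (snapshot == null) // not cached: fail, as we do today
                throw new IllegalArgumentException("Snapshot " + tag + " does not exist");
            // if somebody already removed the files manually, there is
            // nothing left on disk and dropping the cache entry is enough
            if (snapshot.exists()) // illustrative on-disk existence check
                deleteSnapshotFromDisk(snapshot);
        }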

    I want to say that I did not write it completely correctly at the
    very beginning of this thread. Currently, we are caching only
    _expiring_ snapshots, because we need to know their time of
    removal so we can act on it later. _normal_ snapshots are _not_
    cached _yet_. I spent so much time with 18111 that I live in a
    reality where it is already in; I forgot this is not actually in
    place yet, but we are very close to that.
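
    For context, the expiring ones are cached so that their removal
    can be scheduled, roughly along these lines (simplified; method
    names are approximate):

        // a ScheduledExecutorService fires the deletion at expiration time
        long delayMillis = snapshot.getExpiresAt().toEpochMilli() - System.currentTimeMillis();
        executor.schedule(() -> clearSnapshot(snapshot.getTag()),
                          Math.max(0, delayMillis), TimeUnit.MILLISECONDS);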

    OK, thank you all for your insights, I will NOT use inotify.


        Just my 2 cents.

        - Yifan

        On Wed, Aug 7, 2024 at 6:03 AM Štefan Miklošovič
        <smikloso...@apache.org> wrote:

            Yes, for example as reported here

            https://issues.apache.org/jira/browse/CASSANDRA-13338

            People who are charting this in monitoring dashboards
            might also hit this.

            On Wed, Aug 7, 2024 at 2:59 PM J. D. Jordan
            <jeremiah.jor...@gmail.com> wrote:

                If you have a lot of snapshots and have for example a
                metric monitoring them and their sizes, if you don’t
                cache it, creating the metric can cause performance
                degradation. We added the cache because we saw this
                happen to databases more than once.

                > On Aug 7, 2024, at 7:54 AM, Josh McKenzie
                <jmcken...@apache.org> wrote:
                >
                > 
                >>
                >> Snapshot metadata are currently stored in memory /
                they are cached so we do not need to go to disk every
                single time we want to list them, the more snapshots
                we have, the worse it is.
                > Are we enumerating our snapshots somewhere on the
                hot path, or is this performance concern misplaced?
                >
                >> On Wed, Aug 7, 2024, at 7:44 AM, Štefan Miklošovič
                wrote:
                >> Snapshot metadata are currently stored in memory /
                they are cached so we do not need to go to disk every
                single time we want to list them, the more snapshots
                we have, the worse it is.
                >>
                >> When a snapshot is _manually_ removed from disk,
                not via nodetool clearsnapshot but just by rm -rf on
                the respective snapshot directory, then such a
                snapshot will still be visible in nodetool
                listsnapshots. Manual removal of a snapshot might
                happen e.g. by accident or by some "impatient"
                operator who just goes to disk and removes it there
                instead of using nodetool or the respective JMX
                method.
                >>
                >> To improve UX here, what I came up with is that we
                might use Java's WatchService, with which each
                snapshot dir would be registered. WatchService is
                part of Java; it uses the inotify subsystem which
                the Linux kernel offers. The result is that once a
                snapshot dir is registered to be watched and it is
                then removed, we are notified about it via inotify /
                WatchService, so we can react and remove the
                in-memory representation, and it will no longer be
                visible in the output.
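                >>
                >> Roughly, a minimal sketch of the idea (error
                handling omitted; removeFromCache is illustrative):
                >>
                >> import java.nio.file.*;
                >> import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
                >>
                >> void watchSnapshots(Path snapshotsDir) throws Exception
                >> {
                >>     WatchService watcher = FileSystems.getDefault().newWatchService();
                >>     // register the parent "snapshots" dir so we get an event
                >>     // when a snapshot directory inside it is deleted
                >>     snapshotsDir.register(watcher, ENTRY_DELETE);
                >>     while (true)
                >>     {
                >>         WatchKey key = watcher.take(); // blocks until events arrive
                >>         for (WatchEvent<?> event : key.pollEvents())
                >>         {
                >>             Path removed = (Path) event.context(); // relative name
                >>             removeFromCache(removed.toString()); // illustrative eviction
                >>         }
                >>         key.reset(); // re-arm the key for further events
                >>     }
                >> }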
                >>
                >> While this works, there are some questions / concerns
                >>
                >> 1) What do people think about inotify in general?
                I tested this on 10k snapshots and it seems to work
                just fine; nevertheless, there is in general no
                strong guarantee that every single event will come
                through, and there is a family of kernel parameters
                around this where more tuning can be done, etc. It
                is also questionable how this will behave on systems
                other than Linux (Mac etc.). While JREs on other
                platforms also implement this, I am not completely
                sure those implementations are quality-wise on par
                with the Linux one. There is a history of
                not-so-great implementations for other systems
                (events not coming through on Macs etc.), and while
                I think we are safe on Linux, I am not sure we want
                to go with this elsewhere.
                >>
                >> 2) inotify brings more entropy into the codebase;
                it is another thing we need to take care of, etc.
                (However, it is all concentrated in one class and
                pretty much "isolated" from everything else.)
                >>
                >> I made this feature optional and it is turned off
                by default, so people need to explicitly opt in to
                it; we are not forcing it on anybody.
                >>
                >> If we do not want to go with inotify, another
                option would be to have a background thread which
                periodically checks whether a manifest exists on
                disk; if it does not, then the snapshot does not
                either. While this works, what I do not like about
                it is that the primary reason we moved this into
                memory was to bypass IO as much as possible, yet
                here we would introduce another check which goes to
                disk, and this would be done periodically, which
                defeats the whole purpose. If an operator lists
                snapshots once a week and there is a background
                check running every 10 minutes (for example), then
                the cumulative number of IO operations might be
                bigger than us just doing nothing at all. For this
                reason, if we do not want to go with inotify, I
                would also not implement any automatic background
                check. Instead, there would be a
                SnapshotManagerMbean#refresh() method which would
                just forcibly reload all snapshots from scratch. I
                think that manual deletion of snapshots is not
                something a user would do regularly; snapshots are
                meant to be removed via nodetool or JMX. If manual
                removal ever happens, then in order to get things
                synchronized again, refreshing the snapshots would
                be required. There might be an additional flag for
                nodetool listsnapshots; once specified, refreshing
                would be done, otherwise not.
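                >>
                >> The MBean addition itself would be tiny,
                something like this sketch:
                >>
                >> public interface SnapshotManagerMbean
                >> {
                >>     // forcibly reload all snapshot metadata from disk,
                >>     // discarding whatever is currently held in memory
                >>     void refresh();
                >> }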
                >>
                >> How does this all sound to people?
                >>
                >> Regards
                >
