> "force" can be true / false. When true, then it will not check if a snapshot > exists
My mental node for “force” was only to deal with down nodes, and nothing to do with snapshots… so this feels like a weird behavior @Option(title = "force", name = {"-force", "--force"}, description = "Use -force to filter out down endpoints”) Our documentation even only calls out dealing with downed enodes… > a) we will remove existing snapshot and take a new one So this is racy and can make repairs brittle. The snapshot is at the RepairSession (global) or RepairJob level (a subset of ranges for a single table, different RepairJobs can work on the same table, just different ranges). With this code path you also have 1 more variable at play that makes things complex: isGlobal public boolean isGlobal() {return dataCenters.isEmpty() && hosts.isEmpty();} Global does the full range, where as non-global covers the partial range So when you do host or dc this snapshot is isolated to the RepairJob and not the session logically, but physically its isolated to the session; which doesn’t make any sense. > 2) if we do not want to fix "force", then it is most probably just redundant > and we would remove it but it order to not fail, we would need to go back to > naming snapshots not parent id / parent id but parent id / session id for > global and non-global repair respectively (basically we would return to > pre-14116 behavior). org.apache.cassandra.repair.RepairJobDesc#determanisticId This is scoped to a single RepairJob, so wouldn’t have the issues with concurrency. So if we were going to alter what the snapshot name is, I would strongly prefer this (its also present in the repair vtable, so not some hidden UUID) I personally feel the lowest risk is to switch to the job name and away from the session name we had… I do wonder about removing this “force” semantic as its not documented to users as far as I can tell, so is there any place that defines this behavior? > we might just fix the trunk and keep the rest broken, not ideal but With some patches its hard to tell if something is a bug or a feature… so altering the semantic is that making an improvement to allow better combinations of repair options, or is it fixing a bug (aka worthy of a back port)…. I know most users do parallel repair (default) so this snapshot logic doesn’t impact most users, but its also why its been so long before this issue got noticed… so coverage around snapshots currently isn’t the best. Sorry for this reply, its less of an answer and more a dump while I think through the problem you stated…. > On May 12, 2025, at 9:18 AM, Štefan Miklošovič <smikloso...@apache.org> wrote: > > Hello, > > I am looking for advice for (1). There is quite a quality archaeology to go > through. I have asked around already (thanks for that!) but nobody seems to > ultimately remember / tell what it is for. > > There is (2). "force" can be true / false. When true, then it will not check > if a snapshot exists. If it exists and "force" is true, this has never worked > - that is what CASSANDRA-20490 is for. It would try to hard link twice and it > would error out. > > This "force" logic is called here (3). If ParentRepairSession (prs) is > global, "force" is false, but if it is not (not global means that some dc was > specified), then it is true. > > Notice that both branches of that if/else use desc.parentSessionId.toString() > as snapshots' name. > > This "force" logic was introduced in (4) but it seems to me that it never > worked. AFAICT this was just broken on delivery. > > One commit before (4), it looked like (5). The first branch of "if" used > parentSessionId as a snapshot name and it checked if it exists or not before > taking it, the else branch took sessionId (not parent session id) and took a > snapshot of that name without any check if it exists or not. > > From javadoc on TableRepairManager, "force" means: (introduced in (4) > CASSANDRA-14116) > > "If force is true, a snapshot should be taken even if one already exists with > that name.". > > Questions: > > 1) do we want to fix the behaviour of "force"? (patch 20490) > if the answer is yes, what is the "semantics" about this. What does "force" > mean exactly? It can mean > a) we will remove existing snapshot and take a new one > b) we will add more SSTables onto existing snapshot, we just hardlink more > SSTables into same snapshot (that is what 20490 is doing) > > 2) if we do not want to fix "force", then it is most probably just redundant > and we would remove it but it order to not fail, we would need to go back to > naming snapshots not parent id / parent id but parent id / session id for > global and non-global repair respectively (basically we would return to > pre-14116 behavior). > > I think the introduction of "force" flag was there because there might be > multiple calls to "else" branch in RepairMessageVerbHandler and calling it > all over again with parentSessionId would be erroneous when already there, so > "force" was there for not caring and just taking a snapshot again. If we go > back to sessionId only, then we do not have this problem I guess. > > Is it important for people to keep the name of a snapshot in both cases > (global / not global) equal to parentSessionId? Can we just go back? > > If we want to keep naming snapshots as done now, we might just do 20490, I > was trying to fix older branches as well but it is very difficult, we might > just fix the trunk and keep the rest broken, not ideal but ... I am just > reinventing a wheel in 4.0+, it is just too cumbersome / complex to copy > trunk's behaviour in 20490 to older stuff. > > Thanks and regards > > (1) https://issues.apache.org/jira/browse/CASSANDRA-20490 > (2) > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/repair/CassandraTableRepairManager.java#L84 > (3) > https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L185-L192 > (4) https://issues.apache.org/jira/browse/CASSANDRA-14116 > (5) > https://github.com/apache/cassandra/blob/478c1a9fdf027af0f373f9e26521803ae8775fdb/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L106-L121