Hi, I'd like to ask for advice about a problem I've noticed recently with the TarMK backup.
At its core, the TarMK backup relies on a regular content diff. The first backup doesn't find anything and copies all nodes over; the second and later backups diff the content and apply the changes incrementally. One optimization of the TarMK diff is to check whether the segment ids of two node states are the same, which makes for a really fast compareTo method. Together these two make for a fast, incremental backup. So far so good.

The problem I experienced appears when there are enough content writes to trigger a segment flush: the same node, even if unchanged, ends up in a different segment, and so has a different segment id. The backup then fails to fast-match the node states and falls back to traversing the content to match and apply changes, except there are none.

Over time more and more segments are created and, as far as I can see, nodes that have no changes migrate to different segments. All these migrations are seen as changes and generate content traversals. The reason this escalates is that the incremental backup never updates the segment ids on the target instance; it only looks at content. So an incremental backup reports more and more changes and traverses the repo content simply because the segments get restructured.

Thoughts?

Thanks,
Alex
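
P.S. To make the failure mode concrete, here is a toy Python model of what I think is happening. It is not Oak code: NodeState, rewrite and diff are made-up stand-ins, and identifying a record by segment id plus offset is my simplification of the fast check. The point is only that an unchanged tree copied into a new segment loses the fast path and gets fully traversed, even though the diff finds no changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Toy node record, identified by its segment id plus an offset (assumption)."""
    segment_id: int
    offset: int
    value: str
    children: tuple = ()  # (name, NodeState) pairs

    @property
    def record_id(self):
        return (self.segment_id, self.offset)


def rewrite(node, segment_id, counter=None):
    """Copy a subtree, content unchanged, into another segment."""
    counter = counter if counter is not None else [0]
    offset = counter[0]
    counter[0] += 1
    kids = tuple((name, rewrite(child, segment_id, counter))
                 for name, child in node.children)
    return NodeState(segment_id, offset, node.value, kids)


def diff(before, after, stats):
    """Return True if content differs; count nodes compared by content."""
    if before.record_id == after.record_id:
        return False  # fast path: same record, whole subtree is skipped
    stats["traversed"] += 1  # slow path: compare this node by content
    changed = before.value != after.value
    b_kids, a_kids = dict(before.children), dict(after.children)
    if b_kids.keys() != a_kids.keys():
        changed = True
    for name in b_kids.keys() & a_kids.keys():
        if diff(b_kids[name], a_kids[name], stats):
            changed = True
    return changed


if __name__ == "__main__":
    # State the backup saw last time: everything lives in segment 1.
    seen = NodeState(1, 0, "root", (("a", NodeState(1, 1, "x")),
                                    ("b", NodeState(1, 2, "y"))))
    # Same content, but the records have migrated to segment 2.
    migrated = rewrite(seen, segment_id=2)

    for label, current in (("same segment", seen), ("migrated", migrated)):
        stats = {"traversed": 0}
        changed = diff(seen, current, stats)
        print(label, "changed:", changed, "traversed:", stats["traversed"])
```

In this model the first comparison skips everything, while the second one visits every node and still reports no change. Since the remembered state keeps its old ids, the same traversal would repeat on every later run.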
