Giannis Economou wrote at about 23:18:25 +0200 on Thursday, March 16, 2023:
> Thank you for the extra useful "find" command.
> I see from your command that those extra collisions are not 32-char
> files but 35 chars, due to the "ext" (extension) appended without a
> "_" separator.
>
> Based on that, I did my investigation.
>
> Some findings below...
>
> A. First BackupPC server:
> This is a backup server running since Jan 2023 (it replaced other bpc
> servers running for years).
> Status now:
> - Pool is 10311.16GiB comprising 50892587 files and 16512 directories
>   (as of 2023-03-16 13:23),
> - Pool hashing gives 5 repeated files with longest chain 1,
> - Nightly cleanup removed 3100543 files of size 1206.32GiB (around
>   2023-03-16 13:23),
>
> From those 5 cpool files, I investigated two of them (using the nice
> wiki page here:
> https://github.com/backuppc/backuppc/wiki/How-to-find-which-backups-reference-a-particular-pool-file)
> Each of the two files under investigation was referenced in different
> hosts and different individual backups.
>
> For both cpool files investigated, the actual file on disk was *exactly*
> the same file.
> So we are talking about a file on different hosts, in different paths,
> from different dates, but with the same filename and the same content.
> So practically I understand that we have de-duplications that did not
> happen for some reason.
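One quick way to double-check that finding is to decompress each member
of a chain and hash the plaintext: identical hashes mean the chain holds
byte-identical duplicates (i.e., failed de-duplication) rather than an
md5 collision. A minimal sketch, untested -- the digest below is a
placeholder, and TOPDIR=/var/lib/backuppc plus BackupPC_zcat being on
your PATH are assumptions about your install:

    # placeholder digest -- substitute one of your 5 repeated files
    digest=123456789abcdef0123456789abcdef0
    # v4 layout (if I read it right): the two directory levels are the
    # first two digest bytes with the low bit cleared, so this digest
    # lands in cpool/12/34
    cd /var/lib/backuppc/cpool/12/34
    for f in "${digest}"*; do
        BackupPC_zcat "$f" | md5sum   # identical output => identical content
    done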
Either there is some rare edge case or condition in BackupPC causing
de-duplication to fail (which I think is unlikely), or perhaps at some
point your BackupPC instance crashed in the middle of saving one of the
above pool files, leaving it in an indeterminate state that caused a
duplicate to be created in the next backup.

Did you by any chance change the compression level? (Though I don't
think that would break de-duplication.)

What is the creation date of each of the chain elements? My Occam's
razor guess would be that all 5 were created as part of the same backup.

> B. Second BackupPC server:
> This is a brand new backup server, running for only 3 days as of
> today, not even a week old (it also replaced other bpc servers running
> for years).
> Therefore we only have a few backups there so far; more will accumulate
> as the days go by.
> Status now:
> - Pool is 2207.69GiB comprising 26161662 files and 12384 directories
>   (as of 2023-03-16 13:11),
> - Pool hashing gives 275 repeated files with longest chain 1,
> - Nightly cleanup removed 0 files of size 0.00GiB (around 2023-03-16 13:11),
>
> What I see here is a different situation...
>
> 1. The count of files found by "find" differs from the collisions
>    reported on the status page (find counts 231, the bpc status page
>    counts 275).
> 2. Many pool files in the list from "find" are not referenced at all
>    in poolCnt files (they give a count of 0 in their "poolCnt").
> 3. Even worse, many cpool/*/poolCnt files are completely missing from
>    the cpool/.
>    For example, I have a collision file:
>    cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701
>    but the file cpool/ce/poolCnt does not even exist.
>
> Maybe refCount is not complete yet (because the server is still new)?
>
> One guess is that this might happen because I have:
> BackupPCNightlyPeriod: 4
> PoolSizeNightlyUpdatePeriod: 16
> PoolNightlyDigestCheckPercent: 1
> and the server is still new.
>
> I will wait for a few days.
> (BTW, storage is local and everything seems error-free from day 1 on
> this system.)
>
> If such inconsistencies persist, I guess I will have to investigate
> "BackupPC_fsck" for this server.

As you suggest, it's really hard to know whether there is an issue when
you have not yet run a full BackupPC_nightly. That being said, why do
you divide it up over so many days? Is your backup set so *large* that
it can't complete in fewer nights?

> On 16/3/2023 8:31 p.m., backu...@kosowsky.org wrote:
> > Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023:
> > > In my v4 pool, are collisions still generating _0, _1, _2, etc.
> > > filenames in the pool/?
> >
> > According to the code in Lib.pm, it appears that unlike v3, there is
> > no underscore -- it's just an (unsigned) long added to the end of the
> > 16-byte digest.
> >
> > > (as in the example the docs mention:
> > > __TOPDIR__/pool/1/2/3/123456789abcdef0
> > > __TOPDIR__/pool/1/2/3/123456789abcdef0_0
> > > __TOPDIR__/pool/1/2/3/123456789abcdef0_1
> > > )
> >
> > That is for v3, as indicated by the 3-layer pool.
> >
> > > I am using compression (I only have a cpool/ dir) and I am asking
> > > because on both servers running:
> > > find cpool/ -name "*_0" -print
> > > find cpool/ -name "*_*" -print
> > > brings zero results.
> >
> > Try:
> >
> > find /var/lib/backuppc/cpool/ -type f -regextype grep \
> >     ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt"
> >
> > > Thank you.
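Coming back to my creation-date question above: since v4 chain members
share the first 32 hex digits and differ only in the appended ext, a
single glob catches the whole chain. A sketch using the server-B example
digest from your message (the same check applies to the 5 chains on
server A), assuming GNU find:

    # print mtime + path for every member of one chain, oldest first
    find /var/lib/backuppc/cpool/ce/b0 -maxdepth 1 -type f \
         -name 'ceb1e6764a28b208d51a7801052118d7*' \
         -printf '%TY-%Tm-%Td %TH:%TM  %p\n' | sort

If all of the timestamps fall inside a single backup window, that would
support the same-backup guess.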
> > > On 16/3/2023 6:30 p.m., backu...@kosowsky.org wrote:
> > > > Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
> > > > > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
> > > > > >
> > > > > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
> > > > > > > There is no reason to be concerned. This is normal.
> > > > > >
> > > > > > It *should* be extremely, once-in-a-blue-moon rare to randomly
> > > > > > have an md5sum collision -- as in 1.47*10^-29.
> > > > >
> > > > > Why are you assuming this is "randomly" happening? Any time an
> > > > > identical file exists in more than one place on the client
> > > > > filesystem, there will be a collision. This is common in lots of
> > > > > cases. Desktop environments frequently have duplicated files
> > > > > scattered around. I used BackupPC for website backups; my chain
> > > > > length was approximately equal to the number of WordPress sites
> > > > > I was hosting.
> > > >
> > > > You are simply not understanding how file de-duplication and pool
> > > > chains work in v4.
> > > >
> > > > Identical files contribute only a single chain instance -- no
> > > > matter how many clients you are backing up and no matter how many
> > > > backups you save of each client. This is what de-duplication does.
> > > >
> > > > The fact that they appear on different clients and/or in different
> > > > parts of the filesystem is reflected in the attrib files in the pc
> > > > subdirectories for each client. This is where the metadata is
> > > > stored.
> > > >
> > > > Chain lengths have to do with pool storage of the file contents
> > > > (ignoring metadata). Lengths greater than 1 only occur if you have
> > > > md5sum hash collisions -- i.e., two files (no matter on what
> > > > client or where in the filesystem) with non-identical contents but
> > > > the same md5sum hash.
> > > >
> > > > Such collisions are statistically exceedingly unlikely to occur on
> > > > normal data where you haven't worked hard to create them.
> > > >
> > > > For example, on my backup server:
> > > > Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1
> > > > directories (as of 2023-03-16 01:11),
> > > > Pool hashing gives 0+0 repeated files with longest chain 0+0,
> > > >
> > > > I strongly suggest you read the documentation on BackupPC before
> > > > making wildly erroneous assumptions about chains. You can also
> > > > look at the code in BackupPC_refCountUpdate, which defines how
> > > > $fileCntRep and $fileCntRepMax are calculated.
> > > >
> > > > Also, if what you said were true, the OP would have multiple
> > > > chains -- presumably one for each distinct file that is "scattered
> > > > around".
> > > >
> > > > If you are using v4.x and have pool hashing with such collisions,
> > > > it would be great to see them. I suspect you are either using v3
> > > > or you are using v4 with a legacy v3 pool.
> > > >
> > > > > You would have to work hard to artificially create such
> > > > > collisions.
> > > > > $ echo 'hello world' > ~/file_a
> > > > > $ cp ~/file_a ~/file_b
> > > > > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH"
> > > > >
> > > > > Rob Sheldon
> > > > > Contract software developer, devops, security, technical lead

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    https://github.com/backuppc/backuppc/wiki
Project: https://backuppc.github.io/backuppc/