Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-17 Thread Giannis Economou



Either there is some rare edge case or condition in BackupPC causing
de-duplication to fail (which I think is unlikely), or perhaps at some
point your backuppc instance crashed in the middle of saving one of
the above pool files, leaving it in some indeterminate state that
caused it to create a duplicate in the next backup.

Did you by any chance change the compression level? (though I don't
think this will break de-duplication)

What is the creation date of each of the chain elements? My Occam's
razor guess would be that all 5 were created as part of the same backup.



We did change the compression level (but before doing that we checked 
the documentation, and different compression levels are supposed to be 
fully compatible).


Yes, all those cpool files seem to be from the same wakeup/run; 
here are their dates:
-rw-r- 1 backuppc backuppc 396 Jan  6 01:05 
cpool/e8/c4/e9c4eb5dbc31a6b94c8227921fe7f91001
-rw-r- 1 backuppc backuppc 332 Jan  6 01:05 
cpool/80/ac/80ad24bf7a216782015f19937ca2e0c801
-rw-r- 1 backuppc backuppc 337 Jan  6 01:21 
cpool/f4/5a/f45b2ef773ddec67f7de7a8097ec636901
-rw-r- 1 backuppc backuppc 283 Jan  6 01:05 
cpool/e2/78/e2797a05cc323d392bec223f4fc3d1c501
-rw-r- 1 backuppc backuppc 292 Jan  6 01:21 
cpool/52/20/52214a481a6dad3ec037c73bae5518bd01


Actually this server first started on 5 Jan, so the 5 Jan and 6 Jan runs 
were among the first backups that this server took.
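
(For reference, a rough way to list those chain members together with their
timestamps -- a sketch only, assuming GNU find and that chain members are the
pool file names longer than 32 hex characters; run from the BackupPC TopDir:)

    cd /var/lib/backuppc     # hypothetical TopDir, adjust to your install
    find cpool/ -type f -regextype grep -regex '.*/[0-9a-f]\{33,\}' \
         -printf '%TY-%Tm-%Td %TH:%TM  %s  %p\n'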




As you suggest, it's really hard to know whether there is an issue
when you have not yet run a full BackupPC_nightly.

That being said, why do you divide it up over so many days? Is your
backup set that *large* that it can't complete in fewer nights?


Yes, I will give it a few days or weeks and check again, waiting first 
for the initial refCount pass to complete.


The 2nd backup server's sizing stats, once in a "steady" state, will be 
much like those of the 1st server.
This is about 70-80 hosts using about 10 TB (BackupPC data directory 
size), keeping about 1300 backups in total (fulls and incrementals).


We divide our nightly over so many days because we run the nightly during 
work hours and the actual backups at night. We have one wakeup in the middle 
of the day that allows only the nightly to run; all real backups are at night.
When the nightly is running, BackupPC gets much slower, so a restore that is 
needed during that window can be a pain.
Running a restore of a few GB (not many) while the nightly is running (during 
work hours) can be really slow due to the heavy I/O.
We expect our storage even at steady state to be about 50-60% full, so 
there is no rush to remove files.
Having fast restores on demand during business hours is definitely the 
priority here, over freeing disk space a few days earlier.
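
(Side note on the scheduling: per the BackupPC docs, BackupPC_nightly is
started at the first entry of $Conf{WakeupSchedule}, so the midday wakeup
would typically be the first entry in that list. A quick way to confirm --
the config.pl path below is a common default and may differ on your install:)

    grep -A1 'WakeupSchedule' /etc/BackupPC/config.pl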



Thank you.





Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread backuppc
Giannis Economou wrote at about 23:18:25 +0200 on Thursday, March 16, 2023:
 > Thank you for the extra useful "find" command.
 > I see from your command: those extra collisions are not 32chars files 
 > but 35 due to the "ext" (extension), but without _ separator.
 > 
 > Based on that, I did my investigation.
 > 
 > Some findings below...
 > 
 > A. First BackupPC server:
 > This is a backup server running since Jan 2023 (replaced other bpc 
 > servers running for years).
 > Status now:
 > - Pool is 10311.16GiB comprising 50892587 files and 16512 directories 
 > (as of 2023-03-16 13:23),
 > - Pool hashing gives 5 repeated files with longest chain 1,
 > - Nightly cleanup removed 3100543 files of size 1206.32GiB (around 
 > 2023-03-16 13:23),
 > 
 >  From those 5 cpool files, I investigated two of them (using the nice 
 > wiki page here 
 > https://github.com/backuppc/backuppc/wiki/How-to-find-which-backups-reference-a-particular-pool-file)
 > Each file of those two under investigation, was referenced in different 
 > hosts and different individual backups.
 > 
 > For both cpool files investigated, the actual file on disk was *exactly* 
 > the same file.
 > So we talk about a file in different hosts, in different paths, 
 > different dates, but having the same filename and same content.
 > So practically I understand that we have deduplications that did not 
 > happen for some reason.

Either there is some rare edge case or condition in BackupPC causing
de-duplication to fail (which I think is unlikely), or perhaps at some
point your backuppc instance crashed in the middle of saving one of
the above pool files, leaving it in some indeterminate state that
caused it to create a duplicate in the next backup.

Did you by any chance change the compression level? (though I don't
think this will break de-duplication)

What is the creation date of each of the chain elements? My Occam's
razor guess would be that all 5 were created as part of the same backup.
 > 
 > B. Second BackupPC server:
 > This is a brand new backup server, running only for 3 days until today, 
 > not even a week old (it also replaced other bpc servers running for years).
 > Therefore we only have a few backups there until today, more will add up 
 > as days go by.
 > Status now:
 > - Pool is 2207.69GiB comprising 26161662 files and 12384 directories (as 
 > of 2023-03-16 13:11),
 > - Pool hashing gives 275 repeated files with longest chain 1,
 > - Nightly cleanup removed 0 files of size 0.00GiB (around 2023-03-16 13:11),
 > 
 > What I see here is a different situation...
 > 
 > 1. The count of files found by "find" is different than the collisions 
 > reported on the status page (find counts 231, the bpc status page counts 275)
 > 2. Many pool files among the list from "find", are not referenced at all 
 > in poolCnt files (give a count of 0 in their "poolCnt")
 > 3. Even worse, many cpool/*/poolCnt files are completely missing from 
 > the cpool/.
 > For example I have a collision file:
 > cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701
 > but file: cpool/ce/poolCnt
 > does not even exist.
 > 
 > Maybe refCount is not completed yet (because server is still new)?
 > 
 > One guess is that this might happen because I have:
 >     BackupPCNightlyPeriod: 4
 >     PoolSizeNightlyUpdatePeriod: 16
 >     PoolNightlyDigestCheckPercent: 1
 > and the server is still new.
 > 
 > I will wait for a few days.
 > (BTW, storage is local and everything seems error free from day-1 on the 
 > system)
 > 
 > If such inconsistencies persist, I guess I will have to investigate 
 > "BackupPC_fsck" for this server.
 > 
 > 

As you suggest, it's really hard to know whether there is an issue
when you have not yet run a full BackupPC_nightly.

That being said, why do you divide it up over so many days? Is your
backup set that *large* that it can't complete in fewer nights?

 > 
 > 
 > On 16/3/2023 8:31 μ.μ., backu...@kosowsky.org wrote:
 > > Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023:
 > >   > In my v4 pool are collisions still generating _0, _1, _2 etc filenames
 > >   > in the pool/ ?
 > >
 > > According to the code in Lib.pm, it appears that unlike v3, there is
 > > no underscore -- it's just an (unsigned) long added to the end of the
 > > 16 byte digest.
 > >
 > >   >
 > >   > (as in the example from the docs mentions:
 > >   >      __TOPDIR__/pool/1/2/3/123456789abcdef0
 > >   >      __TOPDIR__/pool/1/2/3/123456789abcdef0_0
 > >   >      __TOPDIR__/pool/1/2/3/123456789abcdef0_1
 > >   > )
 > >
 > > That is for v3 as indicated by the 3-layer pool.
 > >
 > >   >
 > >   > I am using compression (I only have cpool/ dir) and I am asking because
 > >   > on both servers running:
 > >   >          find cpool/ -name "*_0" -print
 > >   >          find cpool/ -name "*_*" -print
 > >   >
 > >   > brings zero results.
 > >
 > > Try:
 > >
 > >  find /var/lib/backuppc/cpool/ -type f -regextype grep ! -regex 
 > > ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name 

Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Giannis Economou

Thank you for the extra useful "find" command.
I see from your command that those extra collision files are not 32-char 
names but 35 chars, due to the appended "ext" (extension) without a _ separator.


Based on that, I did my investigation.

Some findings below...

A. First BackupPC server:
This is a backup server running since Jan 2023 (replaced other bpc 
servers running for years).

Status now:
- Pool is 10311.16GiB comprising 50892587 files and 16512 directories 
(as of 2023-03-16 13:23),

- Pool hashing gives 5 repeated files with longest chain 1,
- Nightly cleanup removed 3100543 files of size 1206.32GiB (around 
2023-03-16 13:23),


From those 5 cpool files, I investigated two of them (using the nice 
wiki page here: 
https://github.com/backuppc/backuppc/wiki/How-to-find-which-backups-reference-a-particular-pool-file).
Each of the two files under investigation was referenced in different 
hosts and in different individual backups.


For both cpool files investigated, the actual file on disk was *exactly* 
the same file.
So we are talking about a file that appears on different hosts, in different 
paths, on different dates, but with the same filename and the same content.
So practically I understand that we have de-duplication that did not 
happen for some reason.


B. Second BackupPC server:
This is a brand new backup server, running for only 3 days as of today, 
not even a week old (it also replaced other bpc servers running for years).
Therefore we only have a few backups there so far; more will add up 
as days go by.

Status now:
- Pool is 2207.69GiB comprising 26161662 files and 12384 directories (as 
of 2023-03-16 13:11),

- Pool hashing gives 275 repeated files with longest chain 1,
- Nightly cleanup removed 0 files of size 0.00GiB (around 2023-03-16 13:11),

What I see here is a different situation...

1. The count of files found by "find" is different than the collisions 
reported on the status page (find counts 231, the bpc status page counts 275).
2. Many pool files in the "find" list are not referenced at all in the 
poolCnt files (they have a count of 0 in their "poolCnt").
3. Even worse, many cpool/*/poolCnt files are completely missing from 
the cpool/.

For example I have a collision file:
cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701
but file: cpool/ce/poolCnt
does not even exist.
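
A quick way to see how widespread the missing poolCnt files are (a minimal
sketch, assuming it is run from the BackupPC TopDir and the usual two-level
cpool layout):

    cd /var/lib/backuppc     # hypothetical TopDir, adjust to your install
    for d in cpool/??; do
        [ -e "$d/poolCnt" ] || echo "missing: $d/poolCnt"
    done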

Maybe refCount has not completed yet (because the server is still new)?

One guess is that this might happen because I have:
   BackupPCNightlyPeriod: 4
   PoolSizeNightlyUpdatePeriod: 16
   PoolNightlyDigestCheckPercent: 1
and the server is still new.

I will wait for a few days.
(BTW, storage is local and everything seems error free from day-1 on the 
system)


If such inconsistencies persist, I guess I will have to investigate 
"BackupPC_fsck" for this server.



Thank you.


On 16/3/2023 8:31 μ.μ., backu...@kosowsky.org wrote:

Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023:
  > In my v4 pool are collisions still generating _0, _1, _2 etc filenames
  > in the pool/ ?

According to the code in Lib.pm, it appears that unlike v3, there is
no underscore -- it's just an (unsigned) long added to the end of the
16 byte digest.

  >
  > (as in the example from the docs mentions:
  >      __TOPDIR__/pool/1/2/3/123456789abcdef0
  >      __TOPDIR__/pool/1/2/3/123456789abcdef0_0
  >      __TOPDIR__/pool/1/2/3/123456789abcdef0_1
  > )

That is for v3 as indicated by the 3-layer pool.

  >
  > I am using compression (I only have cpool/ dir) and I am asking because
  > on both servers running:
  >          find cpool/ -name "*_0" -print
  >          find cpool/ -name "*_*" -print
  >
  > brings zero results.

Try:

 find /var/lib/backuppc/cpool/ -type f -regextype grep \
      ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt"

  >
  >
  > Thank you.
  >
  >
  > On 16/3/2023 6:30 μ.μ., backu...@kosowsky.org wrote:
  > > Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
  > >   > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
  > >   > >
  > >   > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 
2023:
  > >   > > > There is no reason to be concerned. This is normal.
  > >   > >
  > >   > > It *should* be extremely, once-in-a-blue-moon, rare to randomly 
have an
  > >   > > md5sum collision -- as in 1.47*10^-29
  > >   >
  > >   > Why are you assuming this is "randomly" happening? Any time an 
identical file exists in more than one place on the client filesystem, there will be a collision. 
This is common in lots of cases. Desktop environments frequently have duplicated files scattered 
around. I used BackupPC for website backups; my chain length was approximately equal to the number 
of WordPress sites I was hosting.
  > >
  > > You are simply not understanding how file de-duplication and pool
  > > chains work in v4.
  > >
  > > Identical files contribute only a single chain instance -- no matter
  > > how many clients you are backing up and no matter how many backups you
  

Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread
Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023:
 > In my v4 pool are collisions still generating _0, _1, _2 etc filenames 
 > in the pool/ ?

According to the code in Lib.pm, it appears that unlike v3, there is
no underscore -- it's just an (unsigned) long added to the end of the
16 byte digest.

 > 
 > (as in the example from the docs mentions:
 >      __TOPDIR__/pool/1/2/3/123456789abcdef0
 >      __TOPDIR__/pool/1/2/3/123456789abcdef0_0
 >      __TOPDIR__/pool/1/2/3/123456789abcdef0_1
 > )

That is for v3 as indicated by the 3-layer pool.

 > 
 > I am using compression (I only have cpool/ dir) and I am asking because 
 > on both servers running:
 >          find cpool/ -name "*_0" -print
 >          find cpool/ -name "*_*" -print
 > 
 > brings zero results.

Try:

find /var/lib/backuppc/cpool/ -type f -regextype grep \
     ! -regex ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt"

 > 
 > 
 > Thank you.
 > 
 > 
 > On 16/3/2023 6:30 μ.μ., backu...@kosowsky.org wrote:
 > > Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
 > >   > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
 > >   > >
 > >   > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 
 > > 2023:
 > >   > > > There is no reason to be concerned. This is normal.
 > >   > >
 > >   > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have 
 > > an
 > >   > > md5sum collision -- as in 1.47*10^-29
 > >   >
 > >   > Why are you assuming this is "randomly" happening? Any time an 
 > > identical file exists in more than one place on the client filesystem, 
 > > there will be a collision. This is common in lots of cases. Desktop 
 > > environments frequently have duplicated files scattered around. I used 
 > > BackupPC for website backups; my chain length was approximately equal to 
 > > the number of WordPress sites I was hosting.
 > >
 > > You are simply not understanding how file de-duplication and pool
 > > chains work in v4.
 > >
 > > Identical files contribute only a single chain instance -- no matter
 > > how many clients you are backing up and no matter how many backups you
 > > save of each client. This is what de-duplication does.
 > >
 > > The fact that they appear on different clients and/or in different
 > > parts of the filesystem is reflected in the attrib files in the pc
 > > subdirectories for each client. This is where the metadata is stored.
 > >
 > > Chain lengths have to do with pool storage of the file contents
 > > (ignoring metadata). Lengths greater than 1 only occur if you have
 > > md5sum hash collisions -- i.e., two files (no matter on what client or
 > > where in the filesystem) with non-identical contents but the same
 > > md5sum hash.
 > >
 > > Such collisions are statistically exceedingly unlikely to occur on
 > > normal data where you haven't worked hard to create such collisions.
 > >
 > > For example, on my backup server:
 > >Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 
 > > directories (as of 2023-03-16 01:11),
 > >Pool hashing gives 0+0 repeated files with longest chain 0+0,
 > >
 > > I strongly suggest you read the documentation on BackupPC before
 > > making wildly erroneous assumptions about chains. You can also look at
 > > the code in BackupPC_refCountUpdate which defines how $fileCntRep and
 > > $fileCntRepMax are calculated.
 > >
 > > Also, if what you said were true, the OP would have multiple chains -
 > > presumably one for each distinct file that is "scattered around"
 > >
 > > If you are using v4.x and have pool hashing with such collisions, it
 > > would be great to see them. I suspect you are either using v3 or you
 > > are using v4 with a legacy v3 pool
 > >
 > >   > > You would have to work hard to artificially create such collisions.
 > >   >
 > >   > $ echo 'hello world' > ~/file_a
 > >   > $ cp ~/file_a ~/file_b
 > >   > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo 
 > > "MATCH"
 > >   >
 > >   > __
 > >   > Rob Sheldon
 > >   > Contract software developer, devops, security, technical lead
 > >   >
 > >   >
 > >   > ___
 > >   > BackupPC-users mailing list
 > >   > BackupPC-users@lists.sourceforge.net
 > >   > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
 > >   > Wiki:https://github.com/backuppc/backuppc/wiki
 > >   > Project: https://backuppc.github.io/backuppc/
 > >
 > >
 > > ___
 > > BackupPC-users mailing list
 > > BackupPC-users@lists.sourceforge.net
 > > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
 > > Wiki:https://github.com/backuppc/backuppc/wiki
 > > Project: https://backuppc.github.io/backuppc/
 > 
 > 
 > ___
 > BackupPC-users mailing list
 > BackupPC-users@lists.sourceforge.net
 > List:

Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread
Rob Sheldon wrote at about 09:56:51 -0700 on Thursday, March 16, 2023:
 > The mailing list tyrant's reply however made me realize I don't actually 
 > need to be chasing this down just right now, so I won't be following up on 
 > this further.
 >  I was dormant on this list for quite a few years -- which probably 
 > contributed to my misinformation -- and I suddenly have remembered why. :-)

Whatever.

- I help people, but I have high expectations: if I am going to
  invest time in helping people with answers, then I in turn expect
  those seeking help to invest time and effort in troubleshooting the
  problem in advance and in formulating a specific question. This is a
  sysadmin-level tool and I don't get paid to coddle users.

- You have no standards for how questions are formulated and then proceed
  repeatedly to give diametrically wrong advice. Garbage in, garbage
  out.

Similarly,
- I spent significant time today reviewing the code and verifying my
  understanding of BackupPC before answering

- You seemingly spent zero time researching the problem and just cited
  your vague understanding of how BackupPC works based on information
  that you admit is from "quite a few years" ago

Who is a more helpful contributor to the list?

Please put on some big boy pants...


> 
 > __
 > Rob Sheldon
 > Contract software developer, devops, security, technical lead
 > 
 > 
 > ___
 > BackupPC-users mailing list
 > BackupPC-users@lists.sourceforge.net
 > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
 > Wiki:https://github.com/backuppc/backuppc/wiki
 > Project: https://backuppc.github.io/backuppc/




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Giannis Economou
In my v4 pool are collisions still generating _0, _1, _2 etc filenames 
in the pool/ ?


(as in the example from the docs mentions:
    __TOPDIR__/pool/1/2/3/123456789abcdef0
    __TOPDIR__/pool/1/2/3/123456789abcdef0_0
    __TOPDIR__/pool/1/2/3/123456789abcdef0_1
)

I am using compression (I only have cpool/ dir) and I am asking because 
on both servers running:

        find cpool/ -name "*_0" -print
        find cpool/ -name "*_*" -print

brings zero results.


Thank you.


On 16/3/2023 6:30 μ.μ., backu...@kosowsky.org wrote:

Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
  > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
  > >
  > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
  > > > There is no reason to be concerned. This is normal.
  > >
  > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
  > > md5sum collision -- as in 1.47*10^-29
  >
  > Why are you assuming this is "randomly" happening? Any time an identical 
file exists in more than one place on the client filesystem, there will be a collision. This 
is common in lots of cases. Desktop environments frequently have duplicated files scattered 
around. I used BackupPC for website backups; my chain length was approximately equal to the 
number of WordPress sites I was hosting.

You are simply not understanding how file de-duplication and pool
chains work in v4.

Identical files contribute only a single chain instance -- no matter
how many clients you are backing up and no matter how many backups you
save of each client. This is what de-duplication does.

The fact that they appear on different clients and/or in different
parts of the filesystem is reflected in the attrib files in the pc
subdirectories for each client. This is where the metadata is stored.

Chain lengths have to do with pool storage of the file contents
(ignoring metadata). Lengths greater than 1 only occur if you have
md5sum hash collisions -- i.e., two files (no matter on what client or
where in the filesystem) with non-identical contents but the same
md5sum hash.

Such collisions are statistically exceedingly unlikely to occur on
normal data where you haven't worked hard to create such collisions.

For example, on my backup server:
Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 
directories (as of 2023-03-16 01:11),
Pool hashing gives 0+0 repeated files with longest chain 0+0,

I strongly suggest you read the documentation on BackupPC before
making wildly erroneous assumptions about chains. You can also look at
the code in BackupPC_refCountUpdate which defines how $fileCntRep and
$fileCntRepMax are calculated.

Also, if what you said were true, the OP would have multiple chains -
presumably one for each distinct file that is "scattered around"

If you are using v4.x and have pool hashing with such collisions, it
would be great to see them. I suspect you are either using v3 or you
are using v4 with a legacy v3 pool

  > > You would have to work hard to artificially create such collisions.
  >
  > $ echo 'hello world' > ~/file_a
  > $ cp ~/file_a ~/file_b
  > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo 
"MATCH"
  >
  > __
  > Rob Sheldon
  > Contract software developer, devops, security, technical lead
  >
  >
  > ___
  > BackupPC-users mailing list
  > BackupPC-users@lists.sourceforge.net
  > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
  > Wiki:https://github.com/backuppc/backuppc/wiki
  > Project: https://backuppc.github.io/backuppc/




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Rob Sheldon
On Thu, Mar 16, 2023, at 8:59 AM, Les Mikesell wrote:
> On Thu, Mar 16, 2023 at 10:53 AM Rob Sheldon  wrote:
> >
> > Why are you assuming this is "randomly" happening? Any time an identical 
> > file exists in more than one place on the client filesystem, there will be 
> > a collision. This is common in lots of cases. Desktop environments 
> > frequently have duplicated files scattered around. I used BackupPC for 
> > website backups; my chain length was approximately equal to the number of 
> > WordPress sites I was hosting.
> >
> Identical files are not collisions to backuppc - they are de-duplicated.

Hey Les,

Thanks for the heads-up. Your message prompted me to pull the current codebase 
and do some grepping around in it to better understand this. I believed that 
the MD5 hash was the mechanism used for deduplication, but that could have been 
wrong at any point, V3 or otherwise.

The mailing list tyrant's reply however made me realize I don't actually need 
to be chasing this down just right now, so I won't be following up on this 
further. I was dormant on this list for quite a few years -- which probably 
contributed to my misinformation -- and I suddenly have remembered why. :-)

__
Rob Sheldon
Contract software developer, devops, security, technical lead




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread
Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
 > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
 > > 
 > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
 > > > There is no reason to be concerned. This is normal.
 > > 
 > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
 > > md5sum collision -- as in 1.47*10^-29
 > 
 > Why are you assuming this is "randomly" happening? Any time an identical 
 > file exists in more than one place on the client filesystem, there will be a 
 > collision. This is common in lots of cases. Desktop environments frequently 
 > have duplicated files scattered around. I used BackupPC for website backups; 
 > my chain length was approximately equal to the number of WordPress sites I 
 > was hosting.

You are simply not understanding how file de-duplication and pool
chains work in v4.

Identical files contribute only a single chain instance -- no matter
how many clients you are backing up and no matter how many backups you
save of each client. This is what de-duplication does.

The fact that they appear on different clients and/or in different
parts of the filesystem is reflected in the attrib files in the pc
subdirectories for each client. This is where the metadata is stored.

Chain lengths have to do with pool storage of the file contents
(ignoring metadata). Lengths greater than 1 only occur if you have
md5sum hash collisions -- i.e., two files (no matter on what client or
where in the filesystem) with non-identical contents but the same
md5sum hash.

Such collisions are statistically exceedingly unlikely to occur on
normal data where you haven't worked hard to create such collisions.

For example, on my backup server:
Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 
directories (as of 2023-03-16 01:11),
Pool hashing gives 0+0 repeated files with longest chain 0+0,

I strongly suggest you read the documentation on BackupPC before
making wildly erroneous assumptions about chains. You can also look at
the code in BackupPC_refCountUpdate which defines how $fileCntRep and
$fileCntRepMax are calculated.

Also, if what you said were true, the OP would have multiple chains -
presumably one for each distinct file that is "scattered around"

If you are using v4.x and have pool hashing with such collisions, it
would be great to see them. I suspect you are either using v3 or you
are using v4 with a legacy v3 pool

 > > You would have to work hard to artificially create such collisions.
 > 
 > $ echo 'hello world' > ~/file_a
 > $ cp ~/file_a ~/file_b
 > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH"
 > 
 > __
 > Rob Sheldon
 > Contract software developer, devops, security, technical lead
 > 
 > 
 > ___
 > BackupPC-users mailing list
 > BackupPC-users@lists.sourceforge.net
 > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
 > Wiki:https://github.com/backuppc/backuppc/wiki
 > Project: https://backuppc.github.io/backuppc/




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Les Mikesell
On Thu, Mar 16, 2023 at 10:53 AM Rob Sheldon  wrote:
>
> Why are you assuming this is "randomly" happening? Any time an identical file 
> exists in more than one place on the client filesystem, there will be a 
> collision. This is common in lots of cases. Desktop environments frequently 
> have duplicated files scattered around. I used BackupPC for website backups; 
> my chain length was approximately equal to the number of WordPress sites I 
> was hosting.
>
Identical files are not collisions to backuppc - they are de-duplicated.

-- 
   Les Mikesell
 lesmikes...@gmail.com




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Rob Sheldon
On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
> 
> Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
> > There is no reason to be concerned. This is normal.
> 
> It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
> md5sum collision -- as in 1.47*10^-29

Why are you assuming this is "randomly" happening? Any time an identical file 
exists in more than one place on the client filesystem, there will be a 
collision. This is common in lots of cases. Desktop environments frequently 
have duplicated files scattered around. I used BackupPC for website backups; my 
chain length was approximately equal to the number of WordPress sites I was 
hosting.

> You would have to work hard to artificially create such collisions.

$ echo 'hello world' > ~/file_a
$ cp ~/file_a ~/file_b
$ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH"

__
Rob Sheldon
Contract software developer, devops, security, technical lead




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread backuppc


Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
 > There is no reason to be concerned. This is normal.
Not really!
The OP claimed he is using v4, which uses full-file md5sum checksums as
the pool file name hash.

It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
md5sum collision -- as in 1.47*10^-29

Randomly having 165 hash collisions is several orders of magnitude
more unlikely.

You would have to work hard to artificially create such collisions.

 > Searching for "backuppc pool hashing chain" finds 
 > https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. 
 > The documentation could perhaps be more specific about this, but BackupPC 
 > "chains" hash collisions, so no data is lost or damaged. This is a pretty 
 > standard compsci data structure. If you search the BackupPC repository for 
 > the message you're seeing, you find 
 > https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311,
 >  which gives you a couple of variable names that you can use to search the 
 > codebase and satisfy your curiosity.
 >

This OLD (ca 2008) reference is totally IRRELEVANT for v4.x.
Hash collisions on v3.x were EXTREMELY common and it was not unusual to
have even long chains of collisions since the md5sum was computed
using only the first and last 128KB (I think) of the file plus the
file length. So any long file whose middle section changed by even
just one byte would create a collision chain.
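
(A rough illustration of that idea only -- not the exact v3 digest code, which
lives in Lib.pm -- showing why two large files that differ only in the middle
end up with the same partial-file hash; the file path is hypothetical:)

    f=/path/to/some/large/file    # hypothetical file
    { stat -c%s "$f"; head -c 131072 "$f"; tail -c 131072 "$f"; } | md5sum
    # Only the length plus the first and last 128 KB feed the hash, so a
    # change confined to the middle of the file does not change the output.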

 > "165 repeated files with longest chain 1" just means that there are 165 
 > different files that had a digest (hash) that matched another file. This can 
 > happen for a number of different reasons. A common one is that there are 
 > identical files on the backuppc client system that have different paths.
 > 
 > I've been running BackupPC for a *long* time and have always had chained 
 > hashes in my server status. It's an informational message, not an error.
 > 

Are you running v3 or v4?
You are right though it's technically not an error even in v4 in that
md5sum collisions do of course exist -- just that they are extremely rare.

It would REALLY REALLY REALLY help if the OP would do some WORK to
help HIMSELF diagnose the problem.

Since it seems like only a single md5sum chain is involved, it would
seem blindingly obvious that the first step in troubleshooting would
be to determine what is the nature of these files with the same md5sum
hash.

Specifically,
- Do they indeed all have identical md5sum?
- Are they indeed distinct files (despite having the same md5sum)?
- How, if at all, are these files special? (the first two points can be
  checked as sketched below)
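
A minimal sketch of such a check (hypothetical paths; BackupPC_zcat is the
pool-file decompression helper shipped with BackupPC, typically under the
install's bin/ directory; bash is needed for the process substitution):

    ZCAT=/usr/share/backuppc/bin/BackupPC_zcat   # hypothetical install path
    A=/var/lib/backuppc/cpool/xx/yy/DIGEST       # hypothetical: base pool file
    B=/var/lib/backuppc/cpool/xx/yy/DIGEST01     # hypothetical: its chain member
    "$ZCAT" "$A" | md5sum                        # do the uncompressed contents
    "$ZCAT" "$B" | md5sum                        # hash to the same md5sum?
    cmp -s <("$ZCAT" "$A") <("$ZCAT" "$B") \
        && echo "identical content (missed de-duplication)" \
        || echo "different content (a true md5sum collision)"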

 > On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote:
 > > The message appears in "Server Information" section, inside server 
 > > status, in the web interface of BackupPC (page "BackupPC Server Status").
 > > 
 > > Both servers are running BackupPC version 4.4.0.
 > > 1st server says:  "Pool hashing gives 165 repeated files with longest 
 > > chain 1"
 > > 2nd server says:  "Pool hashing gives 5 repeated files with longest chain 
 > > 1"
 > > 
 > > I could not find more info/details about this message in the documentation.
 > > Only a comment in github is mentioning this message, and to my 
 > > understanding it is related to hash collisions in the pool.
 > > 
 > > Since hash collisions might be worrying (in theory), my question is if 
 > > this is alarming and in that case if there is something that we can do 
 > > about it (for example actions to zero out any collisions if needed).
 > > 
 > > 
 > > Thank you very much.
 > 
 > __
 > Rob Sheldon
 > Contract software developer, devops, security, technical lead
 > 
 > 
 > ___
 > BackupPC-users mailing list
 > BackupPC-users@lists.sourceforge.net
 > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
 > Wiki:https://github.com/backuppc/backuppc/wiki
 > Project: https://backuppc.github.io/backuppc/




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Giannis Economou

All is clear now.
Thank you so much for your detailed reply.


On 16/3/2023 8:54 π.μ., Rob Sheldon wrote:

There is no reason to be concerned. This is normal.

Searching for "backuppc pool hashing chain" finds 
https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. The documentation could 
perhaps be more specific about this, but BackupPC "chains" hash collisions, so no data is 
lost or damaged. This is a pretty standard compsci data structure. If you search the BackupPC 
repository for the message you're seeing, you find 
https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311,
 which gives you a couple of variable names that you can use to search the codebase and satisfy 
your curiosity.

"165 repeated files with longest chain 1" just means that there are 165 
different files that had a digest (hash) that matched another file. This can happen for a 
number of different reasons. A common one is that there are identical files on the 
backuppc client system that have different paths.

I've been running BackupPC for a *long* time and have always had chained hashes 
in my server status. It's an informational message, not an error.

On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote:

The message appears in "Server Information" section, inside server
status, in the web interface of BackupPC (page "BackupPC Server Status").

Both servers are running BackupPC version 4.4.0.
1st server says:  "Pool hashing gives 165 repeated files with longest
chain 1"
2nd server says:  "Pool hashing gives 5 repeated files with longest chain 1"

I could not find more info/details about this message in the documentation.
Only a comment in github is mentioning this message, and to my
understanding it is related to hash collisions in the pool.

Since hash collisions might be worrying (in theory), my question is if
this is alarming and in that case if there is something that we can do
about it (for example actions to zero out any collisions if needed).


Thank you very much.

__
Rob Sheldon
Contract software developer, devops, security, technical lead




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Rob Sheldon
There is no reason to be concerned. This is normal.

Searching for "backuppc pool hashing chain" finds 
https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. The 
documentation could perhaps be more specific about this, but BackupPC "chains" 
hash collisions, so no data is lost or damaged. This is a pretty standard 
compsci data structure. If you search the BackupPC repository for the message 
you're seeing, you find 
https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311,
 which gives you a couple of variable names that you can use to search the 
codebase and satisfy your curiosity.

"165 repeated files with longest chain 1" just means that there are 165 
different files that had a digest (hash) that matched another file. This can 
happen for a number of different reasons. A common one is that there are 
identical files on the backuppc client system that have different paths.

I've been running BackupPC for a *long* time and have always had chained hashes 
in my server status. It's an informational message, not an error.

On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote:
> The message appears in "Server Information" section, inside server 
> status, in the web interface of BackupPC (page "BackupPC Server Status").
> 
> Both servers are running BackupPC version 4.4.0.
> 1st server says:  "Pool hashing gives 165 repeated files with longest 
> chain 1"
> 2nd server says:  "Pool hashing gives 5 repeated files with longest chain 1"
> 
> I could not find more info/details about this message in the documentation.
> Only a comment in github is mentioning this message, and to my 
> understanding it is related to hash collisions in the pool.
> 
> Since hash collisions might be worrying (in theory), my question is if 
> this is alarming and in that case if there is something that we can do 
> about it (for example actions to zero out any collisions if needed).
> 
> 
> Thank you very much.

__
Rob Sheldon
Contract software developer, devops, security, technical lead




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-14 Thread John
That's the sort of attitude that puts genuine people off asking for help 
on mailing lists. Rather than posting a cutting reply, why not simply 
ask for a little more information if that's what's needed? Please try to 
remember that English is not everyone's first language and at the end of 
the day we are all human and have feelings...


On 14/03/2023 19:17, backu...@kosowsky.org wrote:

Can you please invest some minimal effort in asking a specific
question and supplying relevant data?

Not just a random line pulled presumably from a log file or web page
and saying "I guess it's a problem".

Giannis Economou wrote at about 15:59:27 +0200 on Tuesday, March 14, 2023:
  > Hello,
  >
  > running 2 backuppc servers and:
  >
  > * on 1st one we get:
  > Pool hashing gives 165 repeated files with longest chain 1
  >
  > * on 2nd one we get:
  > Pool hashing gives 5 repeated files with longest chain 1
  >
  >
  > I guess this might be a problem? If yes, what can we do about this?
  >
  >
  >
  > Thank you.
  >
  >
  >
  >
  >
  > ___
  > BackupPC-users mailing list
  > BackupPC-users@lists.sourceforge.net
  > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
  > Wiki:https://github.com/backuppc/backuppc/wiki
  > Project: https://backuppc.github.io/backuppc/




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-14 Thread Giannis Economou
The message appears in "Server Information" section, inside server 
status, in the web interface of BackupPC (page "BackupPC Server Status").


Both servers are running BackupPC version 4.4.0.
1st server says:  "Pool hashing gives 165 repeated files with longest 
chain 1"

2nd server says:  "Pool hashing gives 5 repeated files with longest chain 1"

I could not find more info/details about this message in the documentation.
Only a comment on GitHub mentions this message, and to my 
understanding it is related to hash collisions in the pool.


Since hash collisions might be worrying (in theory), my question is 
whether this is alarming and, if so, whether there is something we can do 
about it (for example, actions to eliminate any collisions if needed).



Thank you very much.



On 14/3/2023 9:17 μ.μ., backu...@kosowsky.org wrote:

Can you please invest some minimal effort in asking a specific
question and supplying relevant data?

Not just a random line pulled presumably from a log file or web page
and saying "I guess it's a problem".

Giannis Economou wrote at about 15:59:27 +0200 on Tuesday, March 14, 2023:
  > Hello,
  >
  > running 2 backuppc servers and:
  >
  > * on 1st one we get:
  > Pool hashing gives 165 repeated files with longest chain 1
  >
  > * on 2nd one we get:
  > Pool hashing gives 5 repeated files with longest chain 1
  >
  >
  > I guess this might be a problem? If yes, what can we do about this?
  >
  >
  >
  > Thank you.
  >
  >
  >
  >
  >
  > ___
  > BackupPC-users mailing list
  > BackupPC-users@lists.sourceforge.net
  > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
  > Wiki:https://github.com/backuppc/backuppc/wiki
  > Project: https://backuppc.github.io/backuppc/




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-14 Thread backuppc
Can you please invest some minimal effort in asking a specific
question and supplying relevant data?

Not just a random line pulled presumably from a log file or web page
and saying "I guess it's a problem".

Giannis Economou wrote at about 15:59:27 +0200 on Tuesday, March 14, 2023:
 > Hello,
 > 
 > running 2 backuppc servers and:
 > 
 > * on 1st one we get:
 > Pool hashing gives 165 repeated files with longest chain 1
 > 
 > * on 2nd one we get:
 > Pool hashing gives 5 repeated files with longest chain 1
 > 
 > 
 > I guess this might be a problem? If yes, what can we do about this?
 > 
 > 
 > 
 > Thank you.
 > 
 > 
 > 
 > 
 > 
 > ___
 > BackupPC-users mailing list
 > BackupPC-users@lists.sourceforge.net
 > List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
 > Wiki:https://github.com/backuppc/backuppc/wiki
 > Project: https://backuppc.github.io/backuppc/




[BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-14 Thread Giannis Economou

Hello,

we are running 2 backuppc servers, and:

* on 1st one we get:
Pool hashing gives 165 repeated files with longest chain 1

* on 2nd one we get:
Pool hashing gives 5 repeated files with longest chain 1


I guess this might be a problem? If yes, what can we do about this?



Thank you.




