Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread backuppc
Giannis Economou wrote at about 23:18:25 +0200 on Thursday, March 16, 2023:
 > Thank you for the extra useful "find" command.
 > I see from your command: those extra collision files are not 32-character 
 > names but 35, due to the appended "ext" (extension), with no _ separator.
 > 
 > Based on that, I did my investigation.
 > 
 > Some findings below...
 > 
 > A. First BackupPC server:
 > This is a backup server running since Jan 2023 (replaced other bpc 
 > servers running for years).
 > Status now:
 > - Pool is 10311.16GiB comprising 50892587 files and 16512 directories 
 > (as of 2023-03-16 13:23),
 > - Pool hashing gives 5 repeated files with longest chain 1,
 > - Nightly cleanup removed 3100543 files of size 1206.32GiB (around 
 > 2023-03-16 13:23),
 > 
 >  From those 5 cpool files, I investigated two of them (using the nice 
 > wiki page here 
 > https://github.com/backuppc/backuppc/wiki/How-to-find-which-backups-reference-a-particular-pool-file)
 > Each of the two files under investigation was referenced by different 
 > hosts and in different individual backups.
 > 
 > For both cpool files investigated, the actual file on disk was *exactly* 
 > the same file.
 > So we are talking about a file that appears on different hosts, in 
 > different paths, on different dates, but with the same filename and 
 > the same content.
 > So practically I understand that deduplication did not happen for some reason.

Either there is some rare edge case in BackupPC causing
de-duplication to fail (which I think is unlikely), or perhaps at some
point your backuppc instance crashed in the middle of saving one of
the above pool files, leaving it in an indeterminate state that caused
a duplicate to be created in the next backup.

Did you by any chance change the compression level? (though I don't
think this will break de-duplication)

What is the creation date of each of the chain elements? My Occam's
razor guess would be that all 5 were created as part of the same backup.
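
One quick way to check (just a sketch -- the path assumes the usual
/var/lib/backuppc location, and the digest below is only a placeholder;
substitute one of the repeated digests your find run turns up):

    # list every member of one chain together with its timestamps
    find /var/lib/backuppc/cpool -name '123456789abcdef0123456789abcdef0*' \
         -exec ls -l --full-time {} +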
 > 
 > B. Second BackupPC server:
 > This is a brand new backup server, running for only 3 days so far, 
 > not even a week young (it also replaced other bpc servers running for years).
 > Therefore we only have a few backups there so far; more will add up 
 > as days go by.
 > Status now:
 > - Pool is 2207.69GiB comprising 26161662 files and 12384 directories (as 
 > of 2023-03-16 13:11),
 > - Pool hashing gives 275 repeated files with longest chain 1,
 > - Nightly cleanup removed 0 files of size 0.00GiB (around 2023-03-16 13:11),
 > 
 > What I see here is a different situation...
 > 
 > 1. The count of files found by "find" differs from the collisions 
 > reported on the status page (find counts 231, the bpc status page counts 275)
 > 2. Many pool files in the "find" list are not referenced at all 
 > in the poolCnt files (they have a count of 0 in their "poolCnt")
 > 3. Even worse, many cpool/*/poolCnt files are completely missing from 
 > the cpool/.
 > For example I have a collision file:
 > cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701
 > but file: cpool/ce/poolCnt
 > does not even exist.
 > 
 > Maybe refCount is not completed yet (because server is still new)?
 > 
 > One guess is that this might happen because I have:
 >     BackupPCNightlyPeriod: 4
 >     PoolSizeNightlyUpdatePeriod: 16
 >     PoolNightlyDigestCheckPercent: 1
 > and the server is still new.
 > 
 > I will wait for a few days.
 > (BTW, storage is local and everything seems error-free from day 1 on the 
 > system)
 > 
 > If such inconsistencies persist, I guess I will have to investigate 
 > "BackupPC_fsck" for this server.
 > 
 > 

As you suggest, it's really hard to know whether there is an issue
when you have not yet run a full BackupPC_nightly.

That being said, why do you divide it up over so many days? Is your
backup set so *large* that it can't complete in fewer nights?

 > 
 > 
 > On 16/3/2023 8:31 p.m., backu...@kosowsky.org wrote:
 > > Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023:
 > >   > In my v4 pool are collisions still generating _0, _1, _2 etc filenames
 > >   > in the pool/ ?
 > >
 > > According to the code in Lib.pm, it appears that unlike v3, there is
 > > no underscore -- it's just an (unsigned) long added to the end of the
 > > 16-byte digest.
 > >
 > >   >
 > >   > (as in the example from the docs mentions:
 > >   >      __TOPDIR__/pool/1/2/3/123456789abcdef0
 > >   >      __TOPDIR__/pool/1/2/3/123456789abcdef0_0
 > >   >      __TOPDIR__/pool/1/2/3/123456789abcdef0_1
 > >   > )
 > >
 > > That is for v3 as indicated by the 3-layer pool.
 > >
 > >   >
 > >   > I am using compression (I only have cpool/ dir) and I am asking because
 > >   > on both servers running:
 > >   >          find cpool/ -name "*_0" -print
 > >   >          find cpool/ -name "*_*" -print
 > >   >
 > >   > brings zero results.
 > >
 > > Try:
 > >
 > >  find /var/lib/backuppc/cpool/ -type f -regextype grep ! -regex 
 > > ".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt"

Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Giannis Economou

Thank you for the extra useful "find" command.
I see from your command: those extra collision files are not 32-character 
names but 35, due to the appended "ext" (extension), with no _ separator.


Based on that, I did my investigation.

Some findings below...

A. First BackupPC server:
This is a backup server running since Jan 2023 (replaced other bpc 
servers running for years).

Status now:
- Pool is 10311.16GiB comprising 50892587 files and 16512 directories 
(as of 2023-03-16 13:23),

- Pool hashing gives 5 repeated files with longest chain 1,
- Nightly cleanup removed 3100543 files of size 1206.32GiB (around 
2023-03-16 13:23),


From those 5 cpool files, I investigated two of them (using the nice 
wiki page here 
https://github.com/backuppc/backuppc/wiki/How-to-find-which-backups-reference-a-particular-pool-file)
Each of the two files under investigation was referenced by different 
hosts and in different individual backups.


For both cpool files investigated, the actual file on disk was *exactly* 
the same file.
So we are talking about a file that appears on different hosts, in 
different paths, on different dates, but with the same filename and the 
same content.
So practically I understand that deduplication did not happen for some reason.


B. Second BackupPC server:
This is a brand new backup server, running for only 3 days so far, 
not even a week young (it also replaced other bpc servers running for years).
Therefore we only have a few backups there so far; more will add up 
as days go by.

Status now:
- Pool is 2207.69GiB comprising 26161662 files and 12384 directories (as 
of 2023-03-16 13:11),

- Pool hashing gives 275 repeated files with longest chain 1,
- Nightly cleanup removed 0 files of size 0.00GiB (around 2023-03-16 13:11),

What I see here is a different situation...

1. The count of files found by "find" differs from the collisions 
reported on the status page (find counts 231, the bpc status page counts 275)
2. Many pool files in the "find" list are not referenced at all 
in the poolCnt files (they have a count of 0 in their "poolCnt")
3. Even worse, many cpool/*/poolCnt files are completely missing from 
the cpool/.

For example I have a collision file:
cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701
but file: cpool/ce/poolCnt
does not even exist.

Maybe refCount is not completed yet (because the server is still new)?
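
(A quick way to see how widespread this is -- just a sketch, assuming 
the standard /var/lib/backuppc layout:)

    # list top-level cpool directories that do not have a poolCnt file yet
    for d in /var/lib/backuppc/cpool/*/ ; do
        [ -e "${d}poolCnt" ] || echo "missing: ${d}poolCnt"
    done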

One guess is that this might happen because I have:
   BackupPCNightlyPeriod: 4
   PoolSizeNightlyUpdatePeriod: 16
   PoolNightlyDigestCheckPercent: 1
and the server is still new.

I will wait for a few days.
(BTW, storage is local and everything seems error-free from day 1 on the 
system)


If such inconsistencies persist, I guess I will have to investigate 
"BackupPC_fsck" for this server.



Thank you.


On 16/3/2023 8:31 p.m., backu...@kosowsky.org wrote:

Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023:
  > In my v4 pool are collisions still generating _0, _1, _2 etc filenames
  > in the pool/ ?

According to the code in Lib.pm, it appears that unlike v3, there is
no underscore -- it's just an (unsigned) long added to the end of the
16-byte digest.

  >
  > (as in the example from the docs mentions:
  >      __TOPDIR__/pool/1/2/3/123456789abcdef0
  >      __TOPDIR__/pool/1/2/3/123456789abcdef0_0
  >      __TOPDIR__/pool/1/2/3/123456789abcdef0_1
  > )

That is for v3 as indicated by the 3-layer pool.

  >
  > I am using compression (I only have cpool/ dir) and I am asking because
  > on both servers running:
  >          find cpool/ -name "*_0" -print
  >          find cpool/ -name "*_*" -print
  >
  > brings zero results.

Try:

 find /var/lib/backuppc/cpool/ -type f -regextype grep ! -regex ".*/[0-9a-f]\{32\}" ! -name 
"LOCK" ! -name "poolCnt"

  >
  >
  > Thank you.
  >
  >
  > On 16/3/2023 6:30 p.m., backu...@kosowsky.org wrote:
  > > Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
  > >   > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
  > >   > >
  > >   > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 
2023:
  > >   > > > There is no reason to be concerned. This is normal.
  > >   > >
  > >   > > It *should* be extremely, once-in-a-blue-moon, rare to randomly 
have an
  > >   > > md5sum collision -- as in 1.47*10^-29
  > >   >
  > >   > Why are you assuming this is "randomly" happening? Any time an 
identical file exists in more than one place on the client filesystem, there will be a collision. 
This is common in lots of cases. Desktop environments frequently have duplicated files scattered 
around. I used BackupPC for website backups; my chain length was approximately equal to the number 
of WordPress sites I was hosting.
  > >
  > > You are simply not understanding how file de-duplication and pool
  > > chains work in v4.
  > >
  > > Identical files contribute only a single chain instance -- no matter
  > > how many clients you are backing up and no matter how many backups you
  

Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread
Giannis Economou wrote at about 19:12:56 +0200 on Thursday, March 16, 2023:
 > In my v4 pool are collisions still generating _0, _1, _2 etc filenames 
 > in the pool/ ?

According to the code in Lib.pm, it appears that unlike v3, there is
no underscore -- it's just an (unsigned) long added to the end of the
16-byte digest.

 > 
 > (as in the example from the docs mentions:
 >      __TOPDIR__/pool/1/2/3/123456789abcdef0
 >      __TOPDIR__/pool/1/2/3/123456789abcdef0_0
 >      __TOPDIR__/pool/1/2/3/123456789abcdef0_1
 > )

That is for v3 as indicated by the 3-layer pool.
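
Putting the two layouts side by side (using the docs example above and 
the collision file name that comes up elsewhere in this thread):

    v3: __TOPDIR__/pool/1/2/3/123456789abcdef0_1
        (3 single-hex-digit levels, "_N" suffix for chain members)
    v4: __TOPDIR__/cpool/ce/b0/ceb1e6764a28b208d51a7801052118d701
        (2 two-hex-digit levels, chain number appended directly to the digest)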

 > 
 > I am using compression (I only have cpool/ dir) and I am asking because 
 > on both servers running:
 >          find cpool/ -name "*_0" -print
 >          find cpool/ -name "*_*" -print
 > 
 > brings zero results.

Try:

find /var/lib/backuppc/cpool/ -type f -regextype grep ! -regex 
".*/[0-9a-f]\{32\}" ! -name "LOCK" ! -name "poolCnt"

 > 
 > 
 > Thank you.
 > 
 > 
 > On 16/3/2023 6:30 p.m., backu...@kosowsky.org wrote:
 > > Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
 > >   > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
 > >   > >
 > >   > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 
 > > 2023:
 > >   > > > There is no reason to be concerned. This is normal.
 > >   > >
 > >   > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have 
 > > an
 > >   > > md5sum collision -- as in 1.47*10^-29
 > >   >
 > >   > Why are you assuming this is "randomly" happening? Any time an 
 > > identical file exists in more than one place on the client filesystem, 
 > > there will be a collision. This is common in lots of cases. Desktop 
 > > environments frequently have duplicated files scattered around. I used 
 > > BackupPC for website backups; my chain length was approximately equal to 
 > > the number of WordPress sites I was hosting.
 > >
 > > You are simply not understanding how file de-duplication and pool
 > > chains work in v4.
 > >
 > > Identical files contribute only a single chain instance -- no matter
 > > how many clients you are backing up and no matter how many backups you
 > > save of each client. This is what de-duplication does.
 > >
 > > The fact that they appear on different clients and/or in different
 > > parts of the filesystem is reflected in the attrib files in the pc
 > > subdirectories for each client. This is where the metadata is stored.
 > >
 > > Chain lengths have to do with pool storage of the file contents
 > > (ignoring metadata). Lengths greater than 1 only occur if you have
 > > md5sum hash collisions -- i.e., two files (no matter on what client or
 > > where in the filesystem) with non-identical contents but the same
 > > md5sum hash.
 > >
 > > Such collisions are statistically exceedingly unlikely to occur on
 > > normal data where you haven't worked hard to create such collisions.
 > >
 > > For example, on my backup server:
 > >Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 
 > > directories (as of 2023-03-16 01:11),
 > >Pool hashing gives 0+0 repeated files with longest chain 0+0,
 > >
 > > I strongly suggest you read the documentation on BackupPC before
 > > making wildly erroneous assumptions about chains. You can also look at
 > > the code in BackupPC_refCountUpdate which defines how $fileCntRep and
 > > $fileCntRepMax are calculated.
 > >
 > > Also, if what you said were true, the OP would have multiple chains -
 > > presumably one for each distinct file that is "scattered around"
 > >
 > > If you are using v4.x and have pool hashing with such collisions, it
 > > would be great to see them. I suspect you are either using v3 or you
 > > are using v4 with a legacy v3 pool
 > >
 > >   > > You would have to work hard to artificially create such collisions.
 > >   >
 > >   > $ echo 'hello world' > ~/file_a
 > >   > $ cp ~/file_a ~/file_b
 > >   > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo 
 > > "MATCH"
 > >   >
 > >   > __
 > >   > Rob Sheldon
 > >   > Contract software developer, devops, security, technical lead
 > >   >
 > >   >

Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread
Rob Sheldon wrote at about 09:56:51 -0700 on Thursday, March 16, 2023:
 > The mailing list tyrant's reply however made me realize I don't actually 
 > need to be chasing this down just right now, so I won't be following up on 
 > this further.
 >  I was dormant on this list for quite a few years -- which probably 
 > contributed to my misinformation -- and I suddenly have remembered why. :-)

Whatever.

- I help people, but I have high expectations: if I am going to
  invest time in providing answers, then I in turn expect
  those seeking help to invest time and effort in troubleshooting the
  problem in advance and formulating a specific question. This is a
  sysadmin-level tool and I don't get paid to coddle users.

- You have no standards for how questions are formulated and then proceed
  repeatedly to give diametrically wrong advice. Garbage in, garbage
  out.

Similarly,
- I spent significant time today reviewing the code and verifying my
  understanding of BackupPC before answering

- You seemingly spent zero time researching the problem and just cited
  your vague understanding of how BackupPC works based on information
  that you admit is from "quite a few years" ago

Who is a more helpful contributor to the list?

Please put on some big boy pants...


> 
 > __
 > Rob Sheldon
 > Contract software developer, devops, security, technical lead
 > 
 > 


Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Giannis Economou
In my v4 pool are collisions still generating _0, _1, _2 etc filenames 
in the pool/ ?


(as in the example from the docs mentions:
    __TOPDIR__/pool/1/2/3/123456789abcdef0
    __TOPDIR__/pool/1/2/3/123456789abcdef0_0
    __TOPDIR__/pool/1/2/3/123456789abcdef0_1
)

I am using compression (I only have cpool/ dir) and I am asking because 
on both servers running:

        find cpool/ -name "*_0" -print
        find cpool/ -name "*_*" -print

brings zero results.


Thank you.


On 16/3/2023 6:30 p.m., backu...@kosowsky.org wrote:

Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
  > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
  > >
  > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
  > > > There is no reason to be concerned. This is normal.
  > >
  > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
  > > md5sum collision -- as in 1.47*10^-29
  >
  > Why are you assuming this is "randomly" happening? Any time an identical 
file exists in more than one place on the client filesystem, there will be a collision. This 
is common in lots of cases. Desktop environments frequently have duplicated files scattered 
around. I used BackupPC for website backups; my chain length was approximately equal to the 
number of WordPress sites I was hosting.

You are simply not understanding how file de-duplication and pool
chains work in v4.

Identical files contribute only a single chain instance -- no matter
how many clients you are backing up and no matter how many backups you
save of each client. This is what de-duplication does.

The fact that they appear on different clients and/or in different
parts of the filesystem is reflected in the attrib files in the pc
subdirectories for each client. This is where the metadata is stored.

Chain lengths have to do with pool storage of the file contents
(ignoring metadata). Lengths greater than 1 only occur if you have
md5sum hash collisions -- i.e., two files (no matter on what client or
where in the filesystem) with non-identical contents but the same
md5sum hash.

Such collisions are statistically exceedingly unlikely to occur on
normal data where you haven't worked hard to create such collisions.

For example, on my backup server:
Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 
directories (as of 2023-03-16 01:11),
Pool hashing gives 0+0 repeated files with longest chain 0+0,

I strongly suggest you read the documentation on BackupPC before
making wildly erroneous assumptions about chains. You can also look at
the code in BackupPC_refCountUpdate which defines how $fileCntRep and
$fileCntRepMax are calculated.

Also, if what you said were true, the OP would have multiple chains -
presumably one for each distinct file that is "scattered around"

If you are using v4.x and have pool hashing with such collisions, it
would be great to see them. I suspect you are either using v3 or you
are using v4 with a legacy v3 pool

  > > You would have to work hard to artificially create such collisions.
  >
  > $ echo 'hello world' > ~/file_a
  > $ cp ~/file_a ~/file_b
  > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo 
"MATCH"
  >
  > __
  > Rob Sheldon
  > Contract software developer, devops, security, technical lead
  >
  >


Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Rob Sheldon
On Thu, Mar 16, 2023, at 8:59 AM, Les Mikesell wrote:
> On Thu, Mar 16, 2023 at 10:53 AM Rob Sheldon  wrote:
> >
> > Why are you assuming this is "randomly" happening? Any time an identical 
> > file exists in more than one place on the client filesystem, there will be 
> > a collision. This is common in lots of cases. Desktop environments 
> > frequently have duplicated files scattered around. I used BackupPC for 
> > website backups; my chain length was approximately equal to the number of 
> > WordPress sites I was hosting.
> >
> Identical files are not collisions to backuppc - they are de-duplicated.

Hey Les,

Thanks for the heads-up. Your message prompted me to pull the current codebase 
and do some grepping around in it to better understand this. I believed that 
the MD5 hash was the mechanism used for deduplication, but that could have been 
wrong at any point, V3 or otherwise.

The mailing list tyrant's reply however made me realize I don't actually need 
to be chasing this down just right now, so I won't be following up on this 
further. I was dormant on this list for quite a few years -- which probably 
contributed to my misinformation -- and I suddenly have remembered why. :-)

__
Rob Sheldon
Contract software developer, devops, security, technical lead




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread
Rob Sheldon wrote at about 08:31:17 -0700 on Thursday, March 16, 2023:
 > On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
 > > 
 > > Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
 > > > There is no reason to be concerned. This is normal.
 > > 
 > > It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
 > > md5sum collision -- as in 1.47*10^-29
 > 
 > Why are you assuming this is "randomly" happening? Any time an identical 
 > file exists in more than one place on the client filesystem, there will be a 
 > collision. This is common in lots of cases. Desktop environments frequently 
 > have duplicated files scattered around. I used BackupPC for website backups; 
 > my chain length was approximately equal to the number of WordPress sites I 
 > was hosting.

You are simply not understanding how file de-duplication and pool
chains work in v4.

Identical files contribute only a single chain instance -- no matter
how many clients you are backing up and no matter how many backups you
save of each client. This is what de-duplication does.

The fact that they appear on different clients and/or in different
parts of the filesystem is reflected in the attrib files in the pc
subdirectories for each client. This is where the metadata is stored.

Chain lengths have to do with pool storage of the file contents
(ignoring metadata). Lengths greater than 1 only occur if you have
md5sum hash collisions -- i.e., two files (no matter on what client or
where in the filesystem) with non-identical contents but the same
md5sum hash.

Such collisions are statistically exceedingly unlikely to occur on
normal data where you haven't worked hard to create such collisions.

For example, on my backup server:
Pool is 841.52+0.00GiB comprising 7395292+0 files and 16512+1 
directories (as of 2023-03-16 01:11),
Pool hashing gives 0+0 repeated files with longest chain 0+0,

I strongly suggest you read the documentation on BackupPC before
making wildly erroneous assumptions about chains. You can also look at
the code in BackupPC_refCountUpdate which defines how $fileCntRep and
$fileCntRepMax are calculated.
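
(For the curious, that code is easy to locate -- the path here assumes a 
typical source install of BackupPC; adjust for your distribution:)

    # find where the repeated-file and longest-chain counters are computed
    grep -n 'fileCntRep' /usr/local/BackupPC/bin/BackupPC_refCountUpdate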

Also, if what you said were true, the OP would have multiple chains -
presumably one for each distinct file that is "scattered around"

If you are using v4.x and have pool hashing with such collisions, it
would be great to see them. I suspect you are either using v3 or you
are using v4 with a legacy v3 pool

 > > You would have to work hard to artificially create such collisions.
 > 
 > $ echo 'hello world' > ~/file_a
 > $ cp ~/file_a ~/file_b
 > $ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH"
 > 
 > __
 > Rob Sheldon
 > Contract software developer, devops, security, technical lead
 > 
 > 


Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Les Mikesell
On Thu, Mar 16, 2023 at 10:53 AM Rob Sheldon  wrote:
>
> Why are you assuming this is "randomly" happening? Any time an identical file 
> exists in more than one place on the client filesystem, there will be a 
> collision. This is common in lots of cases. Desktop environments frequently 
> have duplicated files scattered around. I used BackupPC for website backups; 
> my chain length was approximately equal to the number of WordPress sites I 
> was hosting.
>
Identical files are not collisions to backuppc - they are de-duplicated.

-- 
   Les Mikesell
 lesmikes...@gmail.com




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Rob Sheldon
On Thu, Mar 16, 2023, at 7:43 AM, backu...@kosowsky.org wrote:
> 
> Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
> > There is no reason to be concerned. This is normal.
> 
> It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
> md5sum collision -- as in 1.47*10^-29

Why are you assuming this is "randomly" happening? Any time an identical file 
exists in more than one place on the client filesystem, there will be a 
collision. This is common in lots of cases. Desktop environments frequently 
have duplicated files scattered around. I used BackupPC for website backups; my 
chain length was approximately equal to the number of WordPress sites I was 
hosting.

> You would have to work hard to artificially create such collisions.

$ echo 'hello world' > ~/file_a
$ cp ~/file_a ~/file_b
$ [ "$(cat ~/file_a | md5sum)" = "$(cat ~/file_b | md5sum)" ] && echo "MATCH"

__
Rob Sheldon
Contract software developer, devops, security, technical lead




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread backuppc


Rob Sheldon wrote at about 23:54:51 -0700 on Wednesday, March 15, 2023:
 > There is no reason to be concerned. This is normal.
Not really!
The OP claimed he is using v4 which uses full file md5sum checksums as
the file name hash.

It *should* be extremely, once-in-a-blue-moon, rare to randomly have an
md5sum collision -- as in 1.47*10^-29

Randomly having 165 hash collisions is several orders of magnitude
less likely still.
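
(For a rough sense of scale -- a back-of-the-envelope sketch using the 
birthday approximation p ~ n^2 / 2^129, with n taken as ~5*10^7 files, 
roughly the larger of the two pools discussed in this thread; the exact 
figure depends on the assumptions, but it stays vanishingly small:)

    # probability of at least one random md5 collision among ~5.1e7 files
    awk 'BEGIN { n = 5.1e7; printf "%.2e\n", n * n / 2^129 }'
    # prints roughly 3.8e-24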

You would have to work hard to artificially create such collisions.

 > Searching for "backuppc pool hashing chain" finds 
 > https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. 
 > The documentation could perhaps be more specific about this, but BackupPC 
 > "chains" hash collisions, so no data is lost or damaged. This is a pretty 
 > standard compsci data structure. If you search the BackupPC repository for 
 > the message you're seeing, you find 
 > https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311,
 >  which gives you a couple of variable names that you can use to search the 
 > codebase and satisfy your curiosity.
 >

This OLD (ca 2008) reference is totally IRRELEVANT for v4.x.
Hash collisions on v3.x were EXTREMELY common and it was not unusual to
have even long chains of collisions since the md5sum was computed
using only the first and last 128KB (I think) of the file plus the
file length. So any long file whose middle section changed by even
just one byte would create a collision chain.

 > "165 repeated files with longest chain 1" just means that there are 165 
 > different files that had a digest (hash) that matched another file. This can 
 > happen for a number of different reasons. A common one is that there are 
 > identical files on the backuppc client system that have different paths.
 > 
 > I've been running BackupPC for a *long* time and have always had chained 
 > hashes in my server status. It's an informational message, not an error.
 > 

Are you running v3 or v4?
You are right, though, that it's technically not an error even in v4, in that
md5sum collisions do of course exist -- it's just that they are extremely rare.

It would REALLY REALLY REALLY help if the OP would do some WORK to
help HIMSELF diagnose the problem.

Since it seems like only a single md5sum chain is involved, it would
seem blindingly obvious that the first step in troubleshooting would
be to determine what is the nature of these files with the same md5sum
hash.

Specifically,
- Do they indeed all have identical md5sum?
- Are they indeed distinct files (despite having the same md5sum)?
- How if at all are these files special?
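
(A sketch of how the first two checks could go -- BackupPC_zcat 
decompresses a cpool file to stdout; the paths and file names below are 
placeholders, the bin directory depends on how BackupPC was installed, 
and the cmp line uses bash process substitution:)

    # f1, f2: two members of the same chain, as reported by the earlier find command
    f1=/var/lib/backuppc/cpool/XX/YY/123456789abcdef0123456789abcdef0
    f2=/var/lib/backuppc/cpool/XX/YY/123456789abcdef0123456789abcdef001
    bin=/usr/local/BackupPC/bin
    $bin/BackupPC_zcat "$f1" | md5sum     # do the decompressed contents
    $bin/BackupPC_zcat "$f2" | md5sum     # really share one md5sum?
    cmp <($bin/BackupPC_zcat "$f1") <($bin/BackupPC_zcat "$f2")   # byte-identical or not?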

 > On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote:
 > > The message appears in "Server Information" section, inside server 
 > > status, in the web interface of BackupPC (page "BackupPC Server Status").
 > > 
 > > Both servers are running BackupPC version 4.4.0.
 > > 1st server says:  "Pool hashing gives 165 repeated files with longest 
 > > chain 1"
 > > 2nd server says:  "Pool hashing gives 5 repeated files with longest chain 
 > > 1"
 > > 
 > > I could not find more info/details about this message in the documentation.
 > > Only a comment in github is mentioning this message, and to my 
 > > understanding it is related to hash collisions in the pool.
 > > 
 > > Since hash collisions might be worrying (in theory), my question is if 
 > > this is alarming and in that case if there is something that we can do 
 > > about it (for example actions to zero out any collisions if needed).
 > > 
 > > 
 > > Thank you very much.
 > 
 > __
 > Rob Sheldon
 > Contract software developer, devops, security, technical lead
 > 
 > 


Re: [BackupPC-users] Files in backup with wrong uid/gid

2023-03-16 Thread backuppc
Good detective work and glad you figured it out!

Looking back, it seems like v3 didn't require '--super'.
Presumably, since v3 used a custom/hacked Perl version of rsync on the
server side, Craig had coded the functionality of '--super' into the
Perl code.

David Heap via BackupPC-users wrote at about 09:29:50 + on Thursday, March 
16, 2023:
 > Resolved, thank you to everyone who replied.
 > 
 > > What transport protocol are you using?
 > > Can you reproduce this when the file is moved elsewhere?
 > 
 > These pointed me in the right direction, thanks - I copied the folder to a 
 > new location and all the files were backed up as root:root. Having a further 
 > look around this and other backups it appeared that anything with a ctime 
 > after we upgraded from BackupPC v3 to v4 a while back wasn't getting the 
 > correct ownership. Anything with a ctime before the upgrade was fine, even 
 > if the contents/mtime was updated afterwards.
 > 
 > It looks like the rsync arguments we used on v3 did not include --super for 
 > some reason, which is required for --owner to work. I think we let BackupPC 
 > upgrade our config file but we don't have the upgrade logs any longer to 
 > confirm. It looks like it was never added to the command for v4 so rsync has 
 > been unable to transfer ownership info for new files. I'm not sure how it 
 > worked on v3, but I'm just glad it's working now that we've added --super to 
 > the arguments.
 > 
 > Initial full-server restore tests after the upgrade worked fine as all the 
 > service files existed pre-upgrade and kept the correct ownership. It's taken 
 > a while for a new file to be involved enough in booting/service startup to 
 > flag up the issue.
 > 
 > Thanks,
 > David
 > 
 > 
 > 
 > 


Re: [BackupPC-users] Files in backup with wrong uid/gid

2023-03-16 Thread David Heap via BackupPC-users
Resolved, thank you to everyone who replied.

> What transport protocol are you using?
> Can you reproduce this when the file is moved elsewhere?

These pointed me in the right direction, thanks - I copied the folder to a new 
location and all the files were backed up as root:root. Having a further look 
around this and other backups it appeared that anything with a ctime after we 
upgraded from BackupPC v3 to v4 a while back wasn't getting the correct 
ownership. Anything with a ctime before the upgrade was fine, even if the 
contents/mtime was updated afterwards.

It looks like the rsync arguments we used on v3 did not include --super for 
some reason, which is required for --owner to work. I think we let BackupPC 
upgrade our config file but we don't have the upgrade logs any longer to 
confirm. It looks like it was never added to the command for v4 so rsync has 
been unable to transfer ownership info for new files. I'm not sure how it 
worked on v3, but I'm just glad it's working now that we've added --super to 
the arguments.
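
(In case it helps anyone else hitting this: plain rsync shows the same 
behaviour -- an illustrative sketch only, not the exact rsync_bpc command 
line BackupPC builds. Per the rsync man page, ownership is only applied by 
a root receiver unless --super or --fake-super is given:)

    # receiving side running as a non-root user:
    rsync -a         src/ user@host:dest/   # -a implies -o/-g, but uid/gid are silently dropped
    rsync -a --super src/ user@host:dest/   # receiver attempts to preserve uid/gid anyway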

Initial full-server restore tests after the upgrade worked fine as all the 
service files existed pre-upgrade and kept the correct ownership. It's taken a 
while for a new file to be involved enough in booting/service startup to flag 
up the issue.

Thanks,
David






Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Giannis Economou

All is clear now.
Thank you so much for your detailed reply.


On 16/3/2023 8:54 a.m., Rob Sheldon wrote:

There is no reason to be concerned. This is normal.

Searching for "backuppc pool hashing chain" finds 
https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. The documentation could 
perhaps be more specific about this, but BackupPC "chains" hash collisions, so no data is 
lost or damaged. This is a pretty standard compsci data structure. If you search the BackupPC 
repository for the message you're seeing, you find 
https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311,
 which gives you a couple of variable names that you can use to search the codebase and satisfy 
your curiosity.

"165 repeated files with longest chain 1" just means that there are 165 
different files that had a digest (hash) that matched another file. This can happen for a 
number of different reasons. A common one is that there are identical files on the 
backuppc client system that have different paths.

I've been running BackupPC for a *long* time and have always had chained hashes 
in my server status. It's an informational message, not an error.

On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote:

The message appears in "Server Information" section, inside server
status, in the web interface of BackupPC (page "BackupPC Server Status").

Both servers are running BackupPC version 4.4.0.
1st server says:  "Pool hashing gives 165 repeated files with longest
chain 1"
2nd server says:  "Pool hashing gives 5 repeated files with longest chain 1"

I could not find more info/details about this message in the documentation.
Only a comment in github is mentioning this message, and to my
understanding it is related to hash collisions in the pool.

Since hash collisions might be worrying (in theory), my question is if
this is alarming and in that case if there is something that we can do
about it (for example actions to zero out any collisions if needed).


Thank you very much.

__
Rob Sheldon
Contract software developer, devops, security, technical lead




Re: [BackupPC-users] Pool hashing gives XXX repeated files with longest chain 1

2023-03-16 Thread Rob Sheldon
There is no reason to be concerned. This is normal.

Searching for "backuppc pool hashing chain" finds 
https://sourceforge.net/p/backuppc/mailman/message/20671583/ for example. The 
documentation could perhaps be more specific about this, but BackupPC "chains" 
hash collisions, so no data is lost or damaged. This is a pretty standard 
compsci data structure. If you search the BackupPC repository for the message 
you're seeing, you find 
https://github.com/backuppc/backuppc/blob/43c91d83d06b4af3760400365da62e1fd5ee584e/lib/BackupPC/Lang/en.pm#L311,
 which gives you a couple of variable names that you can use to search the 
codebase and satisfy your curiosity.

"165 repeated files with longest chain 1" just means that there are 165 
different files that had a digest (hash) that matched another file. This can 
happen for a number of different reasons. A common one is that there are 
identical files on the backuppc client system that have different paths.

I've been running BackupPC for a *long* time and have always had chained hashes 
in my server status. It's an informational message, not an error.

On Tue, Mar 14, 2023, at 12:41 PM, Giannis Economou wrote:
> The message appears in "Server Information" section, inside server 
> status, in the web interface of BackupPC (page "BackupPC Server Status").
> 
> Both servers are running BackupPC version 4.4.0.
> 1st server says:  "Pool hashing gives 165 repeated files with longest 
> chain 1"
> 2nd server says:  "Pool hashing gives 5 repeated files with longest chain 1"
> 
> I could not find more info/details about this message in the documentation.
> Only a comment in github is mentioning this message, and to my 
> understanding it is related to hash collisions in the pool.
> 
> Since hash collisions might be worrying (in theory), my question is if 
> this is alarming and in that case if there is something that we can do 
> about it (for example actions to zero out any collisions if needed).
> 
> 
> Thank you very much.

__
Rob Sheldon
Contract software developer, devops, security, technical lead


___
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:https://github.com/backuppc/backuppc/wiki
Project: https://backuppc.github.io/backuppc/