Hey all,

I've got a large amount of data that I'm trying to import into perkeep,
from an old home-grown backup system that used hard links to
de-duplicate files based on hashes. As a consequence, pk put is moving
much more slowly than I think is achievable here, since it doesn't know
about the dedup scheme and is scanning each file every time through each
backup. Doing some back-of-the-napkin calculations, at the rate it's
going it's going to take way too long to be realistic. It got
through half of the data (by disk usage) pretty quickly, and then slowed
to a crawl.

I think I can speed things up dramatically by writing a custom tool that
looks at inodes to determine whether a file is already present, but I
have a concern: JSON doesn't have a canonical form, so I worry that if I
do this naively, perkeep will fail to produce the same bit-for-bit
representations for all of the various schema blobs (varying in
whitespace, for example), and thus unnecessarily duplicate content.

My question is: what would I need to do to make sure that doesn't
happen, i.e. that my one-off tool ends up picking the same formatting
and such as pk put?

Thanks,

-Ian

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/156028539437.1203.10628641923295501163%40localhost.localdomain.
For more options, visit https://groups.google.com/d/optout.
