Hey all, I've got a large amount of data that I'm trying to import into perkeep from an old home-grown backup system that used hard links to de-duplicate files based on hashes. As a consequence, pk put is moving much more slowly than I think is achievable here: it doesn't know about the dedup scheme, so it rescans each file on every pass through each backup. Doing some back-of-the-napkin calculations, at the rate it's going it will take way too long to be realistic. It got through half of the data (by disk usage) pretty quickly, then slowed to a crawl.
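Since the old backup de-duplicated with hard links, every duplicate of a file shares a (device, inode) pair, so a scanner only needs to visit each unique inode once. A minimal Linux/Go sketch of that idea — the names (`inodeKey`, `statKey`, `uniqueByKey`) are mine for illustration, not anything from perkeep:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// inodeKey identifies a file's underlying storage: hard links to the
// same file share the same (device, inode) pair.
type inodeKey struct {
	dev uint64
	ino uint64
}

// statKey returns the inode key for a path (Linux-specific: relies on
// the FileInfo.Sys() value being a *syscall.Stat_t).
func statKey(path string) (inodeKey, bool) {
	fi, err := os.Stat(path)
	if err != nil {
		return inodeKey{}, false
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return inodeKey{}, false
	}
	return inodeKey{dev: uint64(st.Dev), ino: uint64(st.Ino)}, true
}

// uniqueByKey keeps one representative path per inode key, so
// hard-linked duplicates are scanned only once.
func uniqueByKey(paths []string, keys []inodeKey) []string {
	seen := make(map[inodeKey]bool)
	var out []string
	for i, p := range paths {
		if seen[keys[i]] {
			continue
		}
		seen[keys[i]] = true
		out = append(out, p)
	}
	return out
}

func main() {
	// Demo: a file and a hard link to it collapse to one entry.
	dir, err := os.MkdirTemp("", "dedup")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	a, b := dir+"/a", dir+"/b"
	os.WriteFile(a, []byte("data"), 0o644)
	os.Link(a, b) // hard link: same inode as a
	paths := []string{a, b}
	var keys []inodeKey
	for _, p := range paths {
		k, _ := statKey(p)
		keys = append(keys, k)
	}
	fmt.Println(len(uniqueByKey(paths, keys))) // prints 1
}
```

The same map could also be persisted between runs to remember which inodes were already uploaded in earlier backups.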
I think I can speed things up dramatically by writing a custom tool that looks at inodes to determine whether a file is already present, but I have a concern: JSON doesn't have a canonical form, so I worry that if I do this naively, perkeep will fail to use the same bit-for-bit representation for all of the various schema blobs (varying in whitespace, for example), thus unnecessarily duplicating content. My question is: what would I need to do to make sure that doesn't happen, i.e. so that my one-off tool ends up picking the same formatting and such as pk put?

Thanks,
-Ian

--
You received this message because you are subscribed to the Google Groups "Perkeep" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/156028539437.1203.10628641923295501163%40localhost.localdomain.
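To make the canonical-form worry concrete: since blobs are content-addressed, the same logical value encoded with different whitespace hashes to a different blobref. A toy Go sketch — `fileClaim`, `encode`, and `blobref` are hypothetical stand-ins for illustration, not perkeep's actual schema package:

```go
package main

import (
	"crypto/sha1"
	"encoding/json"
	"fmt"
)

// fileClaim is a toy stand-in for a schema blob; struct field order
// fixes the key order in the encoded JSON, so encoding is
// deterministic. (Hypothetical — not perkeep's real schema types.)
type fileClaim struct {
	CamliVersion int    `json:"camliVersion"`
	CamliType    string `json:"camliType"`
	FileName     string `json:"fileName"`
}

// encode marshals compactly; the same value always yields the same
// bytes, and therefore the same blobref.
func encode(c fileClaim) ([]byte, error) {
	return json.Marshal(c)
}

// blobref content-addresses a blob's exact bytes.
func blobref(b []byte) string {
	return fmt.Sprintf("sha1-%x", sha1.Sum(b))
}

func main() {
	c := fileClaim{CamliVersion: 1, CamliType: "file", FileName: "a.txt"}
	compact, _ := encode(c)
	indented, _ := json.MarshalIndent(c, "", "  ")
	// Same logical value, different whitespace, different blobref:
	fmt.Println(blobref(compact) == blobref(indented)) // prints false
}
```

So a one-off tool would need to reproduce pk put's exact serialization, byte for byte, not merely emit semantically equivalent JSON.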
