Hello, Emily --

As a first pass, you may want to create and record checksums for all the files 
on the hard drive, then examine which checksums are identical.  Those files 
will be bit-for-bit exact copies of each other, and can be safely deduped.

This technique won't catch files whose content is substantially the same except 
for insignificant changes (an embedded date stamp, for example), but it may get 
you some way down the path.
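A rough sketch of that first pass in Python (the function names and layout here are just for illustration, not any particular tool's API):

```python
# Hash every file under a root directory and group paths whose SHA-256
# digests match. Identical digests mean bit-for-bit identical content,
# so any group with two or more paths is a duplicate set.
import hashlib
import os
from collections import defaultdict

def checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_groups(root):
    """Map each digest to the list of files sharing that digest;
    keep only digests seen more than once."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            groups[checksum(full)].append(full)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Running duplicate_groups() over the drive image would give you the candidate sets; the harder part, as Emily notes below, is deciding which copy in each set to keep.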

-- Scott

-- 
Scott Prater
Digital Library Architect
UW Digital Collections Center
University of Wisconsin - Madison

-----Original Message-----
From: Code for Libraries <CODE4LIB@LISTS.CLIR.ORG> On Behalf Of Emily Lavins
Sent: Monday, October 23, 2023 10:12 AM
To: CODE4LIB@LISTS.CLIR.ORG
Subject: [CODE4LIB] Deduping with finesse

Hello Code4Lib,

I received a question about deduping from one of our archivists and I'm 
wondering if anyone has any experience/recommendations for this sort of thing.

In short: We received a hard drive that has massive amounts of duplicates, and 
they are starting the process of deduping and arranging it. They want somewhat 
finer control over which duplicates get retained (currently using FSlint and 
Bitcurator), so they can ensure 'complete sets' of files are retained. But it'd 
be great to not have to manually select *every* dedup preference in FSlint.

For example:
1. There is at least one folder that contains numbered audio tracks. When we 
ran FSlint with its defaults, a few of these got deduped in favor of other 
copies elsewhere in the filesystem, but it would have been preferable to keep 
the set together.
2. Similarly, if there is a directory in which most of the working files were 
originally created together, we'd want to keep that directory's copies intact.
3. We'd also generally prefer to keep the copies that will *not* leave, 
post-dedup, folders containing only a single file scattered throughout the 
directory tree.
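(To make the above concrete, here's the kind of toy heuristic I'm imagining -- 
all names made up, nothing any existing tool provides -- for choosing which 
copy in a duplicate set to keep: prefer the copy sitting in the fullest folder, 
so sets stay together and we don't leave single-file folders behind.)

```python
# Hypothetical heuristic (not an FSlint feature): given the paths of one
# duplicate set, keep the copy whose parent directory holds the most
# files, so grouped sets survive and near-empty folders get deduped away.
import os

def pick_copy_to_keep(duplicate_paths):
    """Return the duplicate whose folder contains the most files."""
    def sibling_count(path):
        parent = os.path.dirname(path)
        return sum(1 for entry in os.scandir(parent) if entry.is_file())
    return max(duplicate_paths, key=sibling_count)
```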

Hopefully some of that makes sense. Has anyone found any helpful workflows for 
streamlining the deduping/arranging process?

All I could come up with is logging all of FSlint's decisions, so that any 
undesirable dedups could be more easily tracked and reversed later, but I 
really just don't know enough about any of this.

Thank you very much for your time and thoughts.

All the best,
Emily


--
Emily Lavins
Associate Systems Librarian
Boston College Libraries
