Hi, 

Our process isn't very polished, but right now it's:

-run brunnhilde on the files to get reports; in this case we want the 
duplicates.csv report

-if requested, the archivist does a manual review of that CSV to decide which 
copy of the file should be retained. The copy to be retained must be the first 
one in the list (i.e. if there are 4 copies of a file listed in duplicates.csv, 
the first one will be kept). I can see, though, how that would be cumbersome if 
you had hundreds to correct/move. It could likely be scripted if they all 
followed a similar pattern (e.g. were all under the same path); see the first 
sketch after this list.

-we run a little bash script that reads through that CSV line by line and looks 
at the checksum: if the checksum does not match the previous checksum in the 
list, it continues; if it does match the previous checksum, the script moves 
that file to a Duplicates directory and logs the move, e.g. "MOVED 
<originalFilePath> to <directory>". The second sketch after this list shows 
roughly how it works.

And, like others have said, none of this captures close duplicates.
