Jason,

We liked your fingerprinting idea and expanded it a bit:

    $fingerprints{alternate} = join("\t", $marc{item_form}, $marc{date1}, $marc{record_type},
        $marc{bib_lvl}, $marc{title}, $marc{subtitle}.$marc{subtitlep},
        $marc{author} ? $marc{author} : '',
        $marc{audioformat}, $marc{videoformat}, $marc{pubyear},
        $marc{normalizedisbns}
    );

Each of these has been "normalized": $marc{title}, $marc{subtitle}.$marc{subtitlep}, $marc{author}
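The message does not say how the text values are normalized, so the following is only a minimal sketch of one common approach. The helper name `normalize_text` and its rules (lowercase, drop ASCII punctuation, collapse whitespace) are assumptions, not the code we actually run.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a title/subtitle/author string to a
# comparison-friendly form. The rules below are assumptions.
sub normalize_text {
    my ($text) = @_;
    return '' unless defined $text;

    $text = lc $text;               # case-insensitive comparison
    $text =~ s/[^a-z0-9\s]//g;      # drop punctuation (ASCII only; this also
                                    # drops accented letters, a known limitation)
    $text =~ s/\s+/ /g;             # collapse runs of whitespace
    $text =~ s/^ | $//g;            # trim leading and trailing space
    return $text;
}

print normalize_text('  The Hobbit, or, There and Back Again /'), "\n";
```

A stricter or looser rule set changes how many records share a fingerprint, so this is the knob to tune if matches look too aggressive or too timid.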

The ISBNs are heavily normalized. 13-digit ISBNs are stripped of the first three characters (978) and the last character. 10-digit ISBNs are stripped of the last character. Then the whole lot is deduped and sorted.
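A rough sketch of that ISBN step, to show why it works: dropping the prefix and the check character leaves the same nine-digit core for the 10- and 13-digit forms of a 978 ISBN. The function name `normalize_isbns`, the comma separator, and the handling of hyphens and trailing qualifiers are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw ISBN strings into one sorted, deduplicated
# string of nine-digit cores.
sub normalize_isbns {
    my @raw = @_;
    my %seen;

    for my $isbn (@raw) {
        # Take the leading run of digits, X, and hyphens; this ignores
        # trailing qualifiers such as "(pbk.)".
        my ($token) = $isbn =~ /^\s*([0-9Xx-]+)/;
        next unless defined $token;
        (my $clean = uc $token) =~ s/-//g;

        my $core;
        if (length($clean) == 13) {
            # Drop the first three characters and the check digit.
            $core = substr($clean, 3, 9);
        }
        elsif (length($clean) == 10) {
            # Drop the check character.
            $core = substr($clean, 0, 9);
        }
        else {
            next;    # assumption: skip anything that is not 10 or 13 long
        }
        $seen{$core} = 1;    # hash keys dedupe the cores
    }

    # Assumption: comma-separated, since the fingerprint itself is tab-joined.
    return join(',', sort keys %seen);
}

print normalize_isbns('978-0-306-40615-7 (pbk.)', '0306406152', '080442957X'), "\n";
```

Sorting matters because the result is compared as a string: two records listing the same ISBNs in a different order should still produce the same fingerprint.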

On top of the fingerprinting, we changed the way the quality scoring works. We came up with this scoring algorithm:

1. Count the number of subfields in the 245. Give 100 points each for a maximum of 400 points.
2. Count the number of characters in the 100. Assign 1 point for each character for a maximum of 150 points.
3. Count the number of characters in the 110. Assign 1 point for each character for a maximum of 150 points.
4. Count the number of 6XX fields. Assign 50 points to each one for a maximum of 200 points.
5. Count the number of 02X fields. Assign 50 points to each one for a maximum of 100 points.
6. Count the number of 246 fields. Assign 100 points to each one for a maximum of 200 points.
7. Count the number of 130 fields. Assign 100 points to each one for a maximum of 100 points.
8. Count the number of 010 fields. Assign 100 points to each one for a maximum of 100 points.
9. Count the number of 490 fields. Assign 100 points to each one for a maximum of 200 points.
10. Count the number of 830 fields. Assign 10 points to each one for a maximum of 50 points.
11. Count the number of characters in the 300. Assign .5 points for each character for a maximum of 50 points.
12. Count the number of 7XX fields. Assign 1 point to each one for a maximum of 100 points.
13. Count the number of subfields in the 50X. Give 2 points each for a maximum of 100 points.
14. Count the number of subfields in the 52X. Give 2 points each for a maximum of 100 points.
15. Count the number of subfields in the 51X, 53X, 54X, 55X, 56X, 57X, 58X. Give .5 points each for a maximum of 500 points.

Add the scores together and we have the "quality" of the MARC. The higher quality wins.
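The fifteen rules above fit a small table-driven scorer. This is a sketch under stated assumptions rather than our production code: the record layout (an array of `{ tag, subfields }` hashes), the names `quality_score` and `best_record`, and reading "characters in the 100" as the total length of that field's subfield values are all hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min);

# One row per rule: tag pattern, what to count, points per unit, cap.
my @RULES = (
    [ qr/^245$/,       'subfields', 100, 400 ],
    [ qr/^100$/,       'chars',     1,   150 ],
    [ qr/^110$/,       'chars',     1,   150 ],
    [ qr/^6\d\d$/,     'fields',    50,  200 ],
    [ qr/^02\d$/,      'fields',    50,  100 ],
    [ qr/^246$/,       'fields',    100, 200 ],
    [ qr/^130$/,       'fields',    100, 100 ],
    [ qr/^010$/,       'fields',    100, 100 ],
    [ qr/^490$/,       'fields',    100, 200 ],
    [ qr/^830$/,       'fields',    10,  50  ],
    [ qr/^300$/,       'chars',     0.5, 50  ],
    [ qr/^7\d\d$/,     'fields',    1,   100 ],
    [ qr/^50\d$/,      'subfields', 2,   100 ],
    [ qr/^52\d$/,      'subfields', 2,   100 ],
    [ qr/^5[13-8]\d$/, 'subfields', 0.5, 500 ],
);

# Count fields, subfields, or subfield-value characters in the matching fields.
sub count_units {
    my ($unit, @hits) = @_;
    return scalar @hits if $unit eq 'fields';
    my $n = 0;
    for my $field (@hits) {
        for my $sf (@{ $field->{subfields} }) {
            $n += $unit eq 'subfields' ? 1 : length($sf->[1]);
        }
    }
    return $n;
}

# Sum the capped contribution of every rule.
sub quality_score {
    my ($fields) = @_;
    my $total = 0;
    for my $rule (@RULES) {
        my ($tag_re, $unit, $points, $max) = @$rule;
        my @hits = grep { $_->{tag} =~ $tag_re } @$fields;
        $total += min(count_units($unit, @hits) * $points, $max);
    }
    return $total;
}

# Pick the highest-scoring record; on a tie the first one seen wins (assumption).
sub best_record {
    my @records = @_;
    my ($best, $best_score);
    for my $rec (@records) {
        my $score = quality_score($rec);
        if (!defined $best_score || $score > $best_score) {
            ($best, $best_score) = ($rec, $score);
        }
    }
    return $best;
}

# Example record in the hypothetical layout.
my $example = [
    { tag => '245', subfields => [ [ 'a', 'Dune' ], [ 'c', 'Frank Herbert' ] ] },
    { tag => '100', subfields => [ [ 'a', 'Herbert, Frank' ] ] },
    { tag => '650', subfields => [ [ 'a', 'Science fiction' ] ] },
];
print quality_score($example), "\n";
```

Keeping the weights in one table makes it easy to adjust a cap or add a field without touching the scoring loop.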

This approach allowed us to dedupe almost 18% of our bibs in the catalog!

-Blake-
Conducting Magic
MOBIUS

On 4/26/2016 1:40 PM, Jason Etheridge wrote:

For what it's worth, this is the fairly conservative algorithm used by the default fingerprinter in the migration-tools repository: https://docs.google.com/document/d/1tvuA0Os3W0B2Fl_GvO_Z6ZG6ZHecg8JtTRMz3QUktK8/edit?usp=sharing Comments welcome.