Jason,

We liked your fingerprinting idea. We expanded it a bit:

    $fingerprints{alternate} = join("\t",
        $marc{item_form}, $marc{date1}, $marc{record_type},
        $marc{bib_lvl}, $marc{title}, $marc{subtitle}.$marc{subtitlep},
        $marc{author} ? $marc{author} : '',
        $marc{audioformat}, $marc{videoformat}, $marc{pubyear},
        $marc{normalizedisbns}
    );

Each of these has been "normalized":
$marc{title}, $marc{subtitle}.$marc{subtitlep}, $marc{author}

The ISBNs are heavily normalized. 13-digit ISBNs are stripped of the first three characters (978) and the last character. 10-digit ISBNs are stripped of the last character. Then the whole lot is deduped and sorted.
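
A rough sketch of that ISBN normalization, for illustration only. The last character of an ISBN is its check digit, so after stripping, the 10- and 13-digit forms of the same ISBN reduce to the same nine digits. The sub name normalize_isbns, the hyphen handling, the skipping of odd-length values, and the comma used to join the result are all assumptions, not our production code.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce raw ISBN strings to a sorted, deduplicated
# list of nine-digit cores, following the rules described above.
sub normalize_isbns {
    my @raw = @_;
    my %seen;
    for my $isbn (@raw) {
        # Assumption: take the leading run of digits, X and hyphens,
        # ignoring any trailing qualifier such as "(pbk.)".
        next unless $isbn =~ /^\s*([0-9Xx-]+)/;
        (my $clean = uc $1) =~ s/-//g;

        my $core;
        if (length($clean) == 13) {
            # Strip the first three characters (978) and the last character.
            $core = substr($clean, 3, 9);
        }
        elsif (length($clean) == 10) {
            # Strip the last character.
            $core = substr($clean, 0, 9);
        }
        else {
            next;    # Assumption: skip anything that is not 10 or 13 long.
        }
        $seen{$core} = 1;    # hash keys dedupe for us
    }
    # Assumption: the cores are joined with commas into a single string.
    return join(',', sort keys %seen);
}

print normalize_isbns('978-0-306-40615-7', '0306406152', '0-19-852663-6 (pbk.)'), "\n";
```

With those three inputs the first two collapse to the same core, so the sub returns "019852663,030640615".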

On top of the fingerprinting, we changed the way the quality scoring works.
We ended up with this scoring algorithm:

1. Count the number of subfields in the 245. Give 100 points each for a maximum of 400 points
2. Count the number of characters in the 100. Assign 1 point for each character for a maximum of 150 points
3. Count the number of characters in the 110. Assign 1 point for each character for a maximum of 150 points
4. Count the number of 6XX fields. Assign 50 points to each one for a maximum of 200 points
5. Count the number of 02X fields. Assign 50 points to each one for a maximum of 100 points
6. Count the number of 246 fields. Assign 100 points to each one for a maximum of 200 points
7. Count the number of 130 fields. Assign 100 points to each one for a maximum of 100 points
8. Count the number of 010 fields. Assign 100 points to each one for a maximum of 100 points
9. Count the number of 490 fields. Assign 100 points to each one for a maximum of 200 points
10. Count the number of 830 fields. Assign 10 points to each one for a maximum of 50 points
11. Count the number of characters in the 300. Assign .5 points for each character for a maximum of 50 points
12. Count the number of 7XX fields. Assign 1 point to each one for a maximum of 100 points
13. Count the number of subfields in the 50X. Give 2 points each for a maximum of 100 points
14. Count the number of subfields in the 52X. Give 2 points each for a maximum of 100 points
15. Count the number of subfields in the 51X,53X,54X,55X,56X,57X,58X. Give .5 points each for a maximum of 500 points

Add the scores together and we have the "quality" of the MARC record. The higher-quality record wins.
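
The scoring rules can be sketched as a small table-driven sub. This is an illustration, not our actual code: the record layout (an array of { tag, subfields } hashes), the names @RULES and quality_score, and the reading of "characters in the 100" as the combined length of that field's subfield values are all assumptions.

```perl
use strict;
use warnings;
use List::Util qw(min);

# One row per rule: [ tag pattern, unit counted, points per unit, cap ].
my @RULES = (
    [ qr/^245$/,       'subfields', 100, 400 ],
    [ qr/^100$/,       'chars',     1,   150 ],
    [ qr/^110$/,       'chars',     1,   150 ],
    [ qr/^6\d\d$/,     'fields',    50,  200 ],
    [ qr/^02\d$/,      'fields',    50,  100 ],
    [ qr/^246$/,       'fields',    100, 200 ],
    [ qr/^130$/,       'fields',    100, 100 ],
    [ qr/^010$/,       'fields',    100, 100 ],
    [ qr/^490$/,       'fields',    100, 200 ],
    [ qr/^830$/,       'fields',    10,  50  ],
    [ qr/^300$/,       'chars',     0.5, 50  ],
    [ qr/^7\d\d$/,     'fields',    1,   100 ],
    [ qr/^50\d$/,      'subfields', 2,   100 ],
    [ qr/^52\d$/,      'subfields', 2,   100 ],
    [ qr/^5[13-8]\d$/, 'subfields', 0.5, 500 ],
);

# Hypothetical record layout: an arrayref of
# { tag => '245', subfields => [ [ code, value ], ... ] }.
sub quality_score {
    my ($fields) = @_;
    my $total = 0;
    for my $rule (@RULES) {
        my ($tag_re, $unit, $points, $cap) = @$rule;
        my $count = 0;
        for my $f (grep { $_->{tag} =~ $tag_re } @$fields) {
            if    ($unit eq 'fields')    { $count++ }
            elsif ($unit eq 'subfields') { $count += @{ $f->{subfields} } }
            else  { $count += length $_->[1] for @{ $f->{subfields} } }
        }
        $total += min($count * $points, $cap);    # apply the per-rule maximum
    }
    return $total;
}

my $example = [
    { tag => '245', subfields => [ [ a => 'Dune' ], [ c => 'Frank Herbert.' ] ] },
    { tag => '100', subfields => [ [ a => 'Herbert, Frank.' ] ] },
    { tag => '650', subfields => [ [ a => 'Science fiction.' ] ] },
    { tag => '020', subfields => [ [ a => '0441172717' ] ] },
    { tag => '300', subfields => [ [ a => '412 p.' ] ] },
);
print quality_score($example), "\n";
```

For the made-up record above the score is 200 (245) + 15 (100) + 50 (650) + 50 (020) + 3 (300) = 318.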

This approach allowed us to dedupe almost 18% of our bibs in the catalog!
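
As a rough picture of how the two pieces fit together: bibs that share a fingerprint form a group, and the highest-scoring bib in each group wins. The sketch below assumes each bib already carries a precomputed fingerprint and score; the sub name pick_winners, the field names, the sample values, and the lowest-id tie-break are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: one hashref per bib with id, fingerprint and score.
# Returns { winner_id => [ ids of the bibs it would absorb ] }.
sub pick_winners {
    my @bibs = @_;

    # Group bibs by identical fingerprint string.
    my %groups;
    push @{ $groups{ $_->{fingerprint} } }, $_ for @bibs;

    my %result;
    for my $fp (keys %groups) {
        # Highest score first; ties go to the lowest id (an assumption).
        my ($winner, @losers) =
            sort { $b->{score} <=> $a->{score} || $a->{id} <=> $b->{id} }
            @{ $groups{$fp} };
        next unless @losers;    # a group of one has nothing to dedupe
        $result{ $winner->{id} } = [ map { $_->{id} } @losers ];
    }
    return \%result;
}

my $merges = pick_winners(
    { id => 1, fingerprint => "a\t2001\tdune", score => 318 },
    { id => 2, fingerprint => "a\t2001\tdune", score => 540 },
    { id => 3, fingerprint => "a\t1999\temma", score => 410 },
);
# Bib 2 outscores bib 1 under the same fingerprint; bib 3 has no duplicate.
print "$_ <- @{ $merges->{$_} }\n" for sort keys %$merges;
```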


-Blake-
Conducting Magic
MOBIUS

On 4/26/2016 1:40 PM, Jason Etheridge wrote:
For what it's worth, this is the fairly conservative algorithm used by
the default fingerprinter in the migration-tools repository:

https://docs.google.com/document/d/1tvuA0Os3W0B2Fl_GvO_Z6ZG6ZHecg8JtTRMz3QUktK8/edit?usp=sharing

Comments welcome.

