Here's a quick analysis of the dump that Anand kindly made available:

   %  Cum %   # Records  Source                                 Notes
 31%    31%   7,029,035  marc_records_scriblio_net              Library of Congress records via Scriblio and Plymouth State University
 27%    58%   6,182,687  amazon                                 Amazon web crawl
 13%    72%   3,021,901  talis_openlibrary_contribution         Talis contribution
 11%    83%   2,536,583  marc_university_of_toronto
  4%    87%     889,009  marc_oregon_summit_records             Consortium of several libraries
  3%    89%     605,925  marc_miami_univ_ohio
  2%    92%     533,279  marc_loc_updates                       Library of Congress update service 2010-2012
  2%    94%     465,994  ia                                     Internet Archive scanning projects
  1%    95%     334,779  marc_laurentian
  1%    97%     320,925  bcl_marc                               Boston College
  1%    98%     224,762  marc_western_washington_univ
  1%    99%     203,740  SanFranPLnn                            San Francisco public libraries
  1%    99%     138,028  marc_binghamton_univ
  0%    99%      62,248  hollis_marc
  0%   100%      48,714  bpl_marc                               Boston Public Library (also contributes through scanning project)
  0%   100%      31,771  wfm_bk_marc
  0%   100%      14,635  marc_ithaca_college
  0%   100%       8,193  marc_cca
  0%   100%       7,998  CollingswoodLibraryMarcDump10-27-2008
  0%   100%       5,338  unc_catalog_marc
100%         22,665,544  Total


Vetting the top 10 sources would cover about 97% of the records, and the
top 15 would cover more than 99%.  Of course, by the same token, we could
lose millions of records if the provenance of one of these heavy hitters
turned out to be unreliable.
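
For anyone who wants to recheck the percentages, here is a rough Python
sketch of the arithmetic.  The counts are copied from the table above; the
`counts` dictionary and the `coverage` helper are names I made up for
illustration and are not part of the dump or of any Open Library tooling.

```python
from itertools import accumulate

# Record counts per source, copied by hand from the table above.
counts = {
    "marc_records_scriblio_net": 7_029_035,
    "amazon": 6_182_687,
    "talis_openlibrary_contribution": 3_021_901,
    "marc_university_of_toronto": 2_536_583,
    "marc_oregon_summit_records": 889_009,
    "marc_miami_univ_ohio": 605_925,
    "marc_loc_updates": 533_279,
    "ia": 465_994,
    "marc_laurentian": 334_779,
    "bcl_marc": 320_925,
    "marc_western_washington_univ": 224_762,
    "SanFranPLnn": 203_740,
    "marc_binghamton_univ": 138_028,
    "hollis_marc": 62_248,
    "bpl_marc": 48_714,
    "wfm_bk_marc": 31_771,
    "marc_ithaca_college": 14_635,
    "marc_cca": 8_193,
    "CollingswoodLibraryMarcDump10-27-2008": 7_998,
    "unc_catalog_marc": 5_338,
}


def coverage(counts):
    """Return (source, records, share, cumulative share) rows, largest first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    running = accumulate(n for _, n in ranked)
    return [
        (name, n, n / total, cum / total)
        for (name, n), cum in zip(ranked, running)
    ]


rows = coverage(counts)
for name, n, share, cum in rows:
    # Percentages are rounded to whole numbers, as in the table.
    print(f"{share:4.0%}  {cum:5.0%}  {n:>10,}  {name}")
```

With this, `rows[9][3]` is the cumulative share of the top 10 sources and
`rows[14][3]` is the share of the top 15, which is how the coverage figures
in the paragraph above can be checked.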

Tom