Here's a quick analysis of the dump that Anand kindly made available:

   %  Cum %   # Records  Source                                 Notes
 31%    31%   7,029,035  marc_records_scriblio_net              Library of Congress records via Scriblio and Plymouth State University
 27%    58%   6,182,687  amazon                                 Amazon web crawl
 13%    72%   3,021,901  talis_openlibrary_contribution         Talis contribution
 11%    83%   2,536,583  marc_university_of_toronto
  4%    87%     889,009  marc_oregon_summit_records             Consortium of several libraries
  3%    89%     605,925  marc_miami_univ_ohio
  2%    92%     533,279  marc_loc_updates                       Library of Congress update service 2010-2012
  2%    94%     465,994  ia                                     Internet Archive scanning projects
  1%    95%     334,779  marc_laurentian
  1%    97%     320,925  bcl_marc                               Boston College
  1%    98%     224,762  marc_western_washington_univ
  1%    99%     203,740  SanFranPLnn                            San Francisco public libraries
  1%    99%     138,028  marc_binghamton_univ
  0%    99%      62,248  hollis_marc
  0%   100%      48,714  bpl_marc                               Boston Public Library (also contributes through scanning project)
  0%   100%      31,771  wfm_bk_marc
  0%   100%      14,635  marc_ithaca_college
  0%   100%       8,193  marc_cca
  0%   100%       7,998  CollingswoodLibraryMarcDump10-27-2008
  0%   100%       5,338  unc_catalog_marc
100%         22,665,544
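In case anyone wants to recompute the percentage columns against a newer dump, here is a minimal sketch in Python. The record counts are copied from the table; the names `COUNTS`, `cumulative_coverage` and `sources_needed` are my own for illustration and are not part of any Open Library tooling.

```python
from itertools import accumulate

# Record counts per source, copied from the table in this message.
COUNTS = [
    ("marc_records_scriblio_net", 7_029_035),
    ("amazon", 6_182_687),
    ("talis_openlibrary_contribution", 3_021_901),
    ("marc_university_of_toronto", 2_536_583),
    ("marc_oregon_summit_records", 889_009),
    ("marc_miami_univ_ohio", 605_925),
    ("marc_loc_updates", 533_279),
    ("ia", 465_994),
    ("marc_laurentian", 334_779),
    ("bcl_marc", 320_925),
    ("marc_western_washington_univ", 224_762),
    ("SanFranPLnn", 203_740),
    ("marc_binghamton_univ", 138_028),
    ("hollis_marc", 62_248),
    ("bpl_marc", 48_714),
    ("wfm_bk_marc", 31_771),
    ("marc_ithaca_college", 14_635),
    ("marc_cca", 8_193),
    ("CollingswoodLibraryMarcDump10-27-2008", 7_998),
    ("unc_catalog_marc", 5_338),
]


def cumulative_coverage(counts):
    """Return (source, share, cumulative share) tuples, largest source first."""
    ordered = sorted(counts, key=lambda item: item[1], reverse=True)
    total = sum(n for _, n in ordered)
    running = accumulate(n for _, n in ordered)
    return [(name, n / total, cum / total)
            for (name, n), cum in zip(ordered, running)]


def sources_needed(counts, target):
    """Smallest number of top sources whose combined share reaches target."""
    for rank, (_, _, cum) in enumerate(cumulative_coverage(counts), start=1):
        if cum >= target:
            return rank
    return len(counts)


if __name__ == "__main__":
    for name, share, cum in cumulative_coverage(COUNTS):
        print(f"{share:4.0%} {cum:5.0%}  {name}")
```

Calling `sources_needed(COUNTS, 0.95)` or `sources_needed(COUNTS, 0.99)` gives the number of top sources required to reach a given share of the records.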
Vetting the top 10 sources would get us over 95% and the top 15 would cover 99% of the records. Of course, by the same token, we could lose millions of records if one of these heavy hitters had a provenance which turned out to be unreliable.

Tom
