Thanks for the interesting summary, Tom.

This is only part of the data. It shows only the imports made during the first 
1 or 2 years of the project. For recent imports, we've been using the source 
records field. Combining the two would give more accurate results.

Also, one thing to remember is that there could be repetitions. There is a good 
chance that 2 records from different sources were mapped to the same edition.
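
As a rough illustration of combining the two and spotting the repetitions, 
here is a minimal sketch. It assumes each edition is a dict, that the older 
import source lives in a hypothetical "legacy_source" key, and that 
"source_records" entries look like "source:identifier". The sample data is 
made up and the real dump layout may differ.

```python
from collections import Counter


def count_sources(editions):
    """Count editions per source and editions that have more than one source.

    Assumed (hypothetical) layout: "legacy_source" holds the source name from
    the early imports; "source_records" holds "source:identifier" strings.
    """
    per_source = Counter()
    multi_source = 0
    for edition in editions:
        sources = set()
        legacy = edition.get("legacy_source")
        if legacy:
            sources.add(legacy)
        for record in edition.get("source_records", []):
            # Keep only the part before the first colon as the source name.
            sources.add(record.split(":", 1)[0])
        # Using a set means each source counts at most once per edition.
        per_source.update(sources)
        if len(sources) > 1:
            multi_source += 1
    return per_source, multi_source


# Made-up sample data, for illustration only.
editions = [
    {"key": "/books/OL1M", "legacy_source": "marc_loc_updates"},
    {"key": "/books/OL2M", "source_records": ["amazon:0123456789"]},
    {
        "key": "/books/OL3M",
        "legacy_source": "amazon",
        "source_records": ["ia:examplebook", "amazon:0123456789"],
    },
]

per_source, multi_source = count_sources(editions)
for source, count in per_source.most_common():
    print(source, count)
print("editions with more than one source:", multi_source)
```

Because an edition with several sources is counted once under each of them, 
the per-source counts can add up to more than the number of editions.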

Anand

On 08-Mar-2013, at 10:31 AM, Tom Morris wrote:

> Here's a quick analysis of the dump that Anand kindly made available:
> 
> %     Cum %   # Records   Source                                 Notes
> 31%   31%     7,029,035   marc_records_scriblio_net              Library of Congress records via Scriblio and Plymouth State University
> 27%   58%     6,182,687   amazon                                 Amazon web crawl
> 13%   72%     3,021,901   talis_openlibrary_contribution         Talis contribution
> 11%   83%     2,536,583   marc_university_of_toronto
> 4%    87%     889,009     marc_oregon_summit_records             Consortium of several libraries
> 3%    89%     605,925     marc_miami_univ_ohio
> 2%    92%     533,279     marc_loc_updates                       Library of Congress update service 2010-2012
> 2%    94%     465,994     ia                                     Internet Archive scanning projects
> 1%    95%     334,779     marc_laurentian
> 1%    97%     320,925     bcl_marc                               Boston College
> 1%    98%     224,762     marc_western_washington_univ
> 1%    99%     203,740     SanFranPLnn                            San Francisco public libraries
> 1%    99%     138,028     marc_binghamton_univ
> 0%    99%     62,248      hollis_marc
> 0%    100%    48,714      bpl_marc                               Boston Public Library (also contributes through scanning project)
> 0%    100%    31,771      wfm_bk_marc
> 0%    100%    14,635      marc_ithaca_college
> 0%    100%    8,193       marc_cca
> 0%    100%    7,998       CollingswoodLibraryMarcDump10-27-2008
> 0%    100%    5,338       unc_catalog_marc
> 100%          22,665,544  (total)
> 
> 
> Vetting the top 10 sources would get us over 95% and the top 15 would cover 
> 99% of the records.  Of course, by the same token, we could lose millions of 
> records if one of these heavy hitters had a provenance which turned out to be 
> unreliable.
> 
> Tom
> 
> 
> 
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to 
> [email protected]

