[Dspace-tech] Weird problem with Google Harvesting

George Kozak Thu, 24 Sep 2009 08:25:58 -0700

Hi...

We (at Cornell) have discovered a weird problem with Google harvesting 
and wonder if anyone else has seen this.  One of our collections is 
called Cornell Alumni News and it contains PDF versions of archived 
editions of our Alumni News Magazine (dating back to the 1800s).  Each 
of the PDFs contains scanned images with underlying OCR.  Each item is 
one volume with 12-18 bitstreams (PDFs), each of which represents an issue.


We have found that Google has harvested these issues inconsistently.  
For instance, for one volume we will find that 5 of the bitstreams 
appear in Google but 7 do not. 

Has anyone else seen anything like this?  We are wondering if it may be 
a product of the size of the PDFs.  The ones which weren't harvested 
seem consistently large (around 30MB). 

P.S.  I did go through Google Webmaster Tools, but I couldn't find 
anything that indicated a problem on their end.

-- 
***************************
George Kozak
Digital Library Specialist
Division of Library Information Technologies (DLIT), Digital Media Group
501 Olin Library
Cornell University
607-255-8924
***************************
[email protected]


