[Dspace-devel] [DuraSpace JIRA] (DS-1387) Reports that Google Scholar is sometimes linking to DSpace extracted text (*.pdf.txt) files instead of original PDF

Andrea Schweer (DuraSpace JIRA) Wed, 14 Nov 2012 18:05:07 -0800

    [ https://jira.duraspace.org/browse/DS-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=26873#comment-26873 ]


Andrea Schweer commented on DS-1387:
------------------------------------

I have four public repositories, all on (heavily customised) 1.8.2 XMLUI. OAI 
is enabled for all four. The themes are different but all are based on Mirage 
(including the comment that has the path to the mets.xml file). We run 
customised usage stats that don't expose .pdf.txt links.

Google (not Scholar) indexing of text files:
Unitec Research Bank (http://unitec.researchbank.ac.nz) has no text files 
indexed by Google at all. Otago OUR Archive (http://otago.ourarchive.ac.nz) has 
the license.txt files indexed. AUT Scholarly Commons 
(http://aut.researchgateway.ac.nz) and Waikato Research Commons 
(http://researchcommons.waikato.ac.nz) have license.txt and .pdf.txt indexed. 

Bing indexing of .pdf.txt files:
Same as Google: Bing has .pdf.txt files indexed for the same two repositories 
(AUT and Waikato).

Google Scholar indexing of text files:
Both repositories that have .pdf.txt files indexed in Google have entries in 
Google Scholar that link to the .pdf.txt file in the search result, for example 
http://scholar.google.co.nz/scholar?hl=en&q=Using+Information+and+Communication+Technology+to+Facilitate+Supply+Chain+Management+in+the+New+Zealand+Construction+Industry&btnG=&as_sdt=1%2C5&as_sdtp=
 and 
http://scholar.google.co.nz/scholar?q=Re-introducing+honey+in+the+management+of+wounds+and+ulcers-theory+and+practice&btnG=&hl=en&as_sdt=0%2C5
 (both give the ?sequence= version of the bitstream link).
In both cases, when you follow the "All n versions" link, the repository 
version links to the item splash page rather than to the PDF or the .pdf.txt 
file.
In both cases, the PDF files are well under the 5MB limit that Google Scholar 
mentions in the crawling guidelines: 
http://scholar.google.co.nz/intl/en/scholar/inclusion.html#crawl

I see no access to the mets.xml files on any of the four servers in the last 
month, nor any OAI requests with metadataPrefix=ore. However, when I use the 
"last changed" filters in Google (not Scholar), it looks like they were all 
harvested no later than January/February 2012. As far as I can tell, nothing 
relevant changed at that point. I don't have access logs from back then.
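
For anyone who wants to run the same check on their own logs, here is a 
minimal Python sketch of the test I mean: does a requested URL point at a 
mets.xml file, or is it an OAI request with metadataPrefix=ore? The function 
name and the sample paths are made up for illustration and are not DSpace 
code. Pulling the URL out of each log line is left out because it depends on 
your log format.

```python
from urllib.parse import urlsplit, parse_qs


def is_mets_or_ore_request(url):
    """Return True for a mets.xml request or an OAI request with metadataPrefix=ore.

    `url` is the request target from an access log line (path plus query string).
    """
    parts = urlsplit(url)
    # XMLUI metadata requests end in /mets.xml.
    if parts.path.endswith("/mets.xml"):
        return True
    # OAI requests carry the metadata format in the metadataPrefix parameter.
    query = parse_qs(parts.query)
    return "ore" in query.get("metadataPrefix", [])


# Hypothetical sample requests (handle prefix and item id are placeholders).
samples = [
    "/metadata/handle/123456789/1/mets.xml",
    "/dspace-oai/request?verb=ListRecords&metadataPrefix=ore",
    "/oai/request?verb=ListRecords&metadataPrefix=oai_dc",
]
for sample in samples:
    print(sample, is_mets_or_ore_request(sample))
```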

The two repositories with .pdf.txt indexed have their OAI interface at 
/dspace-oai, the other two have theirs at /oai.

I see a whole range of search engine crawlers (Googlebot, Baiduspider, yandex, 
bingbot, majestic12) accessing .pdf.txt URLs for the two repositories whose 
.pdf.txt files are indexed. Both bitstream URL versions are used, ?sequence=n 
and /bitstream/prefix/handle/sequence/name.
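
In case it helps others compare numbers, this is roughly how such a tally 
could be produced. It is only a sketch under assumptions: it expects the 
"combined" access log format, matches the crawler names above as 
case-insensitive substrings of the user agent, and tells the two bitstream 
URL forms apart by whether a ?sequence= parameter is present. All function 
names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Crawler names as listed above; matched case-insensitively against the user agent.
CRAWLERS = ("Googlebot", "Baiduspider", "yandex", "bingbot", "majestic12")

# Assumption: "combined" log format, i.e.
# ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def url_form(url):
    """Return 'query' or 'path' for a .pdf.txt bitstream URL, else None."""
    parts = urlsplit(url)
    if not parts.path.endswith(".pdf.txt"):
        return None
    # ?sequence=n form versus /bitstream/prefix/handle/sequence/name form.
    return "query" if "sequence" in parse_qs(parts.query) else "path"


def crawler_name(agent):
    """Return the first known crawler name found in the user agent, else None."""
    lowered = agent.lower()
    for name in CRAWLERS:
        if name.lower() in lowered:
            return name
    return None


def count_pdf_txt_hits(lines):
    """Count .pdf.txt requests per (crawler, URL form) over access log lines."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        form = url_form(match.group("url"))
        crawler = crawler_name(match.group("agent"))
        if form and crawler:
            hits[(crawler, form)] += 1
    return hits
```

Calling count_pdf_txt_hits(open("access.log")) (the file name is a 
placeholder) returns a Counter keyed by crawler and URL form, which makes it 
easy to see whether one form dominates for a given crawler.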

The one repository that has neither license.txt nor .pdf.txt files indexed has 
the full item view disabled for anonymous access.

This is all a bit random but I don't know what information might be useful. I'm 
wondering where the search engines are even getting the .pdf.txt URLs, and then 
of course why Scholar would prefer these over the PDF files when the Google 
Scholar guidelines say they want PDF or HTML.
                
> Reports that Google Scholar is sometimes linking to DSpace extracted text 
> (*.pdf.txt) files instead of original PDF
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: DS-1387
>                 URL: https://jira.duraspace.org/browse/DS-1387
>             Project: DSpace
>          Issue Type: Bug
>          Components: XMLUI
>            Reporter: Tim Donohue
>
> This ticket is a placeholder for several recent reports about PDF indexing 
> oddities with Google Scholar and DSpace (seemingly XMLUI specific, though 
> that is unconfirmed).  
> In several cases, users have reported that Google Scholar is mistakenly 
> linking to the internal extracted PDF text files (*.pdf.txt files).  These 
> internal ".pdf.txt" files are automatically generated by DSpace for its own 
> indexing, and are not meant to be utilized by external search engines.
> Although the "*.pdf.txt" files are technically publicly accessible, they are 
> currently not linked to from the main Item "splash page", so it's uncertain 
> how they are being located by web spiders. (Some have speculated perhaps from 
> the OAI interface, or from indexing of the XMLUI's "mets.xml" file.)
> Here are a few threads describing this issue on the dspace-tech mailing list:
> * http://www.mail-archive.com/[email protected]/msg19303.html
> * http://www.mail-archive.com/[email protected]/msg18831.html
> If anyone else has noticed this issue, we'd encourage you to provide examples 
> in this JIRA ticket.  It may help us to better track down whether this is a 
> DSpace issue, a Google Scholar issue, or perhaps even a bit of both.
> When you add comments to this ticket, please provide the DSpace version you 
> are using and whether you are using XMLUI or JSPUI and whether you have OAI 
> enabled.  If you have any examples you can link to in Google Scholar or any 
> other oddities you've noticed, please note those as well.


