[ 
https://jira.duraspace.org/browse/DS-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=26868#comment-26868
 ] 

Tim Donohue commented on DS-1387:
---------------------------------

This issue was discussed in the DSpace Developers meeting on Nov 14 (starts at 
[20:45]): http://irclogs.duraspace.org/index.php?date=2012-11-14

Here's a basic summary of what we've been able to find so far:
* All early examples seem to be using DSpace XMLUI
* The issue may or may not be related to the fact that the XMLUI always 
performs a 301 Redirect of the "citation_pdf_url" to the correct XMLUI path.  
This 301 Redirect is always performed in order to allow the same 
"citation_pdf_url" to be utilized by the JSPUI & the XMLUI.  Its code is here: 
https://github.com/DSpace/DSpace/blob/master/dspace-xmlui/src/main/webapp/sitemap.xmap#L355
* We're uncertain how Google Scholar is finding the ".pdf.txt" documents.  They 
are not directly linked to anywhere on the Item Page and NEVER appear as the 
"citation_pdf_url". But, they are accessible through the OAI-PMH interface, or 
through the "mets.xml" document that the XMLUI uses to generate its final HTML 
output.
* Some have seen log evidence that other crawlers also seem to index the 
".pdf.txt" documents.  Again, we're not entirely sure how they are getting 
there.
* We may need more real-life examples to determine exactly what is going 
on, and how best to resolve it.
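
For anyone who wants to gather examples from their own site, here is a minimal
sketch (not part of DSpace) that tallies requests for ".pdf.txt" bitstreams by
user agent and referrer. It assumes the web server writes the Apache-style
"combined" access log format; the function name and the regular expression are
illustrative only and will need adjusting for other log layouts.

```python
import re
from collections import Counter

# Assumption: Apache-style "combined" log format, e.g.
# ip - - [date] "GET /path HTTP/1.1" 200 1234 "referrer" "user agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def extracted_text_hits(lines):
    """Count requests for *.pdf.txt files, keyed by (user agent, referrer)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        # Drop any query string (e.g. "?sequence=2") before testing the extension.
        path = match.group("path").split("?", 1)[0]
        if path.lower().endswith(".pdf.txt"):
            hits[(match.group("agent"), match.group("referrer"))] += 1
    return hits
```

It could be run as `extracted_text_hits(open("access.log"))`, where
"access.log" is a placeholder for the real log path. A non-empty referrer on a
crawler's request may hint at which page (OAI-PMH response, "mets.xml", or
something else) exposed the link.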
                
> Reports that Google Scholar is sometimes linking to DSpace extracted text 
> (*.pdf.txt) files instead of original PDF
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: DS-1387
>                 URL: https://jira.duraspace.org/browse/DS-1387
>             Project: DSpace
>          Issue Type: Bug
>          Components: XMLUI
>            Reporter: Tim Donohue
>
> This ticket is a placeholder for several recent reports about PDF indexing 
> oddities with Google Scholar and DSpace (seemingly XMLUI specific, though 
> that is unconfirmed).  
> In several cases, users have reported that Google Scholar is mistakenly 
> linking to the internal extracted PDF text files (*.pdf.txt files).  These 
> internal ".pdf.txt" files are automatically generated by DSpace for its own 
> indexing, and are not meant to be utilized by external search engines.
> Although the "*.pdf.txt" files are technically publicly accessible, they are 
> currently not linked to from the main Item "splash page", so it's uncertain 
> how they are being located by web spiders. (Some have speculated perhaps from 
> the OAI interface, or from indexing of the XMLUI's "mets.xml" file)
> Here are a few threads describing this issue on the dspace-tech mailing list:
> * http://www.mail-archive.com/[email protected]/msg19303.html
> * http://www.mail-archive.com/[email protected]/msg18831.html
> If anyone else has noticed this issue, we'd encourage you to provide examples 
> in this JIRA ticket.  It may help us to better track down whether this is a 
> DSpace issue, a Google Scholar issue, or perhaps even a bit of both.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
