[
https://jira.duraspace.org/browse/DS-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27674#comment-27674
]
Tim Donohue commented on DS-1387:
---------------------------------
Hi all,
I had a brief discussion with Anurag Acharya of Google Scholar about this issue
today.
His suggestion is to either:
A) Make *.pdf.txt files (and similar text-extracted files) inaccessible to the
public (i.e. return HTTP 404). Only make them accessible to our internal
system so that they can be indexed by Solr/Lucene. The Google Scholar spiders
always end up finding "unexpected" things if they are publicly accessible.
OR
B) Disallow access via a robots.txt (as Reinhard mentions in the previous
comment)
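
For anyone weighing Option B, here is a minimal Python sketch of how a Disallow pattern with Google-style wildcards (`*` for any run of characters, a trailing `$` for end of URL) could be checked against sample paths before deploying it. The function names and the sample paths are hypothetical, and this is only an approximation of how a crawler matches rules, not Google's actual implementation.

```python
import re
from typing import Iterable


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern with '*' and trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; every other character is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns: Iterable[str]) -> bool:
    """True if any non-empty Disallow pattern matches the path from its start."""
    return any(
        robots_pattern_to_regex(p).match(path) is not None
        for p in disallow_patterns
        if p  # an empty Disallow value blocks nothing
    )


# Hypothetical sample paths, for illustration only.
original_pdf = "/bitstream/handle/123456789/1/thesis.pdf"
extracted_text = "/bitstream/handle/123456789/1/thesis.pdf.txt"

print(is_disallowed(extracted_text, ["/*.pdf.txt"]))  # True
print(is_disallowed(original_pdf, ["/*.pdf.txt"]))    # False
print(is_disallowed(original_pdf, ["/*.pdf"]))        # True: rule is too broad
```

The last line shows how a slightly broader rule would also hide the original PDFs. Note too that a `$`-anchored rule would miss URLs carrying a query string.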
His recommendation is to go with Option A, if possible. He's seen too many
cases where people accidentally make their robots.txt *too restrictive* and
block things they never meant to (similar to what Reinhard mentions).
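
To make Option A concrete, here is a minimal Python sketch of the decision it implies: requests for extracted-text files get a 404 unless they come from the internal indexing system. The suffix list, the function name, and the `is_internal_client` flag are all assumptions for illustration; this is not how DSpace currently serves bitstreams.

```python
# Hypothetical list of suffixes that mark extracted-text files;
# other "similar text-extracted files" could be added here.
EXTRACTED_TEXT_SUFFIXES = (".pdf.txt",)


def public_status(filename: str, is_internal_client: bool) -> int:
    """Return the HTTP status a request for this file should receive."""
    is_extracted_text = filename.lower().endswith(EXTRACTED_TEXT_SUFFIXES)
    if is_extracted_text and not is_internal_client:
        return 404  # hide extracted text from the public and from spiders
    return 200


print(public_status("thesis.pdf.txt", is_internal_client=False))  # 404
print(public_status("thesis.pdf.txt", is_internal_client=True))   # 200
print(public_status("thesis.pdf", is_internal_client=False))      # 200
```

How "internal" is determined (for example, by source address or by bypassing the web layer entirely for indexing) is left open here.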
> Reports that Google Scholar is sometimes linking to DSpace extracted text
> (*.pdf.txt) files instead of original PDF
> -------------------------------------------------------------------------------------------------------------------
>
> Key: DS-1387
> URL: https://jira.duraspace.org/browse/DS-1387
> Project: DSpace
> Issue Type: Bug
> Components: XMLUI
> Reporter: Tim Donohue
>
> This ticket is a placeholder for several recent reports about PDF indexing
> oddities with Google Scholar and DSpace (seemingly XMLUI specific, though
> that is unconfirmed).
> In several cases, users have reported that Google Scholar is mistakenly
> linking to the internal extracted PDF text files (*.pdf.txt files). These
> internal ".pdf.txt" files are automatically generated by DSpace for its own
> indexing, and are not meant to be utilized by external search engines.
> Although the "*.pdf.txt" files are technically publicly accessible, they are
> currently not linked to from the main Item "splash page", so it's uncertain
> how they are being located by web spiders. (Some have speculated perhaps from
> the OAI interface, or from indexing of the XMLUI's "mets.xml" file.)
> Here are a few threads describing this issue on the dspace-tech mailing list:
> * http://www.mail-archive.com/[email protected]/msg19303.html
> * http://www.mail-archive.com/[email protected]/msg18831.html
> If anyone else has noticed this issue, we'd encourage you to provide examples
> in this JIRA ticket. It may help us to better track down whether this is a
> DSpace issue, a Google Scholar issue, or perhaps even a bit of both.
> When you add comments to this ticket, please provide the DSpace version you
> are using, whether you are using XMLUI or JSPUI, and whether you have OAI
> enabled. If you have any examples you can link to in Google Scholar, or any
> other oddities you've noticed, please note those as well.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel