[Dspace-devel] [DuraSpace JIRA] (DS-1387) Reports that Google Scholar is sometimes linking to DSpace extracted text (*.pdf.txt) files instead of original PDF

Reinhard Engels (DuraSpace JIRA) Wed, 21 Nov 2012 07:23:04 -0800

    [ https://jira.duraspace.org/browse/DS-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=26952#comment-26952 ]


Reinhard Engels commented on DS-1387:
-------------------------------------

Thanks Tim, Richard. Based on what I see in my logs and what I can see in 
Google when I limit to site:dash.harvard.edu and filetype:txt, it does look 
like everything it crawled was listed in the METS and ORE crosswalks 
(and nothing else).

We took a couple of additional steps yesterday, and we did see a significant 
reduction in pdf.txt requests. I'm not sure whether that was due to these 
steps or to turning off the crosswalks last week.

Here's what we did:

1. added robots.txt directive: Disallow: /*/*.pdf.txt*

2. added a "citation_pdf_url" meta tag to the abstract HTML pages (probably 
a good idea in any case).
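
For illustration, step 2 could look roughly like the sketch below. It assumes 
the tag takes the usual <meta name="citation_pdf_url" content="..."> form; the 
helper name and the example URL are hypothetical, not taken from our setup.

```python
from html import escape


def citation_pdf_meta(pdf_url: str) -> str:
    """Build a citation_pdf_url <meta> tag for an item's abstract page.

    Hypothetical helper: the URL is HTML-escaped so that a query string
    such as "?sequence=1&isAllowed=y" stays valid inside the attribute.
    """
    return '<meta name="citation_pdf_url" content="{}">'.format(
        escape(pdf_url, quote=True)
    )


if __name__ == "__main__":
    # Made-up bitstream URL, for illustration only.
    print(citation_pdf_meta(
        "https://repository.example.org/bitstream/handle/1/100/paper.pdf?sequence=1"
    ))
```

The point of the tag is to name the original PDF explicitly on the abstract 
page, so a crawler does not have to guess which bitstream is the real one.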

I also got a response from our contact at Google Scholar confirming that 
robots.txt (from their perspective) is the way to go:

"the best way to avoid indexing such urls would be to add the following line 
for all user-agents in your robots.txt.

Disallow: /bitstream/*txt?

Crawlers tend to be pretty comprehensive scanners. They need to be if they are 
to be able to index all the sites with all the diverse structures on the web. 
So, if there is a pathway to a url-space, they usually find it...."

Perhaps the default robots.txt for DSpace should be updated accordingly? Though 
come to think of it, neither *txt nor *pdf.txt will be 100% sufficient -- the 
first may have false positives (legitimate txt files, though I don't think we 
have any in our repo); the latter will miss some (like html.txt text 
extractions, which we do have). Still, keeping the pdf.txt files away from the 
crawlers, if it does in fact work, is a hell of a lot better than the status 
quo.
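
The trade-off between the two patterns can be checked with a small sketch. It 
assumes Google-style robots.txt matching ('*' matches any run of characters, a 
trailing '$' anchors the end, everything else is literal, including the '?' in 
the second pattern). The function names and sample paths are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: Google-style semantics, where '*' is the only wildcard,
    a trailing '$' anchors the end, and matching starts at the beginning
    of the URL path. Other crawlers may interpret patterns differently.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern would cover the given path."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    patterns = ["/*/*.pdf.txt*", "/bitstream/*txt?"]
    # Made-up DSpace-style bitstream paths, for illustration only.
    paths = [
        "/bitstream/handle/1/100/paper.pdf.txt?sequence=2",
        "/bitstream/handle/1/100/page.html.txt?sequence=3",
        "/bitstream/handle/1/100/notes.txt?sequence=1",
        "/bitstream/handle/1/100/paper.pdf?sequence=1",
    ]
    for pattern in patterns:
        for path in paths:
            print(pattern, path, is_blocked(pattern, path))
```

Under these assumptions the *pdf.txt pattern covers only the pdf.txt path and 
lets the html.txt extraction through, while the *txt? pattern covers all three 
txt paths, including the legitimate notes.txt. Neither touches the original 
PDF.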

I haven't looked at the "resource policy" option yet, Richard -- I'll mull that 
over today as an alternative/additional precaution. Thanks for the suggestion!

                
> Reports that Google Scholar is sometimes linking to DSpace extracted text 
> (*.pdf.txt) files instead of original PDF
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: DS-1387
>                 URL: https://jira.duraspace.org/browse/DS-1387
>             Project: DSpace
>          Issue Type: Bug
>          Components: XMLUI
>            Reporter: Tim Donohue
>
> This ticket is a placeholder for several recent reports about PDF indexing 
> oddities with Google Scholar and DSpace (seemingly XMLUI specific, though 
> that is unconfirmed).  
> In several cases, users have reported that Google Scholar is mistakenly 
> linking to the internal extracted PDF text files (*.pdf.txt files).  These 
> internal ".pdf.txt" files are automatically generated by DSpace for its own 
> indexing, and are not meant to be utilized by external search engines.
> Although the "*.pdf.txt" files are technically publicly accessible, they are 
> currently not linked to from the main Item "splash page", so it's uncertain 
> how they are being located by web spiders. (Some have speculated perhaps from 
> the OAI interface, or from indexing of the XMLUI's "mets.xml" file)
> Here are a few threads describing this issue on the dspace-tech mailing list:
> * http://www.mail-archive.com/[email protected]/msg19303.html
> * http://www.mail-archive.com/[email protected]/msg18831.html
> If anyone else has noticed this issue, we'd encourage you to provide examples 
> in this JIRA ticket.  It may help us to better track down whether this is a 
> DSpace issue, a Google Scholar issue, or perhaps even a bit of both.
> When you add comments to this ticket, please provide the DSpace version you 
> are using and whether you are using XMLUI or JSPUI and whether you have OAI 
> enabled.  If you have any examples you can link to in Google Scholar or any 
> other oddities you've noticed, please note those as well.


