[
https://jira.duraspace.org/browse/DS-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=26893#comment-26893
]
Reinhard Engels commented on DS-1387:
-------------------------------------
Here's why I'm convinced it has something to do with OAI (and the mets and ore
crosswalks in particular, which expose the pdf.txt links): the only pdf.txt
records Google is crawling are the first 100 records in each collection -- the
100 records that are listed on the first page of OAI results (Googlebot doesn't
seem to be following the resumption token, thankfully). Although intriguing,
your item metadata suggestion, Claudia, doesn't line up with what I'm seeing
in my logs -- 0 hits this month at least, except for my testing just now. Also,
I would expect to see pdf.txt links for the entire repo in that case instead of
just the first 100 for each collection. Finally, I can see that the crawling of
pdf.txt files began right after the first time Googlebot crawled the mets and
ore crosswalks in September (though both had been available long before then;
Googlebot just hadn't been interested).
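For anyone who wants to run the same kind of log check, here is a minimal
sketch, not the exact commands I used. It assumes the web server writes the
common "combined" access log format and that the crawler identifies itself
with "Googlebot" in the user agent; the function name and the regex are made
up for illustration and will need adjusting for other log layouts.

```python
import re
from collections import Counter

# Assumption: Apache/nginx "combined" log format. Adjust for other layouts.
LOG_LINE = re.compile(
    r'\[(?P<day>\d{2})/(?P<month>\w{3})/(?P<year>\d{4})[^\]]*\] '
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_pdf_txt_hits(lines):
    """Count Googlebot requests for *.pdf.txt paths, keyed by (year, month)."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue  # skip lines that do not look like combined-format entries
        path = m.group("path").split("?", 1)[0]  # ignore the query string
        if path.endswith(".pdf.txt") and "Googlebot" in m.group("agent"):
            hits[(m.group("year"), m.group("month"))] += 1
    return hits
```

Passing an open log file (for example `googlebot_pdf_txt_hits(open("access.log"))`,
where the file name is a placeholder) gives a per-month count that can be
compared against the date the crosswalks were first crawled.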
We turned off the ore and mets crosswalks yesterday, but I'm nervous that the
existing pdf.txt records in Scholar won't be removed because now that Google
has the links, even if they're "orphans," they are valid (return HTTP 200).
We're considering modifying our robots.txt to forbid crawling of these files,
or even writing redirects to the PDFs for pdf.txt requests, but I'm not sure
the former will be sufficient (sounds like it wasn't for you, Dan), and the
latter strikes me as nasty and potentially dangerous.
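To make the two options concrete, here is a rough sketch under stated
assumptions, not something we have deployed. The rule `/*.pdf.txt` is a
hypothetical robots.txt Disallow pattern that relies on the `*` wildcard
Googlebot understands; the bitstream URLs are invented examples; and the
helper names are mine. The redirect helper only computes a candidate target
by dropping the `.txt` suffix and the query string; whether that URL actually
resolves to the PDF on a given install is untested.

```python
import re
from urllib.parse import urlsplit


def robots_pattern_to_regex(pattern):
    """Translate a Google-style robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, url_path):
    """True if the Disallow pattern matches from the start of the path."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


def pdf_redirect_target(url):
    """Return the path of the original PDF for a *.pdf.txt request, else None.

    The query string is dropped on the assumption that it identifies the
    extracted-text bitstream rather than the PDF.
    """
    path = urlsplit(url).path
    if path.endswith(".pdf.txt"):
        return path[: -len(".txt")]
    return None
```

With the hypothetical rule, `is_blocked("/*.pdf.txt", path)` is true for a
request such as `/bitstream/handle/1/2/a.pdf.txt?sequence=3` and false for the
PDF itself, which is the behaviour the robots.txt change would be aiming for.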
Just out of curiosity, why are these extracted text files exposed via http to
begin with? Aren't they intended just to be used by the internal search
function?
Also, does anyone have a good contact at Google Scholar? I fed some forms with
desperate pleas and shot some emails off, but no response yet. If I'm right
about the crosswalk stuff being at the bottom of this, it looks like the
problem started because of a configuration tweak on their end, and perhaps
another tweak on their end could fix it.
> Reports that Google Scholar is sometimes linking to DSpace extracted text
> (*.pdf.txt) files instead of original PDF
> -------------------------------------------------------------------------------------------------------------------
>
> Key: DS-1387
> URL: https://jira.duraspace.org/browse/DS-1387
> Project: DSpace
> Issue Type: Bug
> Components: XMLUI
> Reporter: Tim Donohue
>
> This ticket is a placeholder for several recent reports about PDF indexing
> oddities with Google Scholar and DSpace (seemingly XMLUI specific, though
> that is unconfirmed).
> In several cases, users have reported that Google Scholar is mistakenly
> linking to the internal extracted PDF text files (*.pdf.txt files). These
> internal ".pdf.txt" files are automatically generated by DSpace for its own
> indexing, and are not meant to be utilized by external search engines.
> Although the "*.pdf.txt" files are technically publicly accessible, they are
> currently not linked to from the main Item "splash page", so it's uncertain
> how they are being located by web spiders. (Some have speculated perhaps from
> the OAI interface, or from indexing of the XMLUI's "mets.xml" file)
> Here are a few threads describing this issue on dspace-tech mailing list:
> * http://www.mail-archive.com/[email protected]/msg19303.html
> * http://www.mail-archive.com/[email protected]/msg18831.html
> If anyone else has noticed this issue, we'd encourage you to provide examples
> in this JIRA ticket. It may help us to better track down whether this is a
> DSpace issue, a Google Scholar issue, or perhaps even a bit of both.
> When you add comments to this ticket, please provide the DSpace version you
> are using and whether you are using XMLUI or JSPUI and whether you have OAI
> enabled. If you have any examples you can link to in Google Scholar or any
> other oddities you've noticed, please note those as well.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel