Hi Elvi,
Anurag (from Google Scholar) replied to my question with:
As always, the devil is often in the details.
http://repository.seafdec.org.ph/bitstream/10862/124/1/adsea94p037-062.pdf
I looked at our crawl logs. Looks like the PDF file redirected to
somewhere else (using HTTP 301) and the crawler got no document. Which
meant the citation_pdf_url entry had no effect. The only document that
was seen associated with the record was the txt version.
Looking further, similar redirects were returned for quite a few PDF
versions. E.g.:
http://repository.seafdec.org.ph/bitstream/10862/697/1/AFNv14n04-05.pdf
http://repository.seafdec.org.ph/bitstream/10862/801/1/AFNv10n01a.pdf
http://repository.seafdec.org.ph/bitstream/10862/812/1/HILITE98.pdf
http://repository.seafdec.org.ph/bitstream/10862/862/1/techrept03_pp21-38.pdf
Right now, the repository is down so I can't check further.
The redirect he mentions is that a request to:
/bitstream/123/456/1/document.pdf
gets an HTTP 301 redirect to
/bitstream/handle/123/456/document.pdf?sequence=1
I'm not sure what the guidance from Scholar is on the number of redirects
to follow, or if perhaps something incorrect happened (he mentions that the
crawler got no document).
That said, I think we should try to minimize the number of redirects, and
have a consistent bitstream-serving URL.
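
To make the consistent-URL idea concrete, here is a minimal Python sketch that
rewrites the legacy bitstream path into the form the 301 currently points to, so
that a citation_pdf_url could reference the final location directly. The
function name canonical_bitstream_path is hypothetical, and the path pattern is
assumed from the single example above; it is not taken from the DSpace source.

```python
import re

# Legacy form seen above: /bitstream/<prefix>/<suffix>/<sequence>/<filename>
# Assumption: prefix, suffix and sequence are all numeric, as in the example.
_LEGACY = re.compile(r"^/bitstream/(\d+)/(\d+)/(\d+)/([^/?#]+)$")


def canonical_bitstream_path(path):
    """Return the redirect-target form of a legacy bitstream path.

    Paths that do not match the legacy pattern are returned unchanged.
    """
    match = _LEGACY.match(path)
    if match is None:
        return path
    prefix, suffix, sequence, filename = match.groups()
    return f"/bitstream/handle/{prefix}/{suffix}/{filename}?sequence={sequence}"


print(canonical_bitstream_path("/bitstream/123/456/1/document.pdf"))
```

Emitting the rewritten form in the meta tag would save the crawler one hop,
assuming the target URL itself serves the PDF without a further redirect.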
I did notice something about your PDFs, and this might be a red herring, but
it appears that a citation version of the PDF is generated on the fly for
every request. I don't know whether that affects the search index results, or
whether that document generation failed (or timed out) and caused
Scholar to not get a document.
Peter Dietz
On Mon, Oct 8, 2012 at 1:58 PM, Peter Dietz <pdiet...@gmail.com> wrote:
> Hi Elvi,
>
> That is strange.
>
>
> I had to go re-read Google Scholar's Inclusion Guidelines, and I see an
> issue (not with you, but with DSpace, and with Google's requirement). I
> remember discussing this with Anurag (from Google Scholar), and I thought
> we were on the same page, but your site is clearly evidence of an issue.
>
> The "<meta>" tags normally apply only to the exact page on which they're
> provided. If this page shows only the abstract of the paper and you have
> the full text in a separate file, e.g., in the PDF format, please specify
> the locations of *all* full text versions using citation_pdf_url or
> DC.identifier tags. The content of the tag is the absolute URL of the PDF
> file; *for security reasons, it must refer to a file in the same
> subdirectory as the HTML abstract*.
>
> Failure to link the alternate versions together could result in the
> incorrect indexing of the PDF files, because these files would be processed
> as separate documents without the information contained in the meta tags.
>
> from: http://scholar.google.com/intl/en/scholar/inclusion.html#indexing
> My bold emphasis added.
>
> Your DSpace Item:
> http://repository.seafdec.org.ph/handle/10862/124
>
> Your citation_pdf_url meta-tag:
> <meta content="http://repository.seafdec.org.ph/bitstream/10862/124/1/adsea94p037-062.pdf" name="citation_pdf_url" />
>
> "the PDF file [...] must refer to a file in the same subdirectory as the
> HTML abstract"
> http://repository.seafdec.org.ph/handle/10862/124
> http://repository.seafdec.org.ph/bitstream/10862/124/1/adsea94p037-062.pdf
>
> umm, nope.
>
> I'm wondering if Google wants us to restructure some URLs, so we also
> allow something like:
> http://repository.seafdec.org.ph/handle/10862/124*/bitstream/1/adsea94p037-062.pdf*
>
> That would pass the requirement of PDF within the HTML subdirectory...
> So, I do have a question for Google, and that is: really???
> (I'll contact them to see how required this is...).
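
A quick way to test that reading of the guideline is the Python sketch below. It
assumes "same subdirectory" means "same host, and the PDF path sits below the
abstract page's path", which is my interpretation and has not been confirmed by
Google. The function name pdf_under_abstract is hypothetical.

```python
from urllib.parse import urlsplit


def pdf_under_abstract(abstract_url, pdf_url):
    """True if pdf_url is on the same host and below the abstract page's path.

    Assumption: this is what "same subdirectory" means in the guideline.
    """
    abstract = urlsplit(abstract_url)
    pdf = urlsplit(pdf_url)
    if (abstract.scheme, abstract.netloc) != (pdf.scheme, pdf.netloc):
        return False
    base = abstract.path.rstrip("/") + "/"
    return pdf.path.startswith(base)


abstract = "http://repository.seafdec.org.ph/handle/10862/124"
current = "http://repository.seafdec.org.ph/bitstream/10862/124/1/adsea94p037-062.pdf"
proposed = "http://repository.seafdec.org.ph/handle/10862/124/bitstream/1/adsea94p037-062.pdf"

print(pdf_under_abstract(abstract, current))   # False
print(pdf_under_abstract(abstract, proposed))  # True
```

Under that interpretation the current bitstream URL fails the check and the
restructured one passes.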
>
> Peter Dietz
>
>
>
> On Thu, Oct 4, 2012 at 9:14 PM, Nemiz, Elvi <esne...@seafdec.org.ph> wrote:
>
>> Dear all,
>>
>> I am just wondering why a lot of extracted text from our items comes up
>> in a Google search instead of the files in Bundle: ORIGINAL, which are all
>> PDFs. Do I have to manually set the original bundle as the primary
>> bitstream? Or is it already set as the default primary bundle if we only
>> uploaded a single bitstream? Please check these search results:
>> http://scholar.google.com/scholar?start=0&q=site:repository.seafdec.org.ph&hl=en&as_sdt=0,5
>> I want our users to view or download the PDF and not the extracted text
>> from Google Scholar search results.
>>
>> Thanks in advance and regards,
>> Elvi S. Nemiz
>> Information Assistant
>> Library and Data Bank Services Section
>> Training and Information Division
>> SEAFDEC Aquaculture Department
>> Tigbauan, Iloilo
>> Philippines
>>
>> Access and download SEAFDEC/AQD publications for FREE
>>
>> http://repository.seafdec.org.ph
>>
>> [SEAFDEC/AQD Institutional Repository (SAIR)]
>>
>> - the official digital repository of scholarly and research information
>> of the department
>>
>>
>> ------------------------------------------------------------------------------
>> Don't let slow site performance ruin your business. Deploy New Relic APM
>> Deploy New Relic app performance management and know exactly
>> what is happening inside your Ruby, Python, PHP, Java, and .NET app
>> Try New Relic at no cost today and get our sweet Data Nerd shirt too!
>> http://p.sf.net/sfu/newrelic-dev2dev
>> _______________________________________________
>> DSpace-tech mailing list
>> DSpace-tech@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>>
>>
>