Re: [Dspace-general] PDF Cover Pages & Google Scholar - Search Engine inclusion implications

Hilton Gibson Thu, 18 Jun 2015 09:54:06 -0700

Hi Tim,

The use case is very important. People download PDFs and do not remember
where they came from. More importantly, the PDF itself usually has no
permanent identifiers to help with citations.


Why is Google concentrating on extracting metadata from the PDF files? The
PDF format is not standard to start with; secondly, DSpace already does
this. So just expose the metadata DSpace extracts to Google.
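
As a rough illustration of what "exposing the metadata" could look like, the
sketch below renders item metadata as Highwire Press-style meta tags
(citation_title, citation_author, and so on), which are the tag names Google
Scholar's inclusion guidelines describe for item landing pages. The function
name, its parameters, and the example values are hypothetical; this is not
DSpace's own code.

```python
from html import escape


def scholar_meta_tags(title, authors, publication_date, pdf_url):
    """Render Highwire Press-style <meta> tags for an item landing page.

    Hypothetical helper for illustration only; not part of DSpace.
    """
    # One (name, value) pair per tag; authors get one tag each.
    pairs = [("citation_title", title)]
    pairs += [("citation_author", author) for author in authors]
    pairs.append(("citation_publication_date", publication_date))
    pairs.append(("citation_pdf_url", pdf_url))
    # Escape values so characters such as & and " stay valid in HTML attributes.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in pairs
    )


if __name__ == "__main__":
    # Example values are made up for the sketch.
    print(scholar_meta_tags(
        "An Example Thesis",
        ["Doe, Jane", "Smith, John"],
        "2015/06/18",
        "https://repository.example.org/bitstream/1234/5678/thesis.pdf",
    ))
```

With tags like these in the head of the landing page, an indexer would not
need to rely only on what it can parse out of the first page of the PDF.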

It is simply good academic research practice to be able to identify your
sources.

Perhaps add a warning about Google and suggest putting the "cover page" at
the back of the PDF.
See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/PDF_Cover_Page/5.X for
our config.

I suggest we enlighten Google about good research practice instead.

Regards

hg



*Hilton Gibson*
Ubuntu Linux Systems Administrator
Stellenbosch University Library
http://staff.lib.sun.ac.za/~hgibson/docs/cv/cv.html


On 18 June 2015 at 18:23, Tim Donohue <tdono...@duraspace.org> wrote:

> Hi All,
>
> If you attended Open Repositories 2015 (or followed along remotely),
> you may have heard about the "Indexing Repositories: Pitfalls and Best
> Practices" talk given by Anurag Acharya (co-creator of Google Scholar).
>
> If you haven't yet seen the talk, the slides are available at:
>
> http://www.or2015.net/wp-content/uploads/2015/06/or-2015-anurag-google-scholar.pdf
>
> The video should be available from the OR15 website in the coming weeks.
>
> One of the common indexing "pitfalls" mentioned by Anurag was
> automatically inserting PDF Cover Pages into PDFs. From what I can
> recall, there are a few reasons this can be problematic:
>
> 1. Google Scholar (and possibly other search engines) attempts to
> extract metadata from the text of the PDF (using some language processing
> and format identification techniques). This metadata includes
> auto-extracted title, abstract and author information.
> Unfortunately, the addition of a PDF cover page often breaks this
> metadata extraction, which may result in the document not appearing in
> Google Scholar.
>
> 2. If all the PDF cover pages in your site look nearly identical (or
> completely identical), the Google Scholar indexer (and again possibly
> others) may wrongly flag the site for "cloaking" [1]. Essentially, it
> detects something is "fishy" as all the documents look very similar.
> This may result in the removal of the entire site from Google Scholar.
>
> So, to get to my question. In DSpace 5.0, we actually added a basic PDF
> Cover Page capability (which was requested by DCAT and others):
> https://wiki.duraspace.org/display/DSDOC5x/PDF+Citation+Cover+Page
>
> As this may have strong implications for inclusion in Google Scholar,
> should we consider removing this functionality from DSpace?
>
> For the time being, I've placed warnings in the Documentation for this
> feature to try to dissuade institutions from enabling it if Google
> Scholar inclusion is of high importance.
>
> This isn't really a technical issue (as we can easily remove the code).
> But I am interested in feedback from repository managers and users of
> DSpace to better inform our decisions on this feature going forward.
>
> Thanks,
>
> Tim
>
> [1] More on "cloaking", which can be a spamming technique to trick
> search engines (and is therefore actively blocked by many search
> engines): https://en.wikipedia.org/wiki/Cloaking
>
> --
> Tim Donohue
> Technical Lead for DSpace & DSpaceDirect
> DuraSpace.org | DSpace.org | DSpaceDirect.org
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dspace-general mailing list
> Dspace-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-general
>
