Hi Tim, I hope we can find a way to preserve the use case. For us, the preservation and promotion of our research is what matters most. (A rough sketch of the "cover page at the back" idea is included after the quoted message below.)
Regards
hg

*Hilton Gibson*
Ubuntu Linux Systems Administrator
Stellenbosch University Library
http://staff.lib.sun.ac.za/~hgibson/docs/cv/cv.html

On 18 June 2015 at 19:11, Tim Donohue <tdono...@duraspace.org> wrote:

> Hi Hilton,
>
> First off, thanks for the feedback. A few comments inline based on your
> thoughts and observations.
>
> On 6/18/2015 11:52 AM, Hilton Gibson wrote:
>
>> The use case is very important. People download PDFs and do not remember
>> where they came from. More importantly, the PDF itself usually has no
>> permanent identifier to help with citations.
>>
>
> I definitely can understand the use case, and it is the same one I've
> heard from other users as well.
>
> But, at the same time, we do need to find a balance with *preserving* the
> original PDF (which is the counter use case). Dynamically altering a PDF
> may never be entirely error-free, so there is a risk that inserting these
> cover pages could cause issues with the downloaded PDF itself.
>
> Google Scholar has made it clear that they are much more interested in
> the original PDF than in any locally modified version.
>
>> Why is Google concentrating on extracting metadata from the PDF files?
>> The PDF format is not standard to start with, and DSpace already does
>> this extraction. So just expose the metadata DSpace extracts to Google.
>>
>
> Google does also grab the metadata from the repository itself. But, from
> my understanding, Google has found that the repository metadata is often
> either incomplete or wrong (not just in DSpace but everywhere). There may
> be spelling errors in the metadata, authors missing (some institutions
> only enter metadata for authors *at* their institution), incorrect dates
> of publication, or other important metadata fields which are just missing.
>
> So, Google's practice has been to also extract this information from the
> PDF itself as an additional source and to try to resolve discrepancies
> between multiple sites. For example, multiple repositories may include
> the same PDF article, but Google Scholar wants to list it only once in
> their results page, providing multiple links to where that same article
> can be downloaded.
>
> From my understanding, Google Scholar has figured out ways to extract
> this metadata based on the structure of a "normal" scholarly
> document/article (which often includes a title, author, abstract and even
> dates/citation information all on the first page). Here, the addition of
> custom cover pages can throw things off, and may cause Google Scholar to
> no longer be able to verify the reported metadata or resolve
> discrepancies. This in turn can sometimes cause the item not to be
> indexed by Google Scholar.
>
> But, please understand, I obviously don't know exactly how Google
> performs all these metadata extraction activities. This is just based on
> what I've heard from Anurag (co-creator of Google Scholar).
>
>> Perhaps add a warning about Google and suggest putting the "cover page"
>> at the back of the PDF.
>> See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/PDF_Cover_Page/5.X
>> for our config.
>>
>
> This might be an option for institutions who really want to have some
> sort of repository-based metadata added to their PDFs. Moving the "cover
> page" to the last page might be a possible compromise here.
>
> - Tim