Hi Tim, I hope we can find a way to preserve the use case. For us, the preservation and promotion of our research is what matters most. (A rough sketch of the "cover page at the back" idea is included after the quoted message below.)
Regards
hg

*Hilton Gibson*
Ubuntu Linux Systems Administrator
Stellenbosch University Library
http://staff.lib.sun.ac.za/~hgibson/docs/cv/cv.html

On 18 June 2015 at 19:11, Tim Donohue <tdono...@duraspace.org> wrote:

> Hi Hilton,
>
> First off, thanks for the feedback. A few comments inline based on your
> thoughts and observations.
>
> On 6/18/2015 11:52 AM, Hilton Gibson wrote:
>
>> The use case is very important. People download PDFs and do not remember
>> where they came from. More importantly, the PDF itself usually has no
>> permanent identifier to help with citations.
>>
>
> I definitely can understand the use case, and it is the same one I've
> heard from other users as well.
>
> But, at the same time, we do need to find a balance with *preserving* the
> original PDF (which is the counter use case). Dynamically altering a PDF
> may never be entirely error-free, so there is a risk that inserting these
> cover pages could cause issues with the downloaded PDF itself.
>
> Google Scholar has made it clear that they are much more interested in
> the original PDF than in any locally modified version.
>
>> Why is Google concentrating on extracting metadata from the PDF files?
>> The PDF format is not standard to start with, and DSpace already does
>> this extraction. So just expose the metadata DSpace extracts to Google.
>>
>
> Google does also grab the metadata from the repository itself. But, from
> my understanding, Google has found that the repository metadata is often
> either incomplete or wrong (not just in DSpace but everywhere). There may
> be spelling errors in the metadata, authors missing (some institutions
> only enter metadata for authors *at* their institution), incorrect dates
> of publication, or other important metadata fields which are just missing.
>
> So, Google's practice has been to also extract this information from the
> PDF itself as an additional source and to try to resolve discrepancies
> between multiple sites. For example, multiple repositories may include
> the same PDF article, but Google Scholar wants to list it only once in
> their results page, providing multiple links to where that same article
> can be downloaded.
>
> From my understanding, Google Scholar has figured out ways to extract
> this metadata based on the structure of a "normal" scholarly
> document/article (which often includes a title, author, abstract and even
> dates/citation information all on the first page). Here, the addition of
> custom cover pages can throw things off, and may cause Google Scholar to
> no longer be able to verify the reported metadata or resolve
> discrepancies. This in turn can sometimes cause the item not to be
> indexed by Google Scholar.
>
> But, please understand, I obviously don't know exactly how Google
> performs all these metadata extraction activities. This is just based on
> what I've heard from Anurag (co-creator of Google Scholar).
>
>> Perhaps add a warning about Google and suggest putting the "cover page"
>> at the back of the PDF.
>> See: http://wiki.lib.sun.ac.za/index.php/SUNScholar/PDF_Cover_Page/5.X
>> for our config.
>>
>
> This might be an option for institutions who really want to have some
> sort of repository-based metadata added to their PDFs. Moving the "cover
> page" to the last page might be a possible compromise here.
>
> - Tim