The reason Anurag gave for disliking cover pages was that they can make it difficult to discern things like author, title, and journal. It seems to me that if the generated cover page includes those metadata fields, along with custom text explaining the origin of the PDF, Google Scholar should not have any difficulty finding the metadata it is looking for. Another ‘bad case’ Anurag mentioned was documents that have multiple cover pages. I expect that the current implementation avoids adding cover pages to already ‘covered’ PDFs.

Monika

--
Monika Mevenkamp
Digital Repository Infrastructure Developer
Phone: 609-258-4161
333C 701 Carnegie, Princeton University, Princeton, NJ 08544

> On Jun 18, 2015, at 12:23 PM, Tim Donohue <tdono...@duraspace.org> wrote:
>
> Hi All,
>
> If you attended the Open Repositories 2015 (or followed along remotely),
> you may have heard about the "Indexing Repositories: Pitfalls and Best
> Practices" talk given by Anurag Acharya (co-creator of Google Scholar).
>
> If you haven't yet seen the talk, the slides are available at:
> http://www.or2015.net/wp-content/uploads/2015/06/or-2015-anurag-google-scholar.pdf
>
> The video should be available from the OR15 website in the coming weeks.
>
> One of the common indexing "pitfalls" mentioned by Anurag was
> automatically inserting PDF Cover Pages into PDFs. From what I can
> recall, there's a few reasons this can be problematic:
>
> 1. Google Scholar (and possibly other search engines) attempts to
> extract metadata from the text of PDF (using some language processing
> and format identification techniques). This metadata includes
> auto-extracting title, abstract and author information from PDFs.
> Unfortunately, the addition of a PDF coverpage often breaks this
> metadata extraction, which may result in the document not appearing in
> Google Scholar.
>
> 2. If all the PDF cover pages in your site look nearly identical (or
> completely identical), the Google Scholar indexer (and again possibly
> others) may wrongly flag the site for "cloaking" [1]. Essentially, it
> detects something is "fishy" as all the documents look very similar.
> This may result in the removal of the entire site from Google Scholar.
>
> So, to get to my question. In DSpace 5.0, we actually added a basic PDF
> Cover Page capability (which was requested by DCAT and others):
> https://wiki.duraspace.org/display/DSDOC5x/PDF+Citation+Cover+Page
>
> As this may have strong implications for inclusion in Google Scholar,
> should we consider removing this functionality from DSpace?
>
> For the time being, I've placed warnings in the Documentation for this
> feature to try to dissuade institutions from enabling it if Google
> Scholar inclusion is of high importance.
>
> This isn't really a technical issue (as we can easily remove code). But,
> I am interested in feedback from repository managers and users of DSpace
> to better inform our decisions on this feature going forward.
>
> Thanks,
>
> Tim
>
> [1] More on "cloaking", which can be a spamming technique to trick
> search engines (and is therefore actively blocked by many search
> engines): https://en.wikipedia.org/wiki/Cloaking
>
> --
> Tim Donohue
> Technical Lead for DSpace & DSpaceDirect
> DuraSpace.org | DSpace.org | DSpaceDirect.org
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dspace-general mailing list
> Dspace-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-general