Hi All,

First off, I just wanted to thank everyone for their thoughts, ideas, etc. It's obvious that this is a very "hot button" topic in the DSpace community. My goal was to get the discussion started now, so that we can determine a way forward.

So, I'd encourage additional feedback on this topic. As of yet, there is no decision to remove this feature from DSpace. My goal in bringing this up is to ensure we are making a "well informed" decision on the benefits & detriments of PDF Cover Pages (and to ensure all of us are aware of both sides of the argument here).

While Google Scholar is not the only scholarly search engine out there, it is one of the ones I hear about most frequently from researchers and repository managers. At the very least, we should take this feedback from Google Scholar into consideration, as it definitely could have an effect on the visibility of DSpace PDFs in GS. So, at a minimum, this should help us to provide more informative warnings about some of the possible detriments of PDF cover pages.

If DCAT (DSpace Community Advisory Team) is interested in re-visiting this, it also may make for a good discussion at one of your monthly calls.

Thanks again all. Please do feel free to keep sending feedback!

- Tim

On 6/18/2015 11:23 AM, Tim Donohue wrote:
> Hi All,
>
> If you attended Open Repositories 2015 (or followed along remotely),
> you may have heard about the "Indexing Repositories: Pitfalls and Best
> Practices" talk given by Anurag Acharya (co-creator of Google Scholar).
>
> If you haven't yet seen the talk, the slides are available at:
> http://www.or2015.net/wp-content/uploads/2015/06/or-2015-anurag-google-scholar.pdf
>
> The video should be available from the OR15 website in the coming weeks.
>
> One of the common indexing "pitfalls" mentioned by Anurag was
> automatically inserting PDF Cover Pages into PDFs. From what I can
> recall, there are a few reasons this can be problematic:
>
> 1. Google Scholar (and possibly other search engines) attempts to
> extract metadata from the text of the PDF (using some language
> processing and format identification techniques). This includes
> auto-extracting title, abstract and author information from PDFs.
> Unfortunately, the addition of a PDF cover page often breaks this
> metadata extraction, which may result in the document not appearing in
> Google Scholar.
>
> 2. If all the PDF cover pages in your site look nearly identical (or
> completely identical), the Google Scholar indexer (and again possibly
> others) may wrongly flag the site for "cloaking" [1]. Essentially, it
> detects something is "fishy" as all the documents look very similar.
> This may result in the removal of the entire site from Google Scholar.
>
> So, to get to my question. In DSpace 5.0, we actually added a basic PDF
> Cover Page capability (which was requested by DCAT and others):
> https://wiki.duraspace.org/display/DSDOC5x/PDF+Citation+Cover+Page
>
> As this may have strong implications for inclusion in Google Scholar,
> should we consider removing this functionality from DSpace?
>
> For the time being, I've placed warnings in the Documentation for this
> feature to try to dissuade institutions from enabling it if Google
> Scholar inclusion is of high importance.
>
> This isn't really a technical issue (as we can easily remove code). But,
> I am interested in feedback from repository managers and users of DSpace
> to better inform our decisions on this feature going forward.
>
> Thanks,
>
> Tim
>
> [1] More on "cloaking", which can be a spamming technique to trick
> search engines (and is therefore actively blocked by many search
> engines): https://en.wikipedia.org/wiki/Cloaking