Developers, DCAT members and others,

Last week, Jonathan Markow (DuraSpace), Bram Luyten (@mire, DSpace Committer) and I had a brief call with the Google Scholar team (Anurag & Darcy) regarding DSpace. Anurag Acharya (Tech Lead) and Darcy Dapra (Product Manager) had emailed DuraSpace to ask for this meeting. We invited Bram along because of his Google Scholar coverage analysis work and general interest in this area.
Last year (around this same time) Anurag & Darcy had reported several indexing issues related to DSpace 3.x and below. All of those issues have now been resolved in DSpace 4.0 thanks to the hard work of our Committers & Developers!

* https://jira.duraspace.org/browse/DS-1481
* https://jira.duraspace.org/browse/DS-1482
* https://jira.duraspace.org/browse/DS-1483

This year's meeting seemed more positive in many ways. Anurag said he felt like DSpace's indexability is improving in recent releases (esp. based on the flexibility of the system, and the ability to customize it heavily). There were no new major DSpace issues they had to report. Rather, Anurag mostly wanted to see if there were ways to either "make it harder" to mis-configure DSpace or "provide better warnings/advice/detection" with regards to configuring DSpace properly. Anurag specifically mentioned that often DSpace coverage issues in Google Scholar are caused by a misconfiguration of DSpace.

So, essentially there were two main ideas/brainstorms that came up in the discussion:

(1) Think about building an "Indexing problems detection tool/service" for DSpace users. This may help DSpace users detect issues more immediately, and hopefully help them get better coverage in GS & similar.

* Would attempt to programmatically detect configuration issues in a DSpace site that would cause indexing problems (especially with GS)
* Simple examples may be:
    * An improperly configured robots.txt
    * Sitemaps are missing/disabled
    * Missing or incorrect "citation" meta tags in HTML (which is what Google Scholar uses)
    * (Anurag will forward on some more specific examples he's seen)
* Anurag feels it'd be best to make such a tool DSpace-specific, as it's easier to "guess" where things should be (e.g. we know the path of the DSpace sitemaps, we know what a good "robots.txt" looks like for DSpace, etc)
* Such a tool likely would NOT need to crawl/scan an entire DSpace site. Rather it would just check a small sample for possible known issues.
* Such a tool could be something users run themselves, or even perhaps a hosted service (off of http://dspace.org or similar) where users could enter their DSpace URL and get a report back.
* It's unknown as of yet who would build this tool or what it would look like exactly. I'll be talking with DuraSpace and DSpace Committers about it. But all on the call agreed this sounds like an interesting idea to investigate further.

(2) Possibly improve DSpace default settings (try to enable things most people really should have enabled if they want good search engine coverage) and/or make it harder to disable some indexing related features.

* E.g. Could we make Sitemaps enabled by default, and also autogenerated? (i.e. no longer require they be enabled & updated via a cron job) (https://jira.duraspace.org/browse/DS-1901)
* E.g. If we make the default DC schema "read-only" (or mostly read-only), we could better standardize the crosswalking to the "citation" metatags that Google Scholar needs. Currently if someone removes/changes dc metadata fields, it may accidentally affect what is displayed in "citation" metatags. (This idea is already being suggested by https://jira.duraspace.org/browse/DS-1631)
* Might also be worth reviewing our default "robots.txt" files for XMLUI and JSPUI.

Overall though, I felt it was good to hear mostly positive feedback from the Google Scholar team!

REMINDER: If you are wondering whether your DSpace instance is following best practices, we recommend reviewing these guidelines in our documentation:
https://wiki.duraspace.org/display/DSDOC4x/Search+Engine+Optimization

Comments/Questions/Thoughts welcome.
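To make idea (1) a bit more concrete, here is a minimal Python sketch of the three "simple examples" listed there (robots.txt, sitemaps, "citation" meta tags). It is only an illustration under stated assumptions, not a design anyone on the call agreed on. All function names are hypothetical, the tag list is a small assumed subset of what Google Scholar looks for, and treating a "Sitemap:" line in robots.txt as the sign that sitemaps are enabled is just one possible heuristic. The checks take already-downloaded text, so fetching a small sample of pages is left to the caller.

```python
from html.parser import HTMLParser
from urllib import robotparser

# Assumption: a small subset of the "citation" meta tags Google Scholar reads.
# A real tool would take its list from the Google Scholar inclusion guidelines.
REQUIRED_CITATION_TAGS = (
    "citation_title",
    "citation_author",
    "citation_publication_date",
)


class MetaNameCollector(HTMLParser):
    """Collect the (lowercased) name attribute of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.add(name.lower())


def missing_citation_tags(item_html, required=REQUIRED_CITATION_TAGS):
    """Return the required citation tags that are absent from an item page."""
    collector = MetaNameCollector()
    collector.feed(item_html)
    return [tag for tag in required if tag not in collector.names]


def robots_blocks(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text forbids crawling url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def robots_lists_sitemap(robots_txt):
    """Heuristic (an assumption): sitemaps count as advertised if robots.txt
    contains at least one 'Sitemap:' line."""
    return any(
        line.strip().lower().startswith("sitemap:")
        for line in robots_txt.splitlines()
    )


def detect_problems(robots_txt, item_url, item_html):
    """Run all checks on one sampled item page; return a list of messages."""
    problems = []
    if robots_blocks(robots_txt, item_url):
        problems.append("robots.txt blocks the item page: " + item_url)
    if not robots_lists_sitemap(robots_txt):
        problems.append("robots.txt does not list a sitemap")
    missing = missing_citation_tags(item_html)
    if missing:
        problems.append("missing citation meta tags: " + ", ".join(missing))
    return problems
```

A caller would download robots.txt and a handful of item pages from the site, call detect_problems() once per sampled page, and print whatever messages come back. An empty list means none of these three known issues was found on that page.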
- Tim

--
Tim Donohue
Technical Lead for DSpace & DSpaceDirect
DuraSpace.org | DSpace.org | DSpaceDirect.org