Developers, DCAT members and others,

Last week, Jonathan Markow (DuraSpace), Bram Luyten (@mire, DSpace 
Committer) and I had a brief call about DSpace with the Google Scholar 
team: Anurag Acharya (Tech Lead) and Darcy Dapra (Product Manager), who 
had emailed DuraSpace to ask for this meeting. We invited Bram along 
because of his Google Scholar coverage analysis work and general 
interest in this area.

Around this time last year, Anurag & Darcy reported several indexing 
issues affecting DSpace 3.x and below. All of those issues have now 
been resolved in DSpace 4.0 thanks to the hard work of our 
Committers & Developers!
* https://jira.duraspace.org/browse/DS-1481
* https://jira.duraspace.org/browse/DS-1482
* https://jira.duraspace.org/browse/DS-1483

This year's meeting seemed more positive in many ways. Anurag said he 
feels that DSpace's indexability has been improving in recent releases 
(esp. given the flexibility of the system and the ability to customize 
it heavily).

They had no major new DSpace issues to report. Rather, Anurag mostly 
wanted to see if there were ways to either "make it harder" to 
mis-configure DSpace or "provide better warnings/advice/detection" to 
help people configure DSpace properly. Anurag specifically mentioned 
that DSpace coverage issues in Google Scholar are often caused by a 
misconfiguration of DSpace.

So, essentially there were two main ideas/brainstorms that came up in 
the discussion:

(1) Think about building an "Indexing problems detection tool/service" 
for DSpace users. This may help DSpace users detect issues sooner, and 
hopefully help them get better coverage in Google Scholar (GS) and 
similar services.
     * Would attempt to programmatically detect configuration issues in 
a DSpace site that would cause indexing problems (especially with GS)
     * Simple examples may be:
          * An improperly configured robots.txt
          * Sitemaps are missing/disabled
          * Missing or incorrect "citation" meta tags in HTML (which is 
what Google Scholar uses)
          * (Anurag will forward on some more specific examples he's seen)
     * Anurag feels it'd be best to make such a tool DSpace-specific, as 
it's easier to "guess" where things should be (e.g. we know the path of 
the DSpace sitemaps, we know what a good "robots.txt" looks like for 
DSpace, etc)
     * Such a tool likely would NOT need to crawl/scan an entire DSpace 
site. Rather it would just check a small sample for possible known issues.
     * Such a tool could be something users run themselves, or even 
perhaps a hosted service (off of http://dspace.org or similar) where 
users could enter their DSpace URL and get a report back.
     * It's unknown as of yet who would build this tool or what it would 
look like exactly. I'll be talking with DuraSpace and DSpace Committers 
about it. But all on the call agreed this sounds like an interesting 
idea to investigate further.
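
To make the idea a bit more concrete, here is a rough Python sketch of 
two of the checks listed above (robots.txt rules and "citation" meta 
tags). It is only an illustration, not a proposed design. The function 
names, the list of required tags, and the default "Googlebot" user agent 
are all assumptions on my part; the real checklist would come from the 
examples Anurag forwards. The checks work on text that has already been 
downloaded, so fetching pages (and checking whether sitemaps exist) is 
left out.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumed minimum set of "citation" meta tags to look for; this list is
# illustrative and would need to match the final agreed checklist.
REQUIRED_TAGS = ("citation_title", "citation_author",
                 "citation_publication_date")


class _MetaCollector(HTMLParser):
    """Collects the names of non-empty citation_* meta tags in a page."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        # Only count tags that actually carry a value.
        if name.startswith("citation_") and attr.get("content"):
            self.names.add(name)


def missing_citation_tags(html, required=REQUIRED_TAGS):
    """Return the required citation tags that are absent from the HTML."""
    collector = _MetaCollector()
    collector.feed(html)
    return [tag for tag in required if tag not in collector.names]


def blocked_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the paths that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]
```

A tool like this would call missing_citation_tags() on a small sample of 
item pages and blocked_paths() on the site's robots.txt with the paths 
that crawlers need to reach (item pages, sitemaps, bitstreams), then 
report anything that comes back non-empty.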

(2) Possibly improve DSpace default settings (try to enable things most 
people really should have enabled if they want good search engine 
coverage) and/or make it harder to disable some indexing related features.
      * E.g. Could we make Sitemaps enabled by default, and also 
autogenerated? (i.e. no longer require they be enabled & updated via a 
cron job) (https://jira.duraspace.org/browse/DS-1901)
      * E.g. If we make the default DC schema "read-only" (or mostly 
read-only), we could better standardize the crosswalking to the 
"citation" meta tags that Google Scholar needs. Currently, if someone 
removes/changes DC metadata fields, it may accidentally affect what is 
displayed in the "citation" meta tags. (This idea is already being 
suggested by https://jira.duraspace.org/browse/DS-1631)
      * Might also be worth reviewing our default "robots.txt" files for 
XMLUI and JSPUI.

Overall though, I felt it was good to hear mostly positive feedback from 
the Google Scholar team!

REMINDER: If you are wondering whether your DSpace instance is following 
best practices, we recommend reviewing these guidelines in our 
documentation: 
https://wiki.duraspace.org/display/DSDOC4x/Search+Engine+Optimization

Comments/Questions/Thoughts welcome.

- Tim

-- 
Tim Donohue
Technical Lead for DSpace & DSpaceDirect
DuraSpace.org | DSpace.org | DSpaceDirect.org
