[Dspace-devel] How solr works with dpsace

Christophe Dupriez Tue, 20 Jul 2010 05:59:46 -0700

Hi Mark! (switch from General to Developers list...)

IMHO too, SolR is the way to go for the DSpace Community as a whole.

For many reasons: the ones you mention, plus the possibility of integrating efficiently with VuFind, BlackLight, BibApp and Drupal (a SolR index can be consulted by other applications too).

For now, I am under heavy pressure to deliver a wholly integrated system to manage thesauri, indexing and scientific literature.

My "by the lake" book is http://searchuserinterfaces.com

Meanwhile, I hope my users will evaluate my current development (faceting using Lucene and very few lines of Java code).

A good convention for encoding references to authority lists (or thesauri) is essential to achieve easy Lucene (or SolR) processing.
ONE precise authority reference (a concept within a (SKOS) scheme: the reference combines the scheme code and the concept code) must be ONE Lucene token, recognizable amongst WORDS (something must distinguish scheme+code references from ordinary words in the Lucene index).
Faceting is then easy: Lucene tokens can be filtered for references to authority lists, and these references can be sorted and counted.
And a field can be normalized or not, with concepts coming from one or more different authority lists (each reference identifies its scheme) (in the JPG image sent in the previous message to the General list, you see concepts being substances, plants, general descriptors, etc.).
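To illustrate the filter-and-count step, here is a minimal Python sketch. It is not the actual DSpace/Lucene code (which is Java); it assumes each hit is available as a list of index tokens and that some predicate tells references apart from ordinary words. The names `count_facets` and `is_reference`, and the sample tokens, are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def count_facets(
    docs: Iterable[Iterable[str]],
    is_reference: Callable[[str], bool],
) -> List[Tuple[str, int]]:
    """Count, per authority reference, how many documents contain it."""
    counts: Counter = Counter()
    for tokens in docs:
        # A set, so a reference repeated inside one document counts once.
        counts.update({t for t in tokens if is_reference(t)})
    # Most frequent first; ties broken alphabetically for a stable display.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical result set: each inner list is the token stream of one hit.
hits = [
    ["music", "mesh_C1", "plant_P42"],
    ["therapy", "mesh_C1"],
    ["noise", "plant_P42", "mesh_C1", "mesh_C1"],
]
# Assumption for this sketch only: a reference is any token with an underscore.
facets = count_facets(hits, lambda token: "_" in token)
```

Because every reference is a single token that names its own scheme, one pass over the tokens of the result set is enough; no per-field lookup is needed.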

In the ASKOSI project, we chose to encode references with a very simple syntax: scheme_concept (the scheme ID being any NMTOKEN containing no underscore. The concept ID is also an NMTOKEN, underscores allowed).
Any word containing a "_" (underscore) is then a candidate reference to an authority list.
If the scheme and the concept mentioned around the "_" exist in the authority list, the reference is considered valid.
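A minimal Python sketch of this rule follows; it is not taken from ASKOSI. The authority data and the names `AUTHORITIES`, `parse_reference` and `is_valid_reference` are hypothetical, and the NMTOKEN syntax check is left out. Since the scheme ID contains no underscore while the concept ID may, the token is split at its first underscore.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical authority lists: scheme ID -> set of known concept IDs.
AUTHORITIES: Dict[str, Set[str]] = {
    "mesh": {"C1", "C2"},
    "plant": {"P42", "P_7"},  # concept IDs may themselves contain underscores
}


def parse_reference(token: str) -> Optional[Tuple[str, str]]:
    """Split a token at its FIRST underscore into (scheme, concept).

    Returns None when the token has no underscore or an empty side,
    i.e. when it is not even a candidate reference.
    """
    scheme, sep, concept = token.partition("_")
    if not sep or not scheme or not concept:
        return None
    return scheme, concept


def is_valid_reference(
    token: str, authorities: Dict[str, Set[str]] = AUTHORITIES
) -> bool:
    """A candidate is valid only if its scheme and concept both exist."""
    parsed = parse_reference(token)
    if parsed is None:
        return False
    scheme, concept = parsed
    return concept in authorities.get(scheme, set())
```

An ordinary word that happens to contain an underscore (say `user_name`) is a candidate but fails validation, because `user` is not a known scheme.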


More info starting from page:
http://www.destin.be/ASKOSI/Wiki.jsp?page=Projects
http://www.destin.be/ASKOSI/Wiki.jsp?page=Referring%20to%20Concepts
and more to come in the following weeks.

The document http://www.askosi.org/askosi_presentation.pdf presents ASKOSI and designates SolR as an essential part of our development.
But I will gather user experience (and future motivation) with the current Lucene-based implementation.


For DSpace indexes, we added "indexation types" in dspace.cfg:
* search (like now) for a normal word index.
   If an authority is present, its code is indexed, but also all its names (in all languages), its synonyms and notations (identifiers in other coding systems, CAS for example)
   Field content is not STORED in Lucene (only indexed)
   example: search.index.keyword = dc.subject, dc.subject.mesh

* broadsearch for an index like "search" above, but where the codes of all BTs (SKOS broader and broadMatch) are also indexed: this allows searches to automatically encompass NTs (SKOS narrower and narrowMatch).
   Field content is not STORED in Lucene (only indexed)
   example: broadsearch.index.lc = dc.subject.classification
   example: broadsearch.index.allkeyword=dc.subject,dc.subject.classification,dc.subject.meshmaj,dc.subject.mesh, dc.subject.meshold,dc.subject.substance,dc.subject.person, dc.title,dc.title.alternative

* faceted for an index like "search" above, but field occurrences containing authority references are STORED in Lucene (everything is indexed) to allow faceting
   example: faceted.index.pubtype=dc.type

* sort: content is normalized (upper case, no trailing spaces) for an untokenized index usable to sort search results
   example: sort.index.titleissued=dc.title,dc.date.issued

* date: content is normalized as a date for indexing
   example: date.index.available=dc.date.available

* number: content is normalized as a number (range search) for indexing
   example: number.index.pmid=dc.identifier.pmid
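As an illustration of how these configuration lines could be read, here is a small Python sketch. It is not the DSpace code (which is Java); `IndexDef`, `parse_index_line` and `sort_key` are hypothetical names, and the only behaviour assumed is what the list above states: a key of the form type.index.name, a comma-separated field list, and upper-casing plus trailing-space removal for "sort" indexes.

```python
from typing import List, NamedTuple


class IndexDef(NamedTuple):
    index_type: str      # search, broadsearch, faceted, sort, date or number
    name: str            # index name, e.g. "keyword"
    fields: List[str]    # metadata fields feeding the index


INDEX_TYPES = {"search", "broadsearch", "faceted", "sort", "date", "number"}


def parse_index_line(line: str) -> IndexDef:
    """Parse one '<type>.index.<name> = <field>, <field>, ...' line."""
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in {line!r}")
    parts = key.strip().split(".")
    if len(parts) != 3 or parts[1] != "index" or parts[0] not in INDEX_TYPES:
        raise ValueError(f"not an index definition: {line!r}")
    # Tolerate spaces around commas, as in the examples above.
    fields = [f.strip() for f in value.split(",") if f.strip()]
    return IndexDef(parts[0], parts[2], fields)


def sort_key(value: str) -> str:
    """Normalize a value for a 'sort' index: upper case, no trailing spaces."""
    return value.rstrip().upper()
```

For example, `parse_index_line("faceted.index.pubtype=dc.type")` yields the type `faceted`, the index name `pubtype` and the single field `dc.type`.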

I hope to be in a position to make the faceting function and the ASKOSI displays publicly available for www.WindMusic.org this autumn.


Have a nice day,

Christophe

On 19/07/2010 22:17, Mark Diggory wrote:

Hello Sauleha, Chris and everyone else interested in this topic,
I will comment that a number of individuals contacted me offline to offer words of support on the Discovery activities presented at OR10. Chris, I want to let you know that the work on Discovery is about using Solr as the "service" for search and browse capabilities in DSpace, and this does not alleviate the need to have good practices and detailed strategies for how to organize search and browse fields for faceting in DSpace. It only designates where such work should go on.
At this time, one thing I did not present on (that I wish I had) is that Discovery takes a minimalist approach to indexing DSpace objects, mapping verbatim the metadata fields and content in a strategy "dissimilar" to that of the current Lucene implementation. The old DSpace way of indexing was to create a set of properties in the dspace.cfg that mapped things like
Lucene Author Field == dc.contributor.author + dc.creator
While we could have done the same for Discovery, we chose instead to use Solr to abstract this process away from DSpace entirely. Thus DSpace just issues that verbatim:
Item dc.contributor.author == Lucene dc.contributor.author
It is then left as an exercise for the Solr configuration to process the merging, which it manages quite well without our having to hardcode such activities into the DSpace codebase itself.
Thus we attain, with just Solr configuration, the merging requirements for Dublin Core such as:
dc.contributor = dc.contributor.*
And the maintainer can do more complex customizations, such as configuring analyzers capable of parsing/tokenizing specific field values, etc. For instance, if one wanted to index multilingual field values based on a dictionary lookup, or appropriate label values for an authority key stored in a controlled metadata field.
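The effect of a merging rule such as dc.contributor = dc.contributor.* can be imitated in a few lines of Python. This is only an illustration of the idea, not Discovery code and not Solr configuration (in Solr itself such a rule lives in the schema, typically as a copyField declaration); `MERGE_RULES`, `merge_fields` and the sample item are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# Hypothetical merge rules: target field -> glob pattern over source fields.
MERGE_RULES: Dict[str, str] = {"dc.contributor": "dc.contributor.*"}


def merge_fields(
    doc: Dict[str, List[str]], rules: Dict[str, str] = MERGE_RULES
) -> Dict[str, List[str]]:
    """Return a copy of doc in which each target field also holds the values
    of every source field whose name matches the rule's pattern."""
    merged = {name: list(values) for name, values in doc.items()}
    for target, pattern in rules.items():
        for name, values in doc.items():
            if fnmatchcase(name, pattern):
                merged.setdefault(target, []).extend(values)
    return merged


# DSpace side: fields are sent verbatim, one index field per metadata field.
item = {
    "dc.contributor.author": ["Doe, Jane"],
    "dc.contributor.editor": ["Roe, Richard"],
    "dc.title": ["An example item"],
}
# Index side: the merged field appears without DSpace knowing about it.
indexed = merge_fields(item)
```

The point of the design is visible here: `item` is untouched and knows nothing about the rule, so changing what gets merged is a configuration change on the index side only.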
The ultimate objective of Discovery is to enable the complete replacement of significant portions of the hardcoded DSpace codebase with just usage of Solr directly. This replaces a resource-strapped, developer-centric activity with a small community (DSpace Search/Browse) with a more configurable process backed by a much larger and more experienced enterprise community of support (Solr).
This said, there is still a need to improve how we organize our DSpace Items and the metadata therein into Solr indexes, and to decide on which side of the indexing process (DSpace Indexing Client vs Solr Request Handlers) it is more appropriate to do such activities. Chris, I would be very interested to see contributions on how to map such features as hierarchical controlled vocabularies and other well-defined / normalized preexisting taxonomies/vocabularies together with Solr facets to allow more complex faceting features. I will add that one tool we are considering to enhance Solr faceting further is the Bobo Browse integration with Solr (http://code.google.com/p/bobo-browse/wiki/SolrIntegration and http://snaprojects.jira.com/wiki/display/BOBO/Home). The intention here is to provide (1) sorting of facet values, (2) grouping of search results by multiple sort fields and (3) performance enhancements on top of Solr faceting.
So I finally challenge both Chris and Sauleha: you will get "more bang for your buck" if you target Solr for such features as auto-classification and auto-hierarchy building rather than hardcoding them into DSpace itself. You will ultimately target a community of users much larger than DSpace alone and possibly attain greater buy-in, peer review, contribution and reuse on such enhancements. All of which will feed back to benefit DSpace in the end.
If you hardwire such tooling to DSpace for a "quick win", you not only limit the exposure and success of your own work, but if it is contributed into the core DSpace implementation, you will also be forcing other DSpace stakeholders to assist in maintaining it over the long term. This is the same problem that arose with the original Search/Browse implementation in DSpace. The DSpace community should always work to reuse more popular third-party solutions with large cross-market OS communities rather than inventing its own custom solutions. This is because the application targets a specific narrow vertical market for Institutional Repositories that is resource limited. Reuse avoids DSpace stakeholders being stuck with a stagnating codebase (only known by developers that have left the project) while the larger OS community continues to evolve.
Sincerely,
Mark
On Tue, Jul 13, 2010 at 4:38 AM, Christophe Dupriez <christophe.dupr...@destin.be> wrote:
    Hi Sauleha!

    SolR in DSpace 1.6, for now, is used to manage the generation of
    statistical reports.
    Mark Diggory (@mire) is experimenting with integrating SolR as an
    indexing/search engine for DSpace Items
    (project called DSpace Discovery).

    Thank you for bringing CASTANET to my attention: it seems a
    refreshing way to cope with indexing.
    I must learn more!
    http://www.powershow.com/view/1e363-NWU3N/Castanet_Using_WordNet_to_Build_Facet_Hierarchies
    I just learned about Flamenco and ordered the printed copy of
    the book:
    http://searchuserinterfaces.com
    which is probably a "must read" for all DSpace developers!

    Personally, I went through extensive improvements of the Lucene
    integration for DSpace 1.42.
    I wondered for much too long how to integrate SolR to provide
    faceting to my users.
    Finally, I have done it with Lucene alone (no SolR added!).
    It is rather simple (a few days of work) IF and ONLY IF your
    faceting data is perfectly controlled and normalized upfront.
    Our approach to control and normalization is described here:
    http://dsug09.ub.gu.se/index.php/dsug/dsug09/paper/viewFile/22/3

    I attach a JPG of the current result (query about "MUSIC*" in a
    database of 90 thousand articles about toxicology).
    If it gets scrubbed, I can send it separately.

    It requires some changes in classes:
    * DSQuery to analyse the current search
    * Faceter, a new class to gather faceting information and to
    generate desired output
    * and a modification in search\results.jsp to include a call to
    Faceter in the right column of the page.

    Much simpler than integrating SolR.
    BUT, SolR in DSpace would bring many other benefits....
    If DSpace committers take it on their shoulders (too many
    modifications everywhere in DSpace code for an "outsider")

    Good luck!

    Christophe Dupriez


    On 13/07/2010 11:03, Sauleha Durrani wrote:
    Dear all,

    I am trying to integrate multifaceted search with DSpace. I am
    facing several issues.

        * Apache Solr provides faceted search over Lucene, but I am
          unable to understand how it works. Can anyone guide me on
          how Solr works? And will it help us in integrating
          multifaceted search with DSpace?
        * My other question is that I am also working on a
          multifaceted algorithm. We have derived its idea
          from "CASTANET". Does anybody have another idea?

    I shall be anxiously waiting for a reply.
    Thank you.
    Take care
    Best Regards ..
    SAULEHA




    _______________________________________________
    Dspace-general mailing list
    dspace-gene...@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/dspace-general
    




--
Mark R. Diggory
Head of U.S. Operations - @mire

http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther


_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
