Hi Mark! (switch from General to Developers list...)

IMHO too, Solr is the way to go for the DSpace community as a whole,
for many reasons: the ones you mention, plus the possibility of integrating efficiently with VuFind, Blacklight, BibApp and Drupal
(a Solr index can be queried by other applications too).

For now, I am under heavy pressure to deliver a fully integrated system to manage thesauri, indexing and scientific literature.
My "by the lake" book is http://searchuserinterfaces.com
Meanwhile, I hope my users will evaluate my current development (faceting using Lucene and very few lines of Java code).

A good convention for encoding references to authority lists (or thesauri) is essential for easy Lucene (or Solr) processing. ONE precise authority reference (a concept within a (SKOS) scheme: the reference combines the scheme code and the concept code) must be ONE Lucene token, recognizable amongst WORDS (something must distinguish scheme+code references from ordinary words in the Lucene index).

Faceting is then easy: Lucene tokens can be filtered for references to authority lists, and those references can be sorted and counted. A field can be normalized or not, with concepts coming from one or more authority lists, since each reference identifies its scheme (in the JPG image sent in my previous message to the General list, you can see concepts that are substances, plants, general descriptors, etc.).

In the ASKOSI project, we chose to encode references with a very simple syntax: scheme_concept (the scheme ID is any NMTOKEN containing no underscore; the concept ID is also an NMTOKEN, underscores allowed). Any word containing a "_" (underscore) is then a candidate reference to an authority list. If the scheme and the concept on either side of the first "_" exist in the authority list, the reference is considered valid.
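
To illustrate the convention, here is a minimal Python sketch (not the ASKOSI code, which is Java). The authority table, the scheme and concept IDs, and the function names are all made up for the example. It splits a token at its first underscore, checks the two parts against the known schemes, and counts the valid references as facets:

```python
from collections import Counter

# Hypothetical authority data: scheme ID -> set of known concept IDs.
# Real data would come from the thesaurus manager, not a literal.
AUTHORITIES = {
    "subst": {"c001", "c002"},
    "plant": {"salix_alba"},
}

def parse_reference(token, authorities=AUTHORITIES):
    """Return (scheme, concept) if token is a valid reference, else None."""
    # The scheme ID has no underscore, so the FIRST "_" is the separator;
    # the concept ID may itself contain underscores.
    scheme, sep, concept = token.partition("_")
    if not sep or not scheme or not concept:
        return None  # ordinary word, or malformed candidate
    if concept in authorities.get(scheme, ()):
        return (scheme, concept)
    return None  # contains "_" but is not a known scheme/concept pair

def facet_counts(tokens, authorities=AUTHORITIES):
    """Count valid authority references among index tokens."""
    refs = (parse_reference(t, authorities) for t in tokens)
    return Counter(r for r in refs if r is not None)
```

Calling facet_counts() on the tokens of a result set gives a Counter whose most_common() method returns the facet values sorted by frequency.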

More info, starting from these pages:
http://www.destin.be/ASKOSI/Wiki.jsp?page=Projects
http://www.destin.be/ASKOSI/Wiki.jsp?page=Referring%20to%20Concepts
and more to come in the following weeks.

The document http://www.askosi.org/askosi_presentation.pdf presents ASKOSI and designates Solr as an essential part of our development. But I will first gather user experience (and future motivation) with the current Lucene-based implementation.

For DSpace indexes, we added "indexation types" in dspace.cfg:
* search (like now) for a normal word index.
   If an authority is present, its code is indexed, along with all its names (in all languages), its synonyms and its notations (identifiers in other coding systems, CAS for example)
   example: search.index.keyword = dc.subject, dc.subject.mesh
   Field content is not STORED in Lucene (only indexed)
* broadsearch for an index like "search" above, but where the codes of all BTs (SKOS broader and broadMatch) are also indexed: this allows searches to automatically encompass NTs (SKOS narrower and narrowMatch).
   Field content is not STORED in Lucene (only indexed)
   example: broadsearch.index.lc = dc.subject.classification
   example: broadsearch.index.allkeyword = dc.subject, dc.subject.classification, dc.subject.meshmaj, dc.subject.mesh, dc.subject.meshold, dc.subject.substance, dc.subject.person, dc.title, dc.title.alternative
* faceted for an index like "search" above, but field occurrences containing authority references are STORED in Lucene (everything is indexed) to allow faceting
   example: faceted.index.pubtype=dc.type
* sort: content is normalized (upper case, no trailing spaces) for an untokenized index usable to sort search results
   example: sort.index.titleissued=dc.title,dc.date.issued
* date: content is normalized as a date for indexing
   example: date.index.available=dc.date.available
* number: content is normalized as a number (range search) for indexing
   example: number.index.pmid=dc.identifier.pmid
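
As an illustration only, here is a Python sketch of how such configuration lines can be read and of what "broadsearch" adds at indexing time. It is not our Java code: the function names, the broader-term table and the "demo" scheme are hypothetical, and following broader terms transitively is an assumption of this sketch.

```python
def parse_index_line(line):
    """Split 'type.index.name = f1, f2' into (type, name, [fields])."""
    key, _, value = line.partition("=")
    # Raises ValueError if the key does not have three dot-separated parts.
    index_type, marker, name = key.strip().split(".", 2)
    if marker != "index":
        raise ValueError("not an index definition: " + line)
    fields = [f.strip() for f in value.split(",") if f.strip()]
    return index_type, name, fields

# Hypothetical broader-term table: reference -> direct broader references.
BROADER = {
    "demo_tuba": ["demo_brass"],
    "demo_brass": ["demo_instrument"],
}

def broadsearch_tokens(reference, broader=BROADER):
    """Return the reference followed by all its broader references.

    Indexing these extra tokens is what lets a search on a broad
    concept also find items tagged with narrower ones.
    """
    seen, stack = [], [reference]
    while stack:
        ref = stack.pop()
        if ref not in seen:
            seen.append(ref)
            stack.extend(broader.get(ref, []))
    return seen
```

With this table, an item tagged demo_tuba is also indexed under demo_brass and demo_instrument, so a query on demo_instrument retrieves it.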

I hope to be in a position to make the faceting function and the ASKOSI displays publicly available for www.WindMusic.org this autumn.

Have a nice day,

Christophe

On 19/07/2010 22:17, Mark Diggory wrote:
Hello Sauleha, Chris and everyone else interested in this topic,

I will comment that a number of individuals contacted me offline to offer words of support on the Discovery activities presented at OR10. Chris, I want to let you know that the work on Discovery is about using Solr as the "service" for search and browse capabilities in DSpace. This does not alleviate the need for good practices and detailed strategies for organizing search and browse fields for faceting in DSpace; it only designates where such work should go on.

One thing I did not present on (and wish I had) is that Discovery takes a minimalist approach to indexing DSpace objects, mapping the metadata fields and content verbatim, a strategy "dissimilar" to that of the current Lucene implementation. The old DSpace way of indexing was to create a set of properties in dspace.cfg that mapped things like

Lucene Author Field == dc.contributor.author + dc.creator

While we could have done the same for Discovery, we chose instead to use Solr to abstract this process away from DSpace entirely. Thus DSpace just emits the fields verbatim:

Item dc.contributor.author == Lucene dc.contributor.author

It is then left as an exercise for the Solr configuration to handle the merging, which it manages quite well without our having to hardcode such activities into the DSpace codebase itself.

Thus, with just Solr configuration, we attain the merging requirements for Dublin Core, such as:

dc.contributor = dc.contributor.*
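
The effect of such a wildcard merge rule can be sketched in a few lines of Python. This is only an illustration of the idea, not Discovery code; in Solr itself the rule would live in the schema configuration (copy-field style rules) rather than in application code. The rule table, function name and sample values are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical merge rules: destination field -> source field pattern.
MERGE_RULES = {"dc.contributor": "dc.contributor.*"}

def merge_fields(doc, rules=MERGE_RULES):
    """Return a copy of doc with merged destination fields added.

    doc maps verbatim metadata field names to lists of values.
    """
    merged = {field: list(values) for field, values in doc.items()}
    for dest, pattern in rules.items():
        for field, values in doc.items():
            # "dc.contributor.*" matches dc.contributor.author etc.,
            # but not dc.contributor itself or unrelated fields.
            if fnmatchcase(field, pattern):
                merged.setdefault(dest, []).extend(values)
    return merged
```

The input document keeps its verbatim fields; only the merged copy gains the combined dc.contributor field, which is the field a search or facet would then use.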

And the maintainer can do more complex customizations, such as configuring analyzers capable of parsing/tokenizing specific field values, for instance to index multilingual field values based on a dictionary lookup, or appropriate label values for an authority key stored in a controlled metadata field.

The ultimate objective of Discovery is to enable the complete replacement of significant portions of the hardcoded DSpace codebase with direct use of Solr. This replaces a resource-strapped, developer-centric activity with a small community (DSpace Search/Browse) with a more configurable process that has a much larger and more experienced enterprise community of support (Solr).

This said, there is still a need to improve how we organize our DSpace Items and the metadata therein into Solr indexes, and to decide on which side of the indexing process (DSpace Indexing Client vs Solr Request Handlers) it is more appropriate to do such activities. Chris, I would be very interested to see contributions on how to map such features as hierarchical controlled vocabularies and other well-defined / normalized preexisting taxonomies/vocabularies together with Solr facets to allow more complex faceting features. I will add that one tool we are considering to enhance Solr faceting further is the Bobo Browse integration with Solr (http://code.google.com/p/bobo-browse/wiki/SolrIntegration and http://snaprojects.jira.com/wiki/display/BOBO/Home). The intention here is to provide (1) sorting of facet values, (2) grouping of search results by multiple sort fields and (3) performance enhancements on top of Solr faceting.

So, finally, I challenge both Chris and Sauleha: you will get "more bang for your buck" if you target Solr for such features as auto-classification and auto-hierarchy building rather than hardcoding them into DSpace itself. You will ultimately target a community of users much larger than DSpace alone and possibly attain greater buy-in, peer review, contribution and reuse for such enhancements, all of which will feed back to benefit DSpace in the end.

If you hardwire such tooling to DSpace for a "quick win", you not only limit the exposure and success of your own work; if it is contributed into the core DSpace implementation, you will also oblige other DSpace stakeholders to help maintain it over the long term. This is the same problem that arose with the original Search/Browse implementation in DSpace. The DSpace community should always work to reuse more popular third-party solutions with large cross-market open source communities rather than inventing its own custom solutions, because the application targets a specific, narrow vertical market (Institutional Repositories) that is resource limited. Reuse avoids DSpace stakeholders being stuck with a stagnating codebase (known only by developers who have left the project) while the larger open source community continues to evolve.

Sincerely,
Mark

On Tue, Jul 13, 2010 at 4:38 AM, Christophe Dupriez <christophe.dupr...@destin.be <mailto:christophe.dupr...@destin.be>> wrote:

    Hi Sauleha!

    Solr in DSpace 1.6, for now, is used to generate statistical
    reports.
    Mark Diggory (@mire) is experimenting with integrating Solr as an
    indexing/search engine for DSpace Items
    (a project called DSpace Discovery).

    Thank you for bringing CASTANET to my attention: it seems a
    refreshing way to cope with indexing.
    I must learn more!
    http://www.powershow.com/view/1e363-NWU3N/Castanet_Using_WordNet_to_Build_Facet_Hierarchies
    I just learned about Flamenco and ordered the printed copy of
    the book:
    http://searchuserinterfaces.com
    which is probably a "must read" for all DSpace developers!

    Personally, I went through extensive improvements of the Lucene
    integration for DSpace 1.42.
    I wondered for much too long how to integrate Solr to provide
    faceting to my users.
    Finally, I have done it with Lucene alone (no Solr added!).
    It is rather simple (a few days of work) IF and ONLY IF your
    faceting data is perfectly controlled and normalized upfront.
    Our approach to control and normalization is described here:
    http://dsug09.ub.gu.se/index.php/dsug/dsug09/paper/viewFile/22/3

    I attach a JPG of the current result (a query about "MUSIC*" in a
    database of 90 thousand articles about toxicology).
    If it gets scrubbed, I can send it separately.

    It requires changes in a few classes:
    * DSQuery, to analyse the current search
    * Faceter, a new class to gather faceting information and to
    generate the desired output
    * and a modification in search\results.jsp to include a call to
    Faceter in the right column of the page.

    Much simpler than integrating Solr.
    BUT Solr in DSpace would bring many other benefits...
    if DSpace committers take it on their shoulders (too many
    modifications everywhere in the DSpace code for an "outsider").

    Good luck!

    Christophe Dupriez


    On 13/07/2010 11:03, Sauleha Durrani wrote:

    Dear all,

    I am trying to integrate multifaceted search with DSpace, and I am
    facing several issues.

        * Apache Solr provides faceted search over Lucene, but I am
          unable to understand how it works. Can anyone guide me on
          how Solr works? And will it help us in integrating
          multifaceted search with DSpace?
        * My other question: I am also working on a
          multifaceted algorithm. We have derived its idea
          from "CASTANET". Does anybody have another idea?

    I shall be anxiously waiting for a reply.
    Thank you.
    Take care
    Best Regards ..
    SAULEHA





    
------------------------------------------------------------------------------
    This SF.net email is sponsored by Sprint
    What will you do first with EVO, the first 4G phone?
    Visitsprint.com/first  <http://sprint.com/first>  -- 
http://p.sf.net/sfu/sprint-com-first


    _______________________________________________
    Dspace-general mailing list
    dspace-gene...@lists.sourceforge.net  
<mailto:dspace-gene...@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/dspace-general


    




--
Mark R. Diggory
Head of U.S. Operations - @mire

http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
