Hi Mark! (Switching from the General list to the Developers list...)
IMHO too, Solr is the way to go for the DSpace community as a whole,
for many reasons: the ones you mention, plus the possibility of
integrating efficiently with VuFind, Blacklight, BibApp and Drupal
(a Solr index can be consulted by other applications too).
For now, I am under heavy pressure to deliver a fully integrated
system to manage thesauri, indexing and scientific literature.
My "by the lake" book is http://searchuserinterfaces.com
Meanwhile, I hope my users will evaluate my current development
(faceting using Lucene and very few lines of Java code).
A good convention for encoding references to authority lists (or
thesauri) is essential for easy Lucene (or Solr) processing.
ONE precise authority reference (a concept within a (SKOS) scheme: the
reference combines the scheme code and the concept code) must be ONE
Lucene token, recognizable amongst ordinary WORDS
(something must distinguish scheme+code references from ordinary words
in the Lucene index).
Faceting is then easy: Lucene tokens can be filtered to keep only the
references to authority lists, and those references can then be sorted
and counted.
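To make the filter/count/sort idea concrete, here is a minimal Python
sketch. It is not the actual implementation (which works on the Lucene
index): it assumes each document is already a list of tokens, it treats
any token containing an underscore as a reference (the ASKOSI convention
described further down), and the scheme and concept codes are invented
for illustration.

```python
from collections import Counter


def is_reference(token):
    """Assumed test: a token containing "_" is a reference to an authority list."""
    return "_" in token


def facet_counts(documents):
    """Count, for each reference, how many documents contain it."""
    counts = Counter()
    for tokens in documents:
        # A set, so that a reference repeated in one document counts once.
        counts.update({t for t in tokens if is_reference(t)})
    # Most frequent first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical tokenized documents; SUBST and PLANT are made-up scheme codes.
docs = [
    ["toxicity", "of", "SUBST_c1", "in", "PLANT_c7"],
    ["SUBST_c1", "exposure"],
    ["PLANT_c7", "PLANT_c7", "alkaloids"],
    ["SUBST_c1", "dosage"],
]

for reference, count in facet_counts(docs):
    print(reference, count)
```

Counting documents rather than occurrences is a design choice of this
sketch: it gives the "number of hits" usually shown next to a facet value.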
And a field can be normalized or not, with concepts coming from one or
more different authority lists (each reference identifies its scheme).
In the JPG image sent in the previous message to the General list, you
can see concepts that are substances, plants, general descriptors, etc.
In the ASKOSI project, we chose to encode references with a very
simple syntax: scheme_concept
(the scheme ID being any NMTOKEN containing no underscore; the concept
ID is also an NMTOKEN, underscores allowed).
Any word containing a "_" (underscore) is then a candidate reference to
an authority list.
If the scheme and the concept on either side of the "_" exist in
the authority list, the reference is considered valid.
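The rule above can be sketched in a few lines of Python. This is only an
illustration under assumptions: the authority lists are modelled as a
plain dictionary from scheme ID to a set of concept IDs, the codes are
invented, and NMTOKEN validity is not checked. Because the scheme ID
contains no underscore, the split is made at the FIRST underscore.

```python
def parse_reference(token):
    """Split "scheme_concept" at the first underscore.

    Returns (scheme, concept), or None when the token is not a candidate
    reference (no underscore, or an empty scheme or concept part).
    """
    scheme, separator, concept = token.partition("_")
    if not separator or not scheme or not concept:
        return None
    return scheme, concept


def is_valid_reference(token, authorities):
    """A candidate is valid when both its scheme and its concept exist."""
    parsed = parse_reference(token)
    if parsed is None:
        return False
    scheme, concept = parsed
    return concept in authorities.get(scheme, set())


# Hypothetical authority lists: scheme ID -> known concept IDs.
AUTHORITIES = {
    "PLANT": {"c7", "c_12"},  # a concept ID may itself contain "_"
    "SUBST": {"c1"},
}

print(parse_reference("PLANT_c_12"))
print(is_valid_reference("PLANT_c7", AUTHORITIES))
print(is_valid_reference("user_name", AUTHORITIES))
```

An ordinary word that happens to contain an underscore (such as
"user_name" here) is a candidate but fails validation, so it stays an
ordinary word.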
More information, starting from these pages:
http://www.destin.be/ASKOSI/Wiki.jsp?page=Projects
http://www.destin.be/ASKOSI/Wiki.jsp?page=Referring%20to%20Concepts
and more to come in the following weeks.
The document http://www.askosi.org/askosi_presentation.pdf presents
ASKOSI and designates Solr as an essential part of our development.
But I will first gather user experience (and future motivation) with
the current Lucene-based implementation.
For DSpace indexes, we added "indexation types" in dspace.cfg:
* search (as now): a normal word index.
  If an authority is present, its code is indexed, along with all its
  names (in all languages), its synonyms and its notations (identifiers
  in other coding systems, CAS for example).
  Field content is not STORED in Lucene (only indexed).
  example: search.index.keyword = dc.subject, dc.subject.mesh
* broadsearch: an index like "search" above, but where the codes of
  all BTs (SKOS broader and broadMatch) are also indexed: this allows
  searches to automatically encompass NTs (SKOS narrower and
  narrowMatch).
  Field content is not STORED in Lucene (only indexed).
  example: broadsearch.index.lc = dc.subject.classification
  broadsearch.index.allkeyword =
  dc.subject, dc.subject.classification, dc.subject.meshmaj,
  dc.subject.mesh, dc.subject.meshold,
  dc.subject.substance, dc.subject.person, dc.title, dc.title.alternative
* faceted: an index like "search" above, but field occurrences
  containing authority references are STORED in Lucene (everything is
  indexed) to allow faceting.
  example: faceted.index.pubtype = dc.type
* sort: content is normalized (upper case, no trailing spaces) for an
  untokenized index usable to sort search results.
  example: sort.index.titleissued = dc.title, dc.date.issued
* date: content is normalized as a date for indexing.
  example: date.index.available = dc.date.available
* number: content is normalized as a number (range search) for indexing.
  example: number.index.pmid = dc.identifier.pmid
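The difference between the "search" and "broadsearch" types in the list
above can be sketched as follows, in Python. It is a simplification under
stated assumptions: the thesaurus is a hypothetical dictionary mapping a
reference to its direct broader references, BTs are followed
transitively (the message only says "all BTs"), and the names, synonyms
and notations that are also indexed are left out.

```python
# Hypothetical thesaurus fragment: reference -> direct broader references.
BROADER = {
    "PLANT_c7": ["PLANT_c2"],
    "PLANT_c2": ["PLANT_c1"],
}


def broader_closure(reference, broader):
    """Return every direct and indirect broader reference, without duplicates."""
    found = []
    pending = list(broader.get(reference, []))
    while pending:
        current = pending.pop()
        if current not in found:
            found.append(current)
            pending.extend(broader.get(current, []))
    return found


def index_terms(reference, broader, index_type):
    """Codes to index for one authority reference found in a field."""
    terms = [reference]
    if index_type == "broadsearch":
        # Indexing the BT codes lets a query on a broad concept
        # match documents tagged with any of its NTs.
        terms += broader_closure(reference, broader)
    return terms


print(index_terms("PLANT_c7", BROADER, "search"))
print(index_terms("PLANT_c7", BROADER, "broadsearch"))
```

With the "broadsearch" terms, a document tagged only with PLANT_c7 is
found by a query on PLANT_c1, with no query expansion needed at search
time.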
I hope to be in a position to make the faceting function and the
ASKOSI displays publicly available for www.WindMusic.org this autumn.
Have a nice day,
Christophe
On 19/07/2010 22:17, Mark Diggory wrote:
Hello Sauleha, Chris and everyone else interested in this topic,
I will comment that a number of individuals contacted me offline to
offer words of support on the discovery activities presented at OR10.
Chris, I want to let you know that the work on Discovery is about
using Solr as the "service" for search and browse capabilities in
DSpace, and this does not alleviate the need to have good practices
and detailed strategies for how to organize search and browse fields
for faceting in DSpace. It only designates where such work should go on.
At this time, one thing I did not present on (and wish I had) is
that Discovery takes a minimalist approach to indexing DSpace objects,
mapping the metadata fields and content verbatim, a strategy
"dissimilar" to that of the current Lucene implementation. The old
DSpace way of indexing was to create a set of properties in
dspace.cfg that mapped things like
Lucene Author Field == dc.contributor.author + dc.creator
While we could have done the same for Discovery, we chose instead to
use Solr to abstract this process away from DSpace entirely. Thus
DSpace just issues the fields verbatim:
Item dc.contributor.author == Lucene dc.contributor.author
It is then left as an exercise for the Solr configuration
to process the merging, which it manages quite well without our having
to hardcode such activities into the DSpace codebase itself.
Thus, with just Solr configuration, we attain the merging requirements
for Dublin Core, such as:
dc.contributor = dc.contributor.*
And the maintainer can do more complex customizations, such as
configuring analyzers capable of parsing/tokenizing specific field
values, for instance if one wanted to index multilingual field
values based on a dictionary lookup, or appropriate label values for an
authority key stored in a controlled metadata field.
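As an illustration of the merging rule quoted above
(dc.contributor = dc.contributor.*), here is a small Python sketch. It
does not use Solr at all: it only imitates, on a plain dictionary, the
effect that a wildcard field-copy rule in the Solr schema (a copyField
declaration, for example) would have. The rule list, the function name
and the sample values are assumptions made for this example.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (source field pattern, destination field).
MERGE_RULES = [("dc.contributor.*", "dc.contributor")]


def apply_merge_rules(document, rules):
    """Return a copy of the document with the merged fields added.

    The verbatim fields are kept as they are; values of every field
    matching a source pattern are appended to the destination field.
    """
    merged = {field: list(values) for field, values in document.items()}
    for pattern, destination in rules:
        for field, values in document.items():
            if fnmatchcase(field, pattern):
                merged.setdefault(destination, []).extend(values)
    return merged


# A made-up item, with fields issued verbatim as described above.
item = {
    "dc.contributor.author": ["Author A"],
    "dc.contributor.editor": ["Editor B"],
    "dc.title": ["Some title"],
}

print(apply_merge_rules(item, MERGE_RULES)["dc.contributor"])
```

The point of the design is that the rule list lives in configuration,
not in the application code that issues the fields.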
The ultimate objective of Discovery is to enable the complete
replacement of significant portions of the hardcoded DSpace codebase
with direct usage of Solr. This replaces a resource-strapped,
developer-centric activity with a small community (DSpace
Search/Browse) by a more configurable process that has a much larger
and more experienced enterprise community of support (Solr).
This said, there is still a need to improve how we organize our
DSpace Items and the metadata therein into Solr indexes, and to decide
on which side of the indexing process (DSpace Indexing Client vs Solr
Request Handlers) it is more appropriate to do such activities. Chris,
I would be very interested to see contributions on how to map such
features as hierarchical controlled vocabularies and other well-defined
/ normalized preexisting taxonomies/vocabularies together with Solr
facets, to allow more complex faceting features. I will add that
one tool we are considering to enhance Solr faceting further is the
Bobo Browse integration with Solr
(http://code.google.com/p/bobo-browse/wiki/SolrIntegration and
http://snaprojects.jira.com/wiki/display/BOBO/Home). The intention
here is to provide (1) sorting of facet values, (2) grouping of
search results by multiple sort fields and (3) performance
enhancements on top of Solr faceting.
So, finally, I challenge both Chris and Sauleha: you will get "more
bang for your buck" if you target Solr for such features as
auto-classification and auto-hierarchy building rather than hardcoding
them into DSpace itself. You will ultimately target a community of
users much larger than DSpace alone, and possibly
attain greater buy-in, peer review, contribution and reuse of such
enhancements. All of which will feed back to benefit DSpace in the end.
If you hardwire such tooling to DSpace for a "quick win", you not only
limit the exposure and success of your own work but, if it is
contributed into the core DSpace implementation, you will also
be obliging other DSpace stakeholders to assist in
maintaining it over the long term. This is the same problem that arose
with the original Search/Browse implementation in DSpace. The DSpace
community should always work to reuse more popular third-party
solutions with large cross-market open-source communities rather than
inventing its own custom solutions. This is because the application
targets a specific, narrow vertical market for Institutional
Repositories that is resource-limited. Reuse avoids DSpace stakeholders
being stuck with a stagnating codebase (known only by developers who
have left the project) while the larger open-source community continues
to evolve.
Sincerely,
Mark
On Tue, Jul 13, 2010 at 4:38 AM, Christophe Dupriez
<christophe.dupr...@destin.be> wrote:
Hi Sauleha!
Solr in DSpace 1.6 is, for now, only used to generate statistical
reports.
Mark Diggory (@mire) is experimenting with the integration of Solr as
an indexing/search engine for DSpace Items
(a project called DSpace Discovery).
Thank you for bringing CASTANET to my attention: it seems a
refreshing way to cope with indexing.
I must learn more!
http://www.powershow.com/view/1e363-NWU3N/Castanet_Using_WordNet_to_Build_Facet_Hierarchies
I just learned about Flamenco and ordered the printed copy of
the book:
http://searchuserinterfaces.com
which is probably a "must read" for all DSpace developers!
Personally, I went through extensive improvements of the Lucene
integration for DSpace 1.42.
I wondered for much too long how to integrate Solr to provide
faceting to my users.
Finally, I have done it with Lucene alone (no Solr added!).
It is rather simple (a few days of work) IF and ONLY IF your
faceting data is perfectly controlled and normalized upfront.
Our approach to control and normalization is described here:
http://dsug09.ub.gu.se/index.php/dsug/dsug09/paper/viewFile/22/3
I attach a JPG of the current result (a query about "MUSIC*" in a
database of 90 thousand articles about toxicology).
If it gets scrubbed, I can send it separately.
It requires some changes in classes:
* DSQuery, to analyse the current search
* Faceter, a new class to gather faceting information and to
generate the desired output
* and a modification in search\results.jsp to include a call to
Faceter in the right column of the page.
Much simpler than integrating Solr.
BUT Solr in DSpace would bring many other benefits...
if DSpace committers take it on their shoulders (too many
modifications everywhere in the DSpace code for an "outsider").
Good luck!
Christophe Dupriez
On 13/07/2010 11:03, Sauleha Durrani wrote:
Dear all,
I am trying to integrate multifaceted search with DSpace, and I am
facing several issues.
* Apache Solr provides faceted search over Lucene, but I am
unable to understand how it works. Can anyone guide me on
how Solr works? And will it help us integrate
multifaceted search with DSpace?
* My other question: I am also working on a
multifaceted algorithm, whose idea we have derived
from "CASTANET". Does anybody have another idea?
I shall be anxiously waiting for a reply.
Thank you.
Take care
Best Regards,
SAULEHA
------------------------------------------------------------------------------
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visitsprint.com/first <http://sprint.com/first> --
http://p.sf.net/sfu/sprint-com-first
_______________________________________________
Dspace-general mailing list
dspace-gene...@lists.sourceforge.net
<mailto:dspace-gene...@lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/dspace-general
--
Mark R. Diggory
Head of U.S. Operations - @mire
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel