montysolr search engine

Roman Chyla Mon, 09 Jan 2012 10:42:56 -0800

Hello,

I would like to attract your attention (even if for a few seconds) to
the following virtual machine (accessible also from outside CERN)


http://insdev01.cern.ch

It hosts a MontySolr instance as an example of a search engine for
Invenio that is provided by Solr. It is not there to incite you to find
all the differences between Invenio and SOLR :) but hopefully it will
allow us to see them in action. In fact, MontySolr now has the
potential to behave *exactly* like Invenio.

Invenio:
http://insdev01.cern.ch/search?ln=en&p=boson&f=&action_search=Search&c=Atlantis+Institute+of+Fictive+Science&sf=year&so=d&rm=&rg=10&sc=0&of=hb

Montysolr:
http://insdev01.cern.ch/search?ln=en&p=boson&f=&solrpie=on&action_search=Search&c=Atlantis+Institute+of+Fictive+Science&sf=year&so=d&rm=&rg=10&sc=0&of=hb

There is a checkbox next to the search box. When it is activated, the
search goes through MontySolr; otherwise it goes through the standard
Invenio search engine.
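
The two URLs above differ only in the solrpie=on parameter. As a rough
sketch (the function name search_url is made up for this illustration,
and I am assuming that solrpie=on is exactly what the checkbox adds),
both links can be built like this:

```python
from urllib.parse import urlencode

# Demo Invenio instance from this email.
BASE = "http://insdev01.cern.ch/search"


def search_url(pattern, use_montysolr=False):
    """Build a demo search URL (hypothetical helper, not part of Invenio).

    With use_montysolr=True the solrpie=on parameter is added, which is
    assumed to be what the checkbox next to the search box sends.
    """
    params = [("ln", "en"), ("p", pattern), ("f", "")]
    if use_montysolr:
        params.append(("solrpie", "on"))
    params += [
        ("action_search", "Search"),
        ("c", "Atlantis Institute of Fictive Science"),
        ("sf", "year"),
        ("so", "d"),
        ("rm", ""),
        ("rg", "10"),
        ("sc", "0"),
        ("of", "hb"),
    ]
    # urlencode escapes spaces as '+', matching the links above.
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_url("boson"))
    print(search_url("boson", use_montysolr=True))
```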

I am now trying to increase the DB space on my virtual machine and
load the >30GB INSPIRE dump; then I can index INSPIRE and send a link
to the search engine.

Currently MontySolr doesn't know how to deal with collections (it
returns everything as a list of hits), but that is an issue to be
solved on the Invenio side later. Also, I'd expect some queries may
fail because I didn't spend much time tuning everything, but it is a
configuration issue - if you are interested in details, please
continue reading (you were warned...)




MontySolr is a Solr search appliance that uses both SOLR and Invenio
to search the data. It is able to harvest search results from both
Invenio and Solr, merge them into one set, sort them, format them the
Invenio or the SOLR way, and return them as SOLR XML. MontySolr now
contains a completely new query parser that can understand different
language grammars (one of them, now working, is the Invenio grammar).

If you are interested in what the Invenio grammar might look like,
please take a look at this:

http://insdev01.cern.ch/img/Invenio.g
http://insdev01.cern.ch/img/Invenio.html

Just as a reminder, the code is at: https://github.com/romanchyla/montysolr


Now, to briefly describe how it works..

The Invenio installation (at insdev01.cern.ch) is where the user
submits a query. Invenio then delegates the job to a remote search
service:

In this case, the remote search service is SOLR at insdev03.cern.ch:8983

The search service receives the query (including all parameters) and
answers it. In the process, it consults the Solr or Invenio indexes
(insdev03.cern.ch has access to the same database as insdev01.cern.ch,
but it does not write to it). Results are sent back in the standard
Solr XML format. Invenio (at insdev01) receives them and either
displays them to the user directly (e.g. citation summary) or formats
the records and displays them.
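
To give an idea of what the receiving side has to do with the Solr XML,
here is a minimal sketch. The sample response and the field name "id"
are invented for this illustration (the real MontySolr schema may use
other names), and record_ids is not a function from Invenio or
MontySolr:

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the usual Solr XML response layout.
# The "id" field and its values are made up for this example.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><int name="id">12</int></doc>
    <doc><int name="id">47</int></doc>
  </result>
</response>"""


def record_ids(solr_xml, id_field="id"):
    """Return (numFound, list of record ids) from a Solr XML response."""
    root = ET.fromstring(solr_xml)
    result = root.find("result[@name='response']")
    if result is None:
        return 0, []
    ids = []
    for doc in result.findall("doc"):
        # Each child of <doc> is a typed field with a name attribute.
        for field in doc:
            if field.get("name") == id_field and field.text:
                ids.append(int(field.text))
    return int(result.get("numFound", "0")), ids


if __name__ == "__main__":
    print(record_ids(SAMPLE))
```

With the list of record ids in hand, the formatting of the records can
stay entirely on the Invenio side.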

Once installed and configured, the search engine automatically tracks
record changes in the main Invenio -- without any intervention. So for
example, when a cataloguer changes a record, Invenio can simply ping
this url

http://insdev03.cern.ch:8983/solr/invenio/update

And MontySolr discovers by itself what was changed - no need to pass
any arguments. Or there can be a cron job that pings this URL every x
seconds (it is very fast).
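
As a sketch of such a periodic ping (assuming that a plain GET without
arguments is enough; the function names and the 10 second timeout are
my own choices for the example, not something MontySolr prescribes):

```python
import time
from urllib.request import urlopen

UPDATE_URL = "http://insdev03.cern.ch:8983/solr/invenio/update"


def ping(url=UPDATE_URL, fetch=urlopen, timeout=10):
    """Send a GET without arguments; return the HTTP status, or None on failure."""
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return None


def ping_every(seconds, rounds, ping_fn=ping, sleep=time.sleep):
    """Ping a fixed number of times, sleeping between the rounds."""
    results = []
    for i in range(rounds):
        results.append(ping_fn())
        if i < rounds - 1:
            sleep(seconds)
    return results


if __name__ == "__main__":
    # Three pings, one minute apart.
    print(ping_every(60, 3))
```

The fetch and sleep arguments exist only so that the loop can be tried
out without touching the network; a real cron job would just call
ping() once per run.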

Everything is controlled using REST API, so for example:

to update an index:
http://insdev03.cern.ch:8983/solr/invenio/update

to commit changes (now it is not *configured* to be automatic):
http://insdev03.cern.ch:8983/solr/update?commit=true

to search data (SOLR way):
http://insdev03.cern.ch:8983/solr/select

to search data (Invenio way):
http://insdev03.cern.ch:8983/solr/invenio
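
The four URLs above can be kept in one small table on the client side.
This is only a sketch: the action names and the helper with_params are
invented here, and the only query parameter I use is q, which is the
standard Solr one for /select (which parameters the /invenio handler
expects is not spelled out in this email, so none are shown for it):

```python
from urllib.parse import urlencode

SOLR = "http://insdev03.cern.ch:8983/solr"

# Action names are made up for this example; the URLs are from above.
ENDPOINTS = {
    "update_index": SOLR + "/invenio/update",
    "commit": SOLR + "/update?commit=true",
    "search_solr": SOLR + "/select",
    "search_invenio": SOLR + "/invenio",
}


def with_params(action, **params):
    """Return the URL for an action, with optional query parameters appended."""
    url = ENDPOINTS[action]
    if not params:
        return url
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)


if __name__ == "__main__":
    print(with_params("update_index"))
    print(with_params("commit"))
    print(with_params("search_solr", q="boson"))
```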

When indexing, Solr retrieves data from Invenio using MARC XML
exports, as this is the most flexible option (but at the same time
SOLR is able to retrieve data from the database directly - it can
simply ask Invenio for it, which is how fulltexts are indexed).

Please have a look and play some more; it is not all ideal yet. I
know, for example, that the automated installation is failing on old
versions of Ant (1.6), so I'll work on that. Also, MontySolr is
currently having problems with multiprocessing, probably because some
changes I made broke the functionality; I'll have to track that down.


I haven't done a release yet, but the major changes since the initial
release are these:

- MontySolr is split into components -> core and plugins.
- core handles communication between Java and Python (nothing much else).
- tons of unit tests now
- three new contribs
  - Invenio: provides the search engine that behaves the same way as Invenio
  - AdvancedQueryParser: support for different query languages (can
    handle very different grammars)
  - newseman: semantic search provided by
    http://code.google.com/p/newseman/
- the build process is automated (using ant) and follows the lucene
  build infrastructure (fully automated for 95% of components at this
  stage)
- specific configuration (e.g. for invenio-demo, inspire, ADS) can be
  written as a contrib


I'll send more information about the query parsing probably tomorrow.
There are some issues (with the Invenio query syntax) that are rather
interesting. They will be better illustrated using some examples and
pictures. This email is already too long :)

I wish you a pleasant day&night!

  roman


PS: and now the demo can break... :)
