Hello,

I would like to draw your attention (even if only for a few seconds) to the following virtual machine (also accessible from outside CERN):
http://insdev01.cern.ch

It hosts a MontySolr instance, as an example of a search engine for Invenio that is provided by Solr. It is not there to incite you to find all the differences between Invenio and Solr :) but hopefully it will allow us to see them both in action. In fact, MontySolr now has the potential to behave *exactly* like Invenio.

Invenio:
http://insdev01.cern.ch/search?ln=en&p=boson&f=&action_search=Search&c=Atlantis+Institute+of+Fictive+Science&sf=year&so=d&rm=&rg=10&sc=0&of=hb

MontySolr:
http://insdev01.cern.ch/search?ln=en&p=boson&f=&solrpie=on&action_search=Search&c=Atlantis+Institute+of+Fictive+Science&sf=year&so=d&rm=&rg=10&sc=0&of=hb

There is a checkbox next to the search box. When it is activated, the search goes through MontySolr; otherwise it goes through the standard Invenio search engine.

I am now trying to increase the DB space on my virtual machine and load the >30GB INSPIRE dump; then I can index INSPIRE and send a link to the search engine.

Currently MontySolr doesn't know how to deal with collections (it returns everything as a list of hits), but that is an issue to be solved on the Invenio side later. Also, I expect some queries may fail because I didn't spend much time tuning everything, but that is a configuration issue.

If you are interested in details, please continue reading (you were warned...)

MontySolr is a Solr search appliance that uses both Solr and Invenio to search the data. It is able to harvest search results from both Invenio and Solr, merge them into one set, sort them, format them the Invenio or the Solr way, and return them as Solr XML. MontySolr now contains a completely new query parser that can understand different language grammars (one of them, now working, is the Invenio grammar).
If you are interested in what the Invenio grammar looks like, please take a look at this:

http://insdev01.cern.ch/img/Invenio.g
http://insdev01.cern.ch/img/Invenio.html

Just as a reminder, the code is at: https://github.com/romanchyla/montysolr

Now, to briefly describe how it works:

The Invenio installation (at insdev01.cern.ch) is where the user submits a query. Invenio then delegates the job to a remote search service; in this case, the remote search service is Solr at insdev03.cern.ch:8983.

The search service receives the query (including all parameters) and answers it. In the process, it consults the Solr or Invenio indexes (insdev03.cern.ch has access to the same database as insdev01.cern.ch, but it does not write to it). Results are sent back in the standard Solr XML format.

Invenio (at insdev01) receives them and either displays them to the user directly (e.g. citation summary) or formats the records and displays them.

Once installed and configured, the search engine automatically tracks record changes in the main Invenio, without any intervention. So, for example, when a cataloguer changes a record, Invenio can simply ping this URL:

http://insdev03.cern.ch:8983/solr/invenio/update

MontySolr then discovers by itself what was changed; there is no need to pass any arguments. Alternatively, a cron job can ping this URL every x seconds (it is very fast).
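As an illustration of the "ping" described above, here is a minimal sketch in Python. Only the update URL comes from this message; the helper names (ping_update, poll), the use of a plain GET request, and the timeout value are assumptions made for the example, not part of MontySolr or Invenio.

```python
import time
import urllib.request

# URL quoted in this message; everything else below is an assumption.
UPDATE_URL = "http://insdev03.cern.ch:8983/solr/invenio/update"


def ping_update(url=UPDATE_URL, timeout=10, opener=urllib.request.urlopen):
    """Send a bare GET (no arguments) to the update handler.

    Returns the HTTP status code of the response.
    """
    with opener(url, timeout=timeout) as response:
        return response.status


def poll(interval_seconds, rounds, ping=ping_update, sleep=time.sleep):
    """Ping the handler `rounds` times, sleeping between consecutive pings.

    This stands in for the cron job: a hypothetical loop, not a daemon.
    """
    statuses = []
    for i in range(rounds):
        statuses.append(ping())
        if i < rounds - 1:  # no need to wait after the last ping
            sleep(interval_seconds)
    return statuses
```

A hook in the record-editing code could call ping_update() once after each change, while poll(60, 10) would mimic a cron job firing every minute for ten rounds. The ping and sleep arguments are injectable so the loop can be tried without touching the network.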
Everything is controlled through a REST API, so for example:

to update an index:
http://insdev03.cern.ch:8983/solr/invenio/update

to commit changes (it is currently not *configured* to be automatic):
http://insdev03.cern.ch:8983/solr/update?commit=true

to search data (the Solr way):
http://insdev03.cern.ch:8983/solr/select

to search data (the Invenio way):
http://insdev03.cern.ch:8983/solr/invenio

When indexing, Solr retrieves data from Invenio using MARCXML exports, as this is the most flexible option (but at the same time Solr is able to retrieve data from the database directly; it can simply ask Invenio for it, and this is how fulltexts are indexed).

Please have a look and play some more; it is not all ideal yet. I know, for example, that the automated installation fails on old versions of Ant (1.6), so I'll work on that. Also, MontySolr currently has problems with multiprocessing; probably some change I made broke that functionality, and I'll have to find it.

I haven't done a release yet, but the major changes since the initial release are these:

- MontySolr is split into components -> core and plugins
  - core handles communication between Java and Python (nothing much else)
- tons of unit tests now
- three new contribs
  - Invenio: provides the search engine that behaves the same way as Invenio
  - AdvancedQueryParser: support for different query languages (can handle very different grammars)
  - newseman: semantic search provided by this http://code.google.com/p/newseman/
- the build process is automated (using ant) and follows the Lucene build infrastructure (fully automated for 95% of components at this stage)
- specific configuration (e.g. for invenio-demo, inspire, ADS) can be written as a contrib

I'll send more information about the query parsing, probably tomorrow. There are some issues (with the Invenio query syntax) that are rather interesting. They will be better illustrated using some examples and pictures.
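Coming back to the REST endpoints listed earlier: a minimal sketch of how a client might build a query for the Solr-style select handler and read the Solr XML that comes back. The select URL is the one given in this message, and q and rows are standard Solr query parameters. The function names, the field names (id, title) and the sample response are invented for illustration and say nothing about the actual index schema on insdev03.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# URL quoted in this message.
SELECT_URL = "http://insdev03.cern.ch:8983/solr/select"


def build_select_url(query, rows=10, base=SELECT_URL):
    """Build a select URL using the standard Solr `q` and `rows` parameters."""
    return base + "?" + urllib.parse.urlencode({"q": query, "rows": rows})


def parse_solr_xml(xml_text):
    """Return (number of hits, list of docs) from a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    hits = int(result.get("numFound"))
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for element in doc:
            name = element.get("name")
            if element.tag == "arr":
                # multi-valued field: collect the text of each child value
                fields[name] = [child.text for child in element]
            else:
                fields[name] = element.text
        docs.append(fields)
    return hits, docs


# Hypothetical response; the field names are assumptions, not the real schema.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="id">1</str><arr name="title"><str>Boson A</str></arr></doc>
    <doc><str name="id">2</str><arr name="title"><str>Boson B</str></arr></doc>
  </result>
</response>"""

hits, docs = parse_solr_xml(SAMPLE)
print(build_select_url("boson"))
print(hits, [d["id"] for d in docs])
```

The parser only relies on the general shape of a Solr XML response (a result element carrying numFound, with doc children), so it should carry over to other fields, but treat it as a starting point, not a description of what MontySolr returns.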
This email is already too long :)

I wish you a pleasant day & night!

roman

PS: and now the demo can break... :)
