Search in rdf.cris

Stephane Gamard Wed, 02 Oct 2013 07:05:17 -0700

Hi Team,

My name's Stephane and I am currently participating in the Fusepool FP7 project. Within this project we are using Stanbol and Clerezza as key architectural components. Coming from a pure full-text search and Information Retrieval background, I find myself in somewhat new territory.

But within that new territory there is a link to my area of expertise: Lucene/Solr in the rdf.cris package. This package turns out to be crucial for our project, and I would gladly participate and contribute my knowledge as a Lucene and Solr developer. So here, in a nutshell, is a list of "small contributions" to start with:

- Abstraction Refactoring

Currently CRIS is using Lucene as its full-text engine, but we might eventually want to move to Solr (or elasticsearch) for XYZ reasons. The first step would be to remove all Lucene dependencies from the rdf.cris package and push the Lucene-specific implementation into an rdf.cris.lucene package.
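To make the split concrete, here is a minimal sketch of what an engine-neutral contract could look like. All names here (FullTextIndex, InMemoryFullTextIndex) are hypothetical and do not exist in rdf.cris; the in-memory class only stands in for a backend that would live in rdf.cris.lucene or a Solr counterpart.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical engine-neutral contract; nothing here references Lucene types.
interface FullTextIndex {
    // Store (or overwrite) the text associated with a resource URI.
    void index(String resourceUri, String text);

    // Return the URIs of resources whose text contains the term, in insertion order.
    List<String> search(String term);
}

// Toy in-memory backend, standing in for a Lucene- or Solr-backed implementation.
final class InMemoryFullTextIndex implements FullTextIndex {
    private final Map<String, String> docs = new LinkedHashMap<>();

    @Override
    public void index(String resourceUri, String text) {
        docs.put(resourceUri, text.toLowerCase(Locale.ROOT));
    }

    @Override
    public List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Callers would depend only on the interface, so swapping the backend would not touch the rest of rdf.cris.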

- Lucene 4.x Branch

There have been a large number of changes since the 2.x and 3.x branches of Lucene. I'd propose a small refactor and overhaul of the rdf.cris.lucene package to take advantage of Lucene's new features (Facets, SearcherManager, …).

- Solr Implementation

With production use in mind, I strongly believe Clerezza's CRIS component should be able to leverage established services without having to manage their scalability itself. That applies most obviously to full-text search. The idea is to be able to use a remote Solr server (Solr, since it comes with a whole pseudo-REST service layer on top of Lucene).

- Fine Grained Search

As a logical evolution of the points above, it would then be perfect if Clerezza's full-text search capabilities could benefit from all the features of Lucene/Solr. I am especially thinking about:

-- Field/Analyzer specialisation (we don't compare authors, dates and text in the same way in Lucene/Solr)

-- Boosting (For IR, the title of a document usually yields more important information than its footnotes)

-- Advanced facets (things like date-range facets and pivot facets, called 2nd level facets in fusepool)

-- Geolocalised searches (big thing in Lucene/Solr 4.x branch… would eventually be a nice to have)
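
As an illustration of the boosting item above, here is a toy sketch of per-field weighting. It is not Lucene's scoring formula; the class name BoostedScorer, the field names, and the weights are all assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration of field boosting; this is NOT how Lucene computes scores.
final class BoostedScorer {
    // Hypothetical per-field weights; fields without an entry default to 1.0.
    private final Map<String, Double> boosts = new LinkedHashMap<>();

    BoostedScorer boost(String field, double weight) {
        boosts.put(field, weight);
        return this;
    }

    // Score = sum over fields of (field boost * occurrences of the term in that field).
    double score(Map<String, String> document, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        double total = 0.0;
        for (Map.Entry<String, String> field : document.entrySet()) {
            double weight = boosts.getOrDefault(field.getKey(), 1.0);
            String text = field.getValue().toLowerCase(Locale.ROOT);
            total += weight * countOccurrences(text, needle);
        }
        return total;
    }

    // Count non-overlapping occurrences of needle in text.
    private static int countOccurrences(String text, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) {
            count++;
        }
        return count;
    }
}
```

With a title boost higher than the footnote boost, a single title match can outrank several footnote matches, which is the behaviour described in the boosting item.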

I will carry out this work over the next few weeks/months as part of the fusepool project, but most of all I would be pleased and interested to finally get a top-notch implementation of a cross RDF-text solution. Very much looking forward to your feedback and hopefully support ;)

PS: Whoever initiated the GraphIndexer implementation did an excellent job! I will hopefully follow in their footsteps!

Cheers,

_Stephane
