Hi Stephane, sorry for the late response.
2013/10/3 Stephane Gamard <[email protected]>

> Thank you Tommaso,
>
> I might need help, or at the very least simple pointers and debates over
> certain principles and guidelines.
>
> First one being: the choice to either abstract everything related to
> search (such as sorting fields, queries, filters and facets) or to use
> the Lucene native objects. Small overview of pros and cons (for the
> rdf.cris package, not the implementation packages).
>

Yes, that's usually one of the biggest challenges when search is not part
of the core architecture / infrastructure. I tend to prefer the more
abstract way of doing things, with an eye on having generic yet flexible
APIs as much as possible. At the same time, having a number of use cases
and implementation features that one wants to leverage can be a good
driver for designing such APIs.

> *Native Lucene*
> + Objects already exist and are well implemented (SortField, Facet, …)
> - Bound to Lucene semantics (fairly easy to use, but certain impl
> providers would have to rewrite using a Lucene translation… in case
> someone wants to make a FAST or GSA impl for Clerezza). Note that Lucene,
> Solr and Elasticsearch can fairly easily work with native Lucene objects
> +/- Would have to put all search-ability logic into helper classes so as
> not to force external packages to talk "Lucene"
>
> *Abstracted Classes*
> - A LOT of re-coding of concepts that are straightforward in Lucene
> + No Lucene dependencies and no need for helper classes
> + Not bound to any impl; a rewrite for a possible Solr, GSA, FAST, … impl
> would not require basic knowledge of Lucene.
>
> I'd be interested in your POV on this. My main goal is for people outside
> of the rdf.cris package to never have to learn any specialised API while
> still taking advantage of all the IR features of any search engine.
>

I think this last requirement goes in the direction of the more abstract
design. Maybe a good compromise to start with would be sketching up an
API, extending / implementing a couple of use cases with Lucene, enhancing
the API, and iterating a number of times until we're satisfied with it
(I've appended a very rough sketch of what I have in mind at the end of
this mail, below the quoted thread).

My 2 cents,
Tommaso

> _Stephane
>
>
> On October 3, 2013 at 1:59:07 PM, Tommaso Teofili (
> [email protected]) wrote:
>
> Hi Stephane,
>
> I don't have much time now, but I just wanted to let you know that IMHO
> your list of goals / tasks sounds completely reasonable; in case you need
> it, I may be able to give some help over the next weeks.
>
> Regards,
> Tommaso
>
>
> 2013/10/2 Stephane Gamard <[email protected]>
>
> > Hi Team,
> >
> > My name's Stephane and I am currently participating in the Fusepool FP7
> > project. Within this project we are using Stanbol & Clerezza as key
> > architectural components. Coming from a pure full-text search and
> > Information Retrieval background, I find myself in somewhat new
> > territory.
> >
> > But within that new territory there is a link to my area of expertise:
> > Lucene/Solr in the rdf.cris package. This package turns out to be
> > crucial for our project, and I would gladly participate and contribute
> > my knowledge as a Lucene and Solr developer. So here, in a nutshell, is
> > a list of "small contributions" to start with:
> >
> > - Abstraction Refactoring
> > Currently CRIS uses Lucene as its full-text engine, but we might want to
> > eventually go to Solr (or Elasticsearch, for XYZ reasons). The first
> > step would be to remove all Lucene dependencies from the rdf.cris
> > package and push the implementation into the rdf.cris.lucene package.
> >
> > - Lucene 4.x Branch
> > There are a large number of changes since the 2.x and 3.x branches of
> > Lucene. I'd propose a small refactor and overhaul of the rdf.cris.lucene
> > package to take advantage of Lucene's new features (facets,
> > SearcherManager, …).
> >
> > - Solr Implementation
> > In line with running "in production", I strongly believe Clerezza's CRIS
> > component should be able to leverage established services without having
> > to manage their scalability. That goes for full-text search most
> > obviously. The idea is to be able to use a remote Solr server (Solr,
> > since it comes with the whole pseudo-REST service layer on top of
> > Lucene).
> >
> > - Fine-Grained Search
> > As a logical evolution of the points above, it would then be perfect if
> > Clerezza's full-text search capabilities could benefit from all the
> > features of Lucene/Solr. I am especially thinking about:
> > -- Field/Analyzer specialisation (we don't compare authors, dates and
> > text the same way in Lucene/Solr)
> > -- Boosting (for IR, the title of a document usually yields more
> > important information than its footnotes)
> > -- Advanced facets (things like date-range facets and pivot facets
> > (called 2nd-level facets in Fusepool))
> > -- Geolocalised searches (a big thing in the Lucene/Solr 4.x branch…
> > would eventually be a nice-to-have)
> >
> > I will carry out this work over the next few weeks/months as part of the
> > Fusepool project, but most of all I would be pleased and interested in
> > finally getting a top-notch cross RDF / full-text solution. Very much
> > looking forward to your feedback and hopefully your support ;)
> >
> > PS: whoever initiated the GraphIndexer implementation did an excellent
> > job! I will hopefully follow in his footsteps!
> >
> > Cheers,
> >
> > _Stephane
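
P.S.: to make the "sketch up an API, implement it with Lucene, iterate"
idea a bit more concrete, here is a very rough illustration of what an
engine-neutral layer in rdf.cris could look like. Every name in it
(SortSpecification, FacetRequest, CrisSearcher, SearchResult) is invented
for the sake of this discussion; nothing like this exists in the package
today, and each type would of course live in its own file:

/** Engine-neutral sort criterion, so callers never import org.apache.lucene.search.SortField. */
public final class SortSpecification {

    private final String propertyUri;   // URI of the RDF property to sort on
    private final boolean descending;

    public SortSpecification(String propertyUri, boolean descending) {
        this.propertyUri = propertyUri;
        this.descending = descending;
    }

    public String getPropertyUri() { return propertyUri; }
    public boolean isDescending() { return descending; }
}

/** Engine-neutral facet request: "give me the top N values of this property". */
public final class FacetRequest {

    private final String propertyUri;
    private final int maxValues;

    public FacetRequest(String propertyUri, int maxValues) {
        this.propertyUri = propertyUri;
        this.maxValues = maxValues;
    }

    public String getPropertyUri() { return propertyUri; }
    public int getMaxValues() { return maxValues; }
}

/**
 * The only interface code outside rdf.cris would have to learn.
 * SearchResult (not shown here) would wrap the matching resources plus facet counts.
 */
public interface CrisSearcher {

    SearchResult search(String queryText,
                        java.util.List<SortSpecification> sortBy,
                        java.util.List<FacetRequest> facets,
                        int offset,
                        int limit);
}

The translation to native Lucene objects would then stay entirely inside
rdf.cris.lucene, along these lines (again, just an illustration of the
idea, not tested code):

import java.util.List;

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

final class LuceneSortTranslator {

    /** Maps the engine-neutral sort specs onto Lucene's SortField (string sorting as an example). */
    static Sort toLuceneSort(List<SortSpecification> specs) {
        SortField[] fields = new SortField[specs.size()];
        for (int i = 0; i < specs.size(); i++) {
            SortSpecification spec = specs.get(i);
            fields[i] = new SortField(spec.getPropertyUri(), SortField.Type.STRING, spec.isDescending());
        }
        return new Sort(fields);
    }
}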

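P.P.S.: regarding the "remote Solr server" point, SolrJ makes that part
fairly lightweight. A Solr-backed implementation of such an interface
would essentially boil down to something like the snippet below; the URL,
core name and field names ("author", "id") are placeholders, not anything
CRIS defines today:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class RemoteSolrSketch {

    public static void main(String[] args) throws SolrServerException {
        // Point at a remote Solr core; the URL is a placeholder.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/cris");

        SolrQuery query = new SolrQuery("title:clerezza");
        query.setRows(10);
        query.setFacet(true);
        query.addFacetField("author");   // facet on a hypothetical "author" field

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        System.out.println(response.getFacetField("author").getValues());
    }
}

The nice side effect is that scalability (replication, sharding) stays a
Solr deployment concern rather than something CRIS has to manage itself,
which is exactly the "leverage established services" point above.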