Re: [xwiki-devs] [Investigation] SOLR integration

Fabio Mancinelli Fri, 09 Sep 2011 02:17:59 -0700

On Fri, Sep 9, 2011 at 12:03 AM, Paul Libbrecht <p...@hoplahup.net> wrote:
> Fabio,
>
> interesting document and challenging mission!
>
Thanks Paul for your email.


> There's a whole lot to tell about your document, but here are a few gut 
> feelings:
>
> - first, I think you should describe a few application scenarios in more 
> details; I think you'd come with the conclusion that both an EmbeddedServer 
> and a Solr WebApp (inside server or outside) make sense. It looks like this 
> decision is not needed now as Solrj offers you abstraction.
>
Well, this investigation was a starting point for understanding SOLR
and how it could possibly be used to improve XWiki search features.
I agree about the description of the application scenarios but I was
counting on the community to help on this as well :)
I just wrote in the document something interesting I found (I never
used SOLR before and I am discovering it now :))

> - I am fearing you do not get all the benefits with EmbeddedServer, in 
> particular the caching and auto-warming but that seems not to be the case:
>  http://lucene.472066.n3.nabble.com/Embedded-Server-Caching-Stats-page-updates-td827632.html
>
AFAIU the EmbeddedSolrServer is equivalent to the web application.
The difference is that the EmbeddedSolrServer is easier from an
integration point of view (it's just a matter of declaring some
dependencies in the pom.xml), while using the WAR version would mean
"merging" the SOLR web application with the XWiki one, which could be
a more difficult task.

So the first investigation focused on the EmbeddedSolrServer.

This, however, doesn't prevent us from using an external SOLR server
in some deployments:

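// Sketch: choose the SolrJ implementation from the XWiki configuration.
public SolrServer getSolrServer() throws Exception {
    // solrUrl: hypothetical value read from xwiki.cfg, null if not set
    if (solrUrl != null) {
        return new CommonsHttpSolrServer(solrUrl);
    }
    // coreContainer, coreName: assumed to be initialized elsewhere
    return new EmbeddedSolrServer(coreContainer, coreName);
}

SolrServer server = getSolrServer();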

The SolrJ APIs are the same for the two components so it's just a
matter of choosing the right implementation.

> - Your query examples are pretty hairy I find. Joe-bo users want to "just 
> type" and find relevance ranked results. Solr supports this well with the 
> DisMax query handler (it allows to put a higher rank on title for example, 
> than on body, than on attachments...). I would say you need both (the solr 
> web-app's default query handler allows both with an extra prefix). Another 
> major advantage, which the lucene plugin missed is that you can have one 
> field that is "stemmed" and a copy of it that is not. A match in the exact 
> field would rank higher.
>

Yes, I looked at it. The solrconfig.xml allows you to tune a lot of details.
My first idea was: "let's see what we can do with a minimal
solrconfig.xml, the one that could end up packaged with a standard XE
distribution if we decide to bundle SOLR".

It's clear that, given the power of SOLR, we will need at some point
to provide the user/administrator with the mechanisms to tune the
configuration of SOLR (for example, a French site might be interested
in using different analyzers, tokenizers, etc. for the analysis).

Though I think that it should be done in a way that the user interface
stays the same.
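
To make Paul's DisMax point a bit more concrete, here is a minimal
sketch of what a "just type" query could look like at the HTTP level.
defType and qf are standard Solr request parameters; the base URL, the
field names (title, content, attachment) and the boost values are
assumptions made up for the illustration, not something taken from the
design document.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class DisMaxQueryUrl {
    // Hypothetical field names and boosts: a title match counts more
    // than a match in the body or in attachment text.
    static final String QUERY_FIELDS = "title^3.0 content^1.5 attachment^1.0";

    // Builds a select URL that hands the raw user input to the DisMax parser.
    static String build(String solrBaseUrl, String userInput) {
        return solrBaseUrl + "/select?q=" + encode(userInput)
            + "&defType=dismax&qf=" + encode(QUERY_FIELDS);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", "release notes"));
    }
}
```

The same boosts could live in solrconfig.xml as handler defaults, so
that the search UI only has to send the text the user typed.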

> - In all applications I've worked on, indexing pages when they change is not 
> enough because there are pages that depend on others... this needs to be 
> addressed at the application level (think, e.g. about the dashboard, about 
> "book" pages that enclose others): re-index triggers.
>
This could be done in the component logic.
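
As a rough idea of what such component logic might look like, the
sketch below keeps a map from each page to the pages that embed it and
computes everything that has to be re-indexed when one page changes.
The class name, the method names and the page names are all
hypothetical; nothing like this exists in the investigation yet.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ReindexPlanner {
    // included page -> pages whose indexed content embeds it (hypothetical model)
    private final Map<String, Set<String>> dependents = new HashMap<String, Set<String>>();

    // Records that includingPage shows content coming from includedPage.
    void addDependency(String includingPage, String includedPage) {
        Set<String> pages = dependents.get(includedPage);
        if (pages == null) {
            pages = new LinkedHashSet<String>();
            dependents.put(includedPage, pages);
        }
        pages.add(includingPage);
    }

    // Returns the changed page plus every page that depends on it,
    // directly or transitively. Cycles are handled by the visited check.
    Set<String> pagesToReindex(String changedPage) {
        Set<String> result = new LinkedHashSet<String>();
        Deque<String> queue = new ArrayDeque<String>();
        queue.add(changedPage);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            if (result.add(page)) {
                Set<String> pages = dependents.get(page);
                if (pages != null) {
                    queue.addAll(pages);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ReindexPlanner planner = new ReindexPlanner();
        planner.addDependency("Book.WebHome", "Book.Chapter1");
        planner.addDependency("Main.Dashboard", "Book.WebHome");
        System.out.println(planner.pagesToReindex("Book.Chapter1"));
    }
}
```

A document-saved event listener could call pagesToReindex and queue
each returned page for indexing.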

> Another crucial aspect is to stimulate anyone working on a particular schema 
> to be economic. The biggest flaw of the xwiki-lucene-module is that it 
> indexed and stored everything... that meant that a single result document was 
> quite big. Storing typically is probably not useful.
>
Yep. If you store everything you will duplicate your XWiki database :)
The schema.xml is a delicate point because once it's decided it should
be frozen: the fields it declares will then be used by other
components via the API to retrieve the returned information.

I found it interesting that you can declare dynamic fields, which are
associated with a given type through a name prefix/suffix. This could
be used as a way to extend the schema at runtime if an application
needs to.
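
A small sketch of how an application could take advantage of this: it
derives the field name from the value type, so that the field matches
one of the dynamic field patterns. It assumes that schema.xml declares
the patterns *_s, *_i, *_b and *_dt, as Solr's example schema does;
the class and the property names are hypothetical.

```java
import java.util.Date;

class DynamicFieldNames {
    // Maps an application property to a field name that a dynamicField
    // rule would match. Assumes the patterns *_s, *_i, *_b and *_dt
    // are declared in schema.xml.
    static String fieldNameFor(String property, Object value) {
        if (value instanceof Integer) {
            return property + "_i";
        }
        if (value instanceof Boolean) {
            return property + "_b";
        }
        if (value instanceof Date) {
            return property + "_dt";
        }
        return property + "_s"; // anything else is indexed as a string
    }

    public static void main(String[] args) {
        System.out.println(fieldNameFor("rating", 4));
        System.out.println(fieldNameFor("reviewed", Boolean.TRUE));
        System.out.println(fieldNameFor("category", "howto"));
    }
}
```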

> - particular scenarios will have particular UIs. Would you sketch one that 
> would be default for 3.2? Would authors be facets? spaces?
>
UI is another tricky point.

I am thinking about a "standard distribution", that is, how a UI
leveraging SOLR as a search engine should appear if SOLR is integrated
in XWiki by default.
So the basic scenario is: everything in the UI stays the same and we
just change the engine under the hood.

However the fact that SOLR has a lot of interesting features (e.g.,
facets) might drive the *standard* search UI towards some
improvements.

For example, as you suggested, spaces and authors could be interesting
facets, and I would add dates as well.
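
For illustration only, this is roughly what asking SOLR for such
facets looks like at the request level. facet and facet.field are
standard Solr parameters; the base URL and the field names (space,
author) are assumptions, and date facets would need Solr's separate
date faceting parameters, which are not shown here.

```java
class FacetQueryUrl {
    // Builds a select URL that asks for one facet per given field.
    // encodedQuery is expected to be URL-encoded already; the field
    // names are hypothetical and must exist in the schema.
    static String build(String solrBaseUrl, String encodedQuery, String... facetFields) {
        StringBuilder url = new StringBuilder(solrBaseUrl);
        url.append("/select?q=").append(encodedQuery).append("&facet=true");
        for (String field : facetFields) {
            url.append("&facet.field=").append(field);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr", "wiki", "space", "author"));
    }
}
```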

This is an open discussion.

From your question I also understand that you are suggesting a way to
customize the UI in order to take into account particular search
scenarios.
This would be great but I have no idea, at this point, how to do it,
or whether it is really worth adding this flexibility to the standard
distribution.

After all, if your scenario is that particular you can always write an
application that uses a custom solrconfig.xml, schema and UI :)

> - I would suggest to enter best practice as soon as possible: make 
> evaluations possible per default. A typical evaluation would be run by a 
> content expert that would know his documents and would invent a few queries 
> (e.g. reading the logs) and check the correct or incorrect results, that'd 
> give mean precision and recall at each of the results, something you can then 
> collect and tabulate to assess the "mean" quality of a search engine (that 
> paper: 
> http://www.oracleimg.com/technetwork/database/enterprise-edition/imt-quality-092464.html
>  explains this well). I'm just back from a summer school on Information 
> Retrieval and there's a lot there.
>
I see where you are heading :)

Well, this investigation was more modest.
As I said, it was just to try to understand if/how we could use SOLR
as the default search infrastructure for the default XWiki
distribution.

It's clear that indexing/searching should be tuned with respect to the
domain. It would be good to make the integration flexible enough that
this tuning could be taken into account. Though for a first iteration
I think that would be too much :)
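
Just to pin down the terms, the evaluation Paul describes boils down
to something like the sketch below: for one query, an expert provides
the set of relevant documents, and precision and recall are computed
over the first k results. The class, the method names and the document
ids are hypothetical, and precision is divided by the number of
results actually returned when there are fewer than k of them, which
is only one of the possible conventions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SearchEvaluation {
    // Counts how many of the first k results the expert judged relevant.
    static int relevantInTopK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (String doc : ranked.subList(0, Math.min(k, ranked.size()))) {
            if (relevant.contains(doc)) {
                hits++;
            }
        }
        return hits;
    }

    // Fraction of the considered results that are relevant.
    static double precisionAt(List<String> ranked, Set<String> relevant, int k) {
        int considered = Math.min(k, ranked.size());
        return considered == 0 ? 0.0 : (double) relevantInTopK(ranked, relevant, k) / considered;
    }

    // Fraction of the relevant documents found in the first k results.
    static double recallAt(List<String> ranked, Set<String> relevant, int k) {
        return relevant.isEmpty() ? 0.0 : (double) relevantInTopK(ranked, relevant, k) / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("A", "B", "C", "D");
        Set<String> relevant = new HashSet<String>(Arrays.asList("A", "C", "E"));
        System.out.println(precisionAt(ranked, relevant, 4));
        System.out.println(recallAt(ranked, relevant, 4));
    }
}
```

Averaging these values over a set of queries gives the "mean" quality
figure that can be tabulated and compared between configurations.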

> I am sorry I cannot offer much time but I would love to lend a little hand.
>
Well, your mail has been very useful.

Thanks,
Fabio

> paul
>
> Le 6 sept. 2011 à 17:29, Fabio Mancinelli a écrit :
>
>> Hi everybody,
>>
>> for the 3.2 release cycle I said that I was going to investigate a bit
>> the SOLR search engine and how to use/integrate it in the current
>> platform.
>> I wrote a document that you can find here:
>> http://dev.xwiki.org/xwiki/bin/view/Design/SOLRIntegration about some
>> of the things I looked at.
>>
>> There is a lot of room for discussion/improvement but I think the
>> document is already a good starting point.
>>
>> Feedback is welcome.
>>
>> Thanks,
>> Fabio
>> _______________________________________________
>> devs mailing list
>> devs@xwiki.org
>> http://lists.xwiki.org/mailman/listinfo/devs
