Re: [xwiki-devs] [Investigation] SOLR integration

Paul Libbrecht Wed, 14 Sep 2011 00:40:47 -0700

Le 9 sept. 2011 à 11:17, Fabio Mancinelli a écrit :
>> - first, I think you should describe a few application scenarios in more 
>> details; I think you'd come with the [..]
> Well, this investigation was a starting point for understanding SOLR
> and how it could be possibly used to improve XWiki search features.
> I agree about the description of the application scenarios but I was
> counting on the community to help on this as well :)
> I just wrote in the document something interesting I found (I never
> used SOLR before and I am discovering it now :))


We should link to existing usage of SOLR in the wild and at the competition.
Among others, it's the search-engine of drupal.org ;-).

> The problem is that using the EmbeddedSolrServer is easier from an
> integration point of view (it's just a matter of declaring some
> dependencies in the pom.xml) while using a WAR version would mean to
> "merge" SOLR web application with the XWiki one which could be a more
> difficult task.

I would surely separate the wars!

> So the first investigation focused on the EmbeddedSolrServer.
> This, however, doesn't prevent using an external SOLR server in some
> deployments:
> [...]
> The SolrJ APIs are the same for the two components so it's just a
> matter of choosing the right implementation.

I agree, no worry there.

>> - Your query examples are pretty hairy I find. Joe-bo users want to "just 
>> type" and find relevance ranked results. Solr supports this well with the 
>> DisMax query handler (it allows to put a higher rank on title for example, 
>> than on body, than on attachments...). I would say you need both (the solr 
>> web-app's default query handler allows both with an extra prefix). Another 
>> major advantage, which the lucene plugin missed is that you can have one 
>> field that is "stemmed" and a copy of it that is not. A match in the exact 
>> field would rank higher.
> 
> Yes, I looked at it. The solrconfig.xml allows you to tune a lot of details.
> My first idea was: "let's see what we can do with a minimal
> solrconfig.xml, the one that could end up packaged with a standard XE
> distribution if we decide to bundle solr"

Then please make DisMax the default query type!
Include an "advanced query" checkbox that switches the query type, so that 
users can write queries with explicit field names and other details.

> It's clear that, given the power of SOLR, we will need at some point
> to provide the user/administrator the mechanisms to tune the
> configuration of SOLR (for example, a French site might be interested
> in using a different type of analysers, tokenizer, etc. for the
> analysis)
> Though I think that it should be done in a way that the user interface
> stays the same.

Correct. I would like to help you at this very point, where "european" software 
is far better than American software: internationalization must be there from 
version zero on. I would like to provide basic language-dependent functionality 
so that a minimum of fuzziness is supported in the default query type but is 
avoided if you make precise queries.

Here's the proposal. We'd make fields such as:
- text: full-text, exact tokens (whitespace analyzer)
- text_standard: full-text, standard-analyzer (e.g. best for emails and URLs)
- text_fr: stemmed with the French analyzer (filled if the document is 
recognized as French)
- text_de: ...
- text_bits: makes any non-letter a token separator
Same with title_*
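
As a rough illustration of how text and text_bits would differ, here is a plain 
Java sketch with no Solr involved. The class and method names are invented for 
this example, and in a real setup the tokenization would come from the analyzers 
declared in schema.xml rather than from code like this.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics two of the proposed analyses without Solr.
class FieldTokenSketch {

    // "text": exact tokens, split on whitespace only.
    static List<String> whitespaceTokens(String input) {
        return split(input, "\\s+");
    }

    // "text_bits": any non-letter character separates tokens.
    static List<String> bitsTokens(String input) {
        return split(input, "[^\\p{L}]+");
    }

    private static List<String> split(String input, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.split(separatorRegex)) {
            if (!token.isEmpty()) { // a leading separator yields an empty first token
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "see http://example.org/chevaux";
        System.out.println(whitespaceTokens(sample));
        System.out.println(bitsTokens(sample));
    }
}
```

With this input, the whitespace variant keeps the URL as a single token, while 
the text_bits variant exposes chevaux as a token of its own.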

And the dismax qf parameter would be something like:

title^3 title_standard^2 title_fr^1.5 title_en^1.5 title_de^1.5 title_bits^1.2 
text^3 text_standard^2 text_fr^1.5 text_en^1.5 text_de^1.5 text_bits^1.2
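
Since the same suffixes and boosts repeat for every base field, the string could 
be generated rather than typed by hand. A minimal sketch in plain Java, assuming 
the suffixes and boosts listed above; the class and method names are hypothetical 
and nothing here calls Solr.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustration only: assembles the qf string from base fields and variants.
class QfSketch {

    // Field-name suffix -> boost, in the order used in the qf above.
    static Map<String, String> variantBoosts() {
        Map<String, String> boosts = new LinkedHashMap<>();
        boosts.put("", "3");          // exact tokens
        boosts.put("_standard", "2"); // standard analyzer
        boosts.put("_fr", "1.5");     // language-specific stemmed copies
        boosts.put("_en", "1.5");
        boosts.put("_de", "1.5");
        boosts.put("_bits", "1.2");   // non-letters as separators
        return boosts;
    }

    // Emits "<base><suffix>^<boost>" for every base field and variant.
    static String buildQf(String... baseFields) {
        StringJoiner qf = new StringJoiner(" ");
        for (String base : baseFields) {
            for (Map.Entry<String, String> variant : variantBoosts().entrySet()) {
                qf.add(base + variant.getKey() + "^" + variant.getValue());
            }
        }
        return qf.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQf("title", "text"));
    }
}
```

Adding a language would then be one more entry in the map, and the result would 
be sent as the qf parameter of the request.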

This way, if you ask for chevaux, you find a document with cheval, but documents 
with chevaux come first, especially if the match is in the title. Documents with 
a URL that contains chevaux would also match, but rank after those.
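
To make that ordering concrete, here is a toy model in plain Java. It is not 
Solr's scoring: it ignores term frequency, length norms and the dismax tie 
parameter, and simply takes the highest boost among the text fields that match. 
The one-word stemFr method is a hypothetical stand-in for the French analyzer.

```java
// Illustration only: a toy "best matching field wins" ranking, not Solr's scoring.
class ChevauxRankingSketch {

    // Hypothetical stand-in for the French analyzer: knows a single plural.
    static String stemFr(String token) {
        return token.equals("chevaux") ? "cheval" : token;
    }

    // Toy score: the highest boost among the fields in which the term matches.
    static double score(String term, String documentText) {
        double best = 0.0;
        for (String token : documentText.split("\\s+")) {
            if (token.equals(term)) {
                best = Math.max(best, 3.0); // text^3: exact whitespace token
            }
            if (stemFr(token).equals(stemFr(term))) {
                best = Math.max(best, 1.5); // text_fr^1.5: stemmed token
            }
        }
        for (String token : documentText.split("[^\\p{L}]+")) {
            if (token.equals(term)) {
                best = Math.max(best, 1.2); // text_bits^1.2: letter runs only
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("chevaux", "les chevaux sont rapides"));
        System.out.println(score("chevaux", "un cheval rapide"));
        System.out.println(score("chevaux", "voir http://example.org/chevaux"));
    }
}
```

In this toy model the exact match scores 3.0, the stemmed match 1.5 and the 
URL match 1.2, which is the order described above.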

Enabling debug and explanation gives you the details of each such match, which 
is useful for understanding the ranking.

>> - In all applications I've worked on, indexing pages when they change is not 
>> enough because they are pages that depend on others... this needs to be 
>> addressed at the application level (think, e.g. about the dashboard, about 
>> "book" pages that enclose others): re-index triggers.
> This could be done in the component logic.

What I am saying is that you need a way for the application designers to add an 
"indexing listener" that would support that type of callback.

>> Another crucial aspect is to stimulate anyone working on a particular schema 
>> to be economic. The biggest flaw of the xwiki-lucene-module is that it 
>> indexed and stored everything... that meant that a single result document 
>> was quite big. Storing typically is probably not useful.
>> 
> Yep. If you store everything you will duplicate your XWiki database :)
> The schema.xml is a delicate point because once it's decided it should
> be frozen because the fields declared will then be used by other
> components via the API to retrieve the returned information.
> 
> I've found interesting the fact that you can declare dynamic fields
> which are associated to a given type using a prefix/suffix. This could
> be used as a way to extending the schema at runtime if an application
> needs to.

That indeed seems useful only for language-dependent fields such as those above.

> 
>> - particular scenarios will have particular UIs. Would you sketch one that 
>> would be default for 3.2? Would authors be facets? spaces?
>> 
> UI is another tricky point.
> 
> I am thinking about a "standard distribution", that is, how a UI
> leveraging SOLR as a search engine should appear if SOLR is integrated
> in XWiki by default.
> So basically the basic scenario is: everything in the UI stays the
> same and we just change the engine under the hood.

I would add the two following checkboxes:
- advanced (use qt=lucene then)
- debug (then show little links on each result which shows the result of the 
explanation: response.getExplainMap())
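
A small sketch of how the two checkboxes could translate into request 
parameters, in plain Java. It assumes that solrconfig.xml registers handlers 
under the names "dismax" and "lucene", which depends on the configuration that 
ends up being shipped. debugQuery is the usual Solr parameter for asking for 
explanations. The class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: maps the two UI checkboxes to Solr request parameters.
class SearchParamsSketch {

    static Map<String, String> requestParams(String userInput,
                                             boolean advanced,
                                             boolean debug) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userInput);
        // Assumed handler names; they must match what solrconfig.xml declares.
        params.put("qt", advanced ? "lucene" : "dismax");
        if (debug) {
            // Asks Solr to return per-document score explanations.
            params.put("debugQuery", "true");
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(requestParams("chevaux", false, false));
        System.out.println(requestParams("title:chevaux", true, true));
    }
}
```

With neither box ticked, the user "just types" and gets the dismax behaviour. 
The debug box only adds a parameter, so the same UI code path serves both cases.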


> However the fact that SOLR has a lot of interesting features (e.g.,
> facets) might drive the *standard* search UI towards some
> improvements.
> For example, as you suggested, spaces and authors could be interesting
> facets, but I would say also dates.

True, but note that three facet dimensions is already a lot.

If time allows, highlighting is also quite fundamental for trust in the search 
engine, something which has been quite low in the XWiki community and tools 
since the switch to Lucene.

> This is an open discussion.
> From your question I also understand that you are suggesting a way to
> customize the UI in order to take into account particular search
> scenarios.
> This would be great but I have no idea, at this point, about how to do
> it, and if it's really interesting in adding this flexibility in the
> standard distribution.

Well, I think the right way to do that is to leave sufficiently many Java 
objects available and documented.

I've indicated above a listener for indexing decisions.
I think another required aspect is to make it possible for applications to 
enrich or trim the index documents before they go into the index.
Both of these tasks should be doable from Groovy.

My old suggestion would be to add a listener this way:
  xwiki.solr.addIndexListener(xwiki.parseGroovyFromPage("MyApp.IndexListener"))
but maybe components do that better.

An index listener would implement an IndexListener interface with methods such 
as:

// note: not re-entrant!
// modifies the list of documents to be indexed
void notifyDocumentsWillBeIndexed(List docFullNames)

// modifies the SolrJ document
void notifyDocumentBeingIndexed(Document solrDoc)
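
To show how such a listener could be wired up, here is a self-contained sketch 
in plain Java. A Map stands in for the SolrJ document so that it runs without 
Solr. The registry class, the "Book." page names and the field names are all 
hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed callback; a Map stands in for the SolrJ document.
interface IndexListener {
    // May add or remove names in the list of documents about to be indexed.
    void notifyDocumentsWillBeIndexed(List<String> docFullNames);

    // May add or remove fields before the document goes into the index.
    void notifyDocumentBeingIndexed(Map<String, Object> solrDoc);
}

// Hypothetical dispatcher that the indexing component could call.
class IndexListenerRegistry {
    private final List<IndexListener> listeners = new ArrayList<>();

    void addIndexListener(IndexListener listener) {
        listeners.add(listener);
    }

    // Returns the final list of documents to index after all listeners ran.
    List<String> expand(List<String> changedDocs) {
        List<String> toIndex = new ArrayList<>(changedDocs);
        for (IndexListener listener : listeners) {
            listener.notifyDocumentsWillBeIndexed(toIndex);
        }
        return toIndex;
    }

    // Builds a minimal index document and lets listeners enrich or trim it.
    Map<String, Object> prepare(String fullName, String text) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("fullname", fullName);
        doc.put("text", text);
        for (IndexListener listener : listeners) {
            listener.notifyDocumentBeingIndexed(doc);
        }
        return doc;
    }
}

// Example: re-index the enclosing "book" page when one of its chapters changes.
class BookIndexListener implements IndexListener {
    public void notifyDocumentsWillBeIndexed(List<String> docFullNames) {
        boolean chapterChanged = false;
        for (String name : docFullNames) {
            if (name.startsWith("Book.Chapter")) {
                chapterChanged = true;
            }
        }
        if (chapterChanged && !docFullNames.contains("Book.WebHome")) {
            docFullNames.add("Book.WebHome");
        }
    }

    public void notifyDocumentBeingIndexed(Map<String, Object> solrDoc) {
        solrDoc.put("app", "book"); // enrich with an application-specific field
    }
}
```

The first callback covers the re-index triggers, the second covers enriching or 
trimming the document. Whether registration happens through a call like the one 
above or through components is left open here.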


Another customization could happen at query time, but I am not sure it's that 
easy here (I had to write a dedicated query processor).

> After all, if your scenario is that particular you can always write an
> application that uses a custom solrconfig.xml and schema and UI :)

Make sure that is possible without changes to the software!

Could you give details on how and where to install xwiki-platform-searchs-solr?
(I'm old-fashioned, these modern xwiki installs seem too easy to me)

paul
