Hi devs,

my research on Lucene querying for Watch
(http://jira.xwiki.org/jira/browse/XWATCH-116) over the past weeks led me to
the following results:
First of all, as I already mentioned, since Lucene is a full-text indexing
engine and Watch data, even if it is wiki data, is quite structured (and so are
its queries), something feels not quite right about this approach. But the
query speed improvements are (or should be) significant, so we could try to
work around that -- the XWiki Lucene plugin indexes object properties in
documents as fields, so a structured search is basically plausible (with the
right type of Lucene fields, etc.). Even so, not all Watch SQL queries are
fully translatable to Lucene queries without Watch-specific Lucene querying
code and/or Watch-specific Lucene indexing.
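To make the "structured search over indexed properties" idea concrete, here is a small sketch of building such a query string. The field naming convention ("object" plus a flattened "Class.property" field) is an assumption about how the plugin exposes object properties, not the plugin's confirmed API:

```java
// Sketch: assembling a structured Lucene query over indexed object
// properties. The field names ("object", "XWiki.ArticleClass.feed",
// "XWiki.ArticleClass.read") are ASSUMPTIONS about how the XWiki Lucene
// plugin flattens object properties into fields; the real plugin may
// use different conventions.
public class WatchLuceneQuerySketch {
    static String articlesInFeed(String feedName, boolean unreadOnly) {
        StringBuilder q = new StringBuilder();
        // restrict to documents that hold an article object
        q.append("object:\"XWiki.ArticleClass\"");
        // hypothetical flattened property field: <ClassName>.<property>
        q.append(" AND XWiki.ArticleClass.feed:\"").append(feedName).append('"');
        if (unreadOnly) {
            q.append(" AND XWiki.ArticleClass.read:0");
        }
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(articlesInFeed("xwiki-news", true));
    }
}
```

The point is that such queries only work if the relevant properties are indexed with the right field types, which is exactly where Watch-specific indexing code would creep in.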

Second, there is a problem with Lucene's reliability and "response time", as
mentioned in the JIRA issue comments:
* There is a delay between the moment a document is created/modified and the
moment the change is retrievable through Lucene queries, because it first
needs to be indexed by Lucene. This fuzziness, although acceptable in some
situations (for example, retrieving the list of articles to show to the user),
is not acceptable in situations like article property updates (star, read,
trash) or feed adding/deleting -- "caching" these changes until they become
retrievable through Lucene queries is not an option at all.
* The XWiki Lucene plugin seems quite buggy/unstable: lots of exceptions (a
server restart seems to be fatal due to a hung file lock), duplicate documents
in the results sometimes, missing results at other times -- all explainable
and acceptable in the "fuzzy" setting of a full-text search engine, but not
when trying to use it for structured search.
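For concreteness, the kind of overlay that "caching" the recent changes would require might look like the sketch below (class and method names are hypothetical). Even this minimal version shows why it gets unpleasant: every read path would have to consult the overlay, and it only covers point lookups, not the list queries Watch actually runs -- hence the objection above:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an overlay that answers property reads
// (e.g. the "read" flag) from memory until Lucene has re-indexed the
// document. It illustrates the indexing-delay problem; it is not a
// proposed solution.
public class RecentChangeOverlay {
    private final Map<String, Boolean> pendingReadFlags = new ConcurrentHashMap<>();

    // record a flag change immediately, before Lucene re-indexes the doc
    public void markRead(String docName, boolean read) {
        pendingReadFlags.put(docName, read);
    }

    // consult the overlay first, fall back to the (possibly stale) index
    public boolean isRead(String docName, boolean valueFromIndex) {
        return Optional.ofNullable(pendingReadFlags.get(docName)).orElse(valueFromIndex);
    }

    // once the indexer has re-indexed the document, drop the overlay entry
    public void indexed(String docName) {
        pendingReadFlags.remove(docName);
    }
}
```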

Despite all this, I devised some code to help me test Lucene and compare it to
SQL in the real Watch situation, for article list retrieval. The results for
the time spent on the server (I assumed that the time taken to transport
documents and print them in the Reader is the same regardless of the querying
technique), for a MySQL server with ~60000 articles in ~200 feeds, are:
* For the initial loading of the articles on a freshly started server, SQL can
take up to 30-40 seconds, while Lucene takes "only" up to 20 (usually 15-16).
* For the initial load of the interface on a server that is already warmed up,
it takes ~15 seconds for SQL and ~4-5 on average (but can go up to 10) for
Lucene.
* For a click on the All group, which is basically the same query as the
initial load of the article list, Lucene can go under a second while SQL takes
around 7-9 seconds.
* For a click on a feed with 1023 articles (i.e. loading the list of articles
in a specific feed), SQL takes around 3 seconds while Lucene can take from
under a second to a couple of seconds, depending on the time taken to load the
actual documents corresponding to the search results.
* For pagination navigation, Lucene takes a second on average and SQL 2-3
seconds.

Note that Lucene retrieval still uses the database and SQL access, because its
results are LuceneDocuments (which hold the names of XWikiDocuments), not
XWikiDocuments themselves.
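That two-phase shape -- Lucene produces document names, then each name is loaded from the database -- can be sketched as below (the types and the loader function are hypothetical stand-ins for LuceneDocument and XWikiDocument, not the real API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the two-phase retrieval described above: the Lucene hits
// hold only document names, so each hit still costs a database load.
// This is why part of the SQL cost stays on the hot path even with
// Lucene answering the query itself.
public class TwoPhaseRetrieval {
    static <D> List<D> loadHits(List<String> luceneHitNames, Function<String, D> dbLoader) {
        List<D> docs = new ArrayList<>();
        for (String name : luceneHitNames) {
            // one database load per hit (unless batched)
            docs.add(dbLoader.apply(name));
        }
        return docs;
    }
}
```

This also explains the variance noted above for the per-feed query: the Lucene search itself is fast, but the total time depends on how many hit documents have to be fetched from the database afterwards.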

All the tests above were made with a server running on my computer (MySQL
5.0.51a, XWiki 1.6-SNAPSHOT, Java 1.5.0_15, AMD Turion64 X2 TL-56, 1.8 GB
RAM), but with other applications running too.

I feel that, overall, the Lucene querying improvements are not so spectacular,
especially because Lucene cannot handle all situations -- we would still need
SQL querying in some cases -- and because of its relative instability (which
we could think about fixing, though).

The other option for performance improvements in Watch would be a
Watch-specialized server (as we've already discussed), which would allow us,
among other things, to use Watch-specific SQL queries (as opposed to the
generic queries we use now because we have to go through the XWiki GWT API)
and to optimize as much as possible at that level. I haven't yet measured the
improvement but, since I think we might be able to drop some tables from some
of the SQL joins we're doing right now, it should be better. Of course, this
kind of approach requires heavy refactoring, and potentially completely
rewriting and rearchitecting some pieces.

Despite the coding challenge, I'd go for the second approach. WDYT?

Happy coding,
Anca Luca
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs