Anca Paula Luca wrote:
> Hi devs,
>
> my research on Lucene querying for Watch
> (http://jira.xwiki.org/jira/browse/XWATCH-116) over the past weeks
> brought me to the following results:
>
> First of all, as I already mentioned, Lucene is a fulltext indexing
> engine while Watch data, even if it is wiki data, is quite structured
> (and so are its queries), so something feels not right in trying to do
> this. But the query speed improvements are / should be significant, so
> we could try to work around this: the XWiki Lucene plugin indexes
> object properties in documents as fields, so basically a structured
> search is plausible (with the right type of Lucene fields, etc.).
> Even so, not all Watch SQL queries are fully translatable to Lucene
> queries without Watch specific Lucene querying code and / or Watch
> specific Lucene indexing.
>
> Second, there is a problem with Lucene reliability and "response
> time", as mentioned in the jira issue comments:
> * there is a delay between the moment a document is created / modified
>   and the moment the change is retrievable through Lucene queries,
>   because it first needs to be indexed by Lucene. This fuzziness,
>   although acceptable in some situations (for example, retrieving the
>   list of articles to show to the user), is not desired in situations
>   like article property updates (star, read, trash) or feed adding /
>   deleting -- "caching" these changes until they are retrievable
>   through Lucene querying is not at all an option.
> * the XWiki Lucene plugin seems quite buggy / unstable: a lot of
>   exceptions (a server restart seems to be fatal due to some hung file
>   lock), duplicate documents in results sometimes, missing results
>   sometimes, all explainable and acceptable in the "fuzzy" situation
>   of a fulltext search engine, but not when trying to use it for
>   structured search.
>
> Despite all this, I wrote some code to help me test Lucene and compare
> it to SQL in a real Watch situation, for the article list retrieval.
> The results for the time spent on the server (I assumed that the time
> taken to transport documents and print them in the Reader is the same
> regardless of the querying technique), for a mysql server with ~60000
> articles in ~200 feeds, are:
> * for the initial loading of the articles, in a newly started server,
>   SQL can take up to 30-40 seconds, Lucene takes "only" up to 20
>   (15-16)
> * for the initial load of the interface, in a non-newly started
>   server, it takes ~15 seconds for SQL and ~4-5 on average (but can go
>   up to 10) for Lucene
> * for a click on the All group, which is basically the same query as
>   for the initial load of the article list, it can go under a second
>   for Lucene and takes around 7-9 seconds for SQL
> * for a click on a feed with 1023 articles (therefore loading the list
>   of articles in a specific feed), SQL goes up to 3 seconds while
>   Lucene can take from under a second to a couple of seconds,
>   depending on the time taken to load the actual documents
>   corresponding to the search results
> * for pagination navigation, Lucene takes a second on average and SQL
>   2-3 seconds.
>
> Note that Lucene retrieval still uses the database and SQL access
> because its results are LuceneDocuments (that hold names of
> XWikiDocuments), not XWikiDocuments.
>
> All the tests above were made with a server running on my computer
> with mysql 5.0.51a, XWiki 1.6-SNAPSHOT, java 1.5.0_15, AMD Turion64 x2
> TL-56, 1.8 GB RAM, but with other applications running too.
>
> I feel that, overall, the Lucene querying improvements are not so
> spectacular, especially because it cannot solve all situations and we
> would still need SQL querying in some cases, and because of its
> relative instability (which we could think about fixing, though).
>
> The other option for performance improvements in Watch would be to
> have a Watch specialized server (as we've already discussed) which
> would allow, amongst other things, using Watch specific SQL queries
> (as opposed to now, when we use generic queries since we have to go
> through the XWiki GWT API) and optimizing as much as possible at that
> level. I haven't yet tested the amount of improvement but, since I
> think we might be able to drop some tables from some of the SQL joins
> we're doing right now, it should be better.

I also did the tests for the specific Watch queries with a Watch server,
and the results are as follows:
* initial load on a newly started server: it seems to take around 10
  seconds (which is even lower than Lucene?)
* initial load: around 7 seconds
* click on the "All" group: 4-5-6 seconds on average (as low as 3 but
  as high as 7 as well)
* load a large feed, now with 1120 articles: under a second (0.7-0.8)
* pagination navigation: big variations, from times as low as half a
  second to times as high as 3-4 seconds. I'd say 2-3 on average.

Indeed, I managed to drop a table from the select, but besides this I
did not otherwise try to optimize it for these tests. The only indexes I
have in my database are the ones in the database administration guide at
http://platform.xwiki.org/xwiki/bin/view/AdminGuide/Database+Administration
with no indexes whatsoever on the feeds custom mapping tables.

Since there are now only two tables left in the articles query, join
optimizations (with indexes) are more feasible than before, and they
could bring even more improvement to the times, as we can already see in
the times for loading a single feed: that very good time might be a
result of the "join where" optimization of mysql.

Happy coding,
Anca Luca

> Of course, this kind of approach requires heavy refactoring, and
> potentially complete rewriting and rearchitecting of some pieces.
>
> Despite the coding challenge, I'd go for the second approach, WDYT?
>
> Happy coding,
> Anca Luca
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs
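
To make the join optimization idea from the reply a bit more concrete, here is a minimal sketch of a two-table articles query with an index added on the join column. Everything in it is an assumption for illustration only: the `feeds` and `articles` tables, their columns, and the index name are hypothetical and are not the actual Watch custom mapping schema, and it uses an in-memory SQLite database from the Python standard library as a stand-in for mysql, so timings and query plans will differ from the real setup.

```python
import sqlite3

# Hypothetical two-table layout (not the real Watch mapping tables).
def build_db(n_feeds=20, per_feed=50):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feeds (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE articles ("
        "id INTEGER PRIMARY KEY, feed_id INTEGER, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO feeds VALUES (?, ?)",
        [(f, f"feed-{f}") for f in range(n_feeds)],
    )
    conn.executemany(
        "INSERT INTO articles (feed_id, title) VALUES (?, ?)",
        [(f, f"article-{f}-{i}")
         for f in range(n_feeds) for i in range(per_feed)],
    )
    return conn

# Articles of one feed, joined on the feed id.
QUERY = """
SELECT a.title
FROM articles a JOIN feeds f ON a.feed_id = f.id
WHERE f.name = ?
ORDER BY a.id
"""

def articles_in_feed(conn, feed_name):
    return [row[0] for row in conn.execute(QUERY, (feed_name,))]

def query_plan(conn, feed_name):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (feed_name,))
    return [row[-1] for row in rows]

conn = build_db()
before = articles_in_feed(conn, "feed-3")
plan_before = query_plan(conn, "feed-3")

# Assumed optimization: index the join column on the articles side.
conn.execute("CREATE INDEX idx_articles_feed ON articles (feed_id)")
after = articles_in_feed(conn, "feed-3")
plan_after = query_plan(conn, "feed-3")

print("plan without index:", plan_before)
print("plan with index:   ", plan_after)
print("same results:", before == after)
```

The index must not change the result set, only the access path the planner may choose. On mysql the equivalent check would be to compare `EXPLAIN` output for the articles query before and after creating the index.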

