> Christoph Kiehl wrote:
> But as you mentioned in your previous mail there are some
> problematic queries which are way to slow like ChildAxisQuery
> or DescendantSelfAxisQuery. All queries that need to read
> lucene documents instead of just using a query get pretty
> slow with large repositories. But I didn't see a way yet how
> to substantially improve performance while using lucene. I
> even thought of using some other kind of indexing since lucene ...
>
> Internally we use a specific mixin for our documents as a
> workaround. This way I can avoid ChildAxisQueries and the
> like. I just query for "//element(*, foo-mix:document)[...]"
> for example. But that is just a dirty workaround.
This argument holds for my solution as well :-(

> I would really like to find a solution to those problems.
> Maybe we should use some additional kind of index for
> resolving parent-child relations. Do you have any ideas yet
> how improve performance in those areas?

AFAICS, when we want to solve it within lucene with querying, we will have a
trade-off between "fast searching" and "fast moving of nodes" (I'll get back
to this one).

Currently, we are building a layer on top of JackRabbit that, amongst many
other things, at least needs to be able to:

1) port legacy code which had slide as repository
2) show all documents/nodes through faceted navigation

Since we have quite a few large projects running with slide as repository,
and since we use a custom slide/lucene index to be able to search fast, I
need some queries in JackRabbit to be much faster than currently possible.
Obviously, since (2) must be implemented, almost every call to JackRabbit
will be a search.

A very basic search that we run hundreds of times for legacy slide projects
would be:

/documents/en/news//[EMAIL PROTECTED] order by @modificationDate

Typically, a news folder contains tens of thousands of items, and this query
is not feasible with the current JackRabbit impl (at least, my experience is
that for > 10.000 docs this query takes multiple seconds, while I need the
result in < 50ms (50 is really the max IMO)).

Now, for some queries that I control exactly, so I know I won't get queries
like /documents/en[1]/news[1] or documents/[EMAIL PROTECTED]/news or
documents/*/news, but only queries that look like
/nodename/nodename/nodename/**[......], I chose to translate the initial
path part into something like:

TermQuery(new Term(FieldNames.INITIAL_PATH, path))

where for example path='/documents/en/news' (see the sketch at the end of
this mail).

Obviously, this only works when I index a node's path in some lucene field.
So a node with path /documents/en/news/2007/10/14/item.xml would have a
lucene Field that contains the terms

'/documents/en/news/2007/10/14/item.xml'
'/documents/en/news/2007/10/14'
'/documents/en/news/2007/10'
'/documents/en/news/2007'
'/documents/en/news'
'/documents/en'
'/documents'

Obviously, this results in a very fast, simple lucene search for "give me
all nodes starting with path x", because it is just one simple TermQuery.
The major disadvantage is that moving a node is now very costly, because it
requires re-indexing the whole tree below that node. Also, I can only use
this for queries with a fixed 'start-path', though it might be enhanced to
support '*' and /[EMAIL PROTECTED]

Bottom line: I haven't found the holy grail either, but at least I get
responses within milliseconds for hundreds of thousands of nodes :-)

I am not sure there is a solution that gives fast searching for
DescendantSelfAxisQuery and fast moving of nodes at the same time. I chose
to be able to search fast, and hope people won't be moving the node directly
under the root too many times :-)

Regards Ard
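
For illustration, a minimal sketch of what the path-prefix indexing and the
single-TermQuery lookup described above could look like, assuming the plain
Lucene 2.x API; the field name INITIAL_PATH, the modificationDate handling
and the class name are placeholders here, not JackRabbit's actual internals:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;

public class InitialPathSketch {

    /** Placeholder for the FieldNames.INITIAL_PATH field mentioned above. */
    static final String INITIAL_PATH = "initialPath";

    /**
     * '/documents/en/news/2007/10/14/item.xml' becomes a list of that path
     * plus all of its ancestor paths down to '/documents'.
     */
    static List<String> ancestorPaths(String path) {
        List<String> terms = new ArrayList<String>();
        String p = path;
        while (p.lastIndexOf('/') > 0) {
            terms.add(p);
            p = p.substring(0, p.lastIndexOf('/'));
        }
        terms.add(p); // the single-segment path, e.g. '/documents'
        return terms;
    }

    /**
     * Index one node: every ancestor path is added as an untokenized term,
     * so a later "everything below path x" lookup is a single TermQuery.
     * The modification date is indexed as a fixed-width string (for example
     * "20071014093000") so that lexicographic order equals date order.
     */
    static void indexNode(IndexWriter writer, String path, String modificationDate)
            throws Exception {
        Document doc = new Document();
        for (String term : ancestorPaths(path)) {
            doc.add(new Field(INITIAL_PATH, term,
                    Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
        doc.add(new Field("modificationDate", modificationDate,
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
    }

    /**
     * "Give me all nodes below startPath, newest first": one TermQuery on
     * the ancestor-path field plus a Lucene sort, no path traversal at all.
     */
    static Hits findBelow(IndexSearcher searcher, String startPath)
            throws Exception {
        TermQuery query = new TermQuery(new Term(INITIAL_PATH, startPath));
        Sort byDate = new Sort(
                new SortField("modificationDate", SortField.STRING, true));
        return searcher.search(query, byDate);
    }
}

The price, as noted above, is the move case: renaming or moving a node means
every descendant's ancestor-path terms change, so the whole subtree has to be
re-indexed.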
