Hi everyone,
I had a look at the search-related code over the last few days, because we
need better performance for range queries on date fields as well as for sorting
by date fields. These are my thoughts so far:
1. Wouldn't it make sense to exclude the index for the "jcr:system" tree (which
is located at repository/index by default) if the query to execute doesn't
include items from the "jcr:system" tree?
Take for example a query like "my:app//element(*, foo:bar)". This query only
searches for nodes located under "my:app" which excludes nodes from "jcr:system"
and therefore doesn't need to search in the "jcr:system" index.
As the "jcr:system" tree might grow quite quickly if you create a lot of
versions, it might be worth excluding its index.
I'm not sure though how hard it would be to find out if a query needs to include
the "jcr:system" index.
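To make the idea a bit more concrete, here is a rough sketch of such a check.
It only looks at the first location step of an XPath statement and answers
"yes, include the jcr:system index" whenever it is unsure. The class and method
names are made up for illustration and are not part of Jackrabbit; a real
implementation would presumably work on the parsed query tree instead of the
raw string.

```java
/** Hypothetical helper, not part of Jackrabbit: a conservative string-based check. */
final class SystemIndexFilter {

    private static final String ROOT = "/jcr:root";

    /** Returns false only if the first path step clearly excludes "jcr:system". */
    static boolean needsSystemIndex(String xpathStatement) {
        String s = xpathStatement.trim();
        // Strip an explicit "/jcr:root" prefix if present.
        if (s.equals(ROOT) || s.startsWith(ROOT + "/")) {
            s = s.substring(ROOT.length());
        }
        // An empty path or a leading "//" searches the whole workspace,
        // which includes the "jcr:system" tree.
        if (s.isEmpty() || s.startsWith("//")) {
            return true;
        }
        if (s.startsWith("/")) {
            s = s.substring(1);
        }
        // Isolate the first location step (up to the next '/' or predicate).
        int end = s.length();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '/' || c == '[') {
                end = i;
                break;
            }
        }
        String firstStep = s.substring(0, end).trim();
        // Wildcards and node tests such as element(...) could match "jcr:system".
        if (firstStep.isEmpty() || firstStep.contains("*") || firstStep.contains("(")) {
            return true;
        }
        return firstStep.equals("jcr:system");
    }
}
```

With this sketch, "my:app//element(*, foo:bar)" would skip the "jcr:system"
index, while "//element(*, foo:bar)" would still include it.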
2. Lucene uses FieldCaches to speed up sorting and range queries, which is
exactly what we are after. Those FieldCaches are per IndexReader.
Jackrabbit uses an IndexSearcher which searches on a single IndexReader which is
most likely to be an instance of CachingMultiReader. So on every search which
builds up a FieldCache this FieldCache instance is associated with this instance
of a CachingMultiReader. On successive queries which operate on this
CachingMultiReader you will get a tremendous speedup for queries which can reuse
those associated FieldCache instances.
The problem is that Jackrabbit creates a new CachingMultiReader _every time_ one
of the underlying indexes is modified. This means if you just change _one_ item
in the repository you will need to rebuild all those FieldCaches because the
existing FieldCaches are associated with the old instance of CachingMultiReader.
This not only leads to slow search response times for queries which contain
range queries or are sorted by a field, but also leads to massive memory
consumption (depending on the size of your indexes) because there might be
multiple instances of CachingMultiReaders in use if you have a scenario where a
lot of queries and item modifications are executed concurrently.
As far as I understand, the solution is to use a MultiSearcher which uses
multiple IndexReaders. Since most of the indexes are stable due to the merging
strategy, their FieldCaches can be reused for a much longer time.
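To illustrate the difference, here is a small self-contained simulation. It
does not use Lucene at all: SubReader, CombinedReader and FieldCacheSketch are
made-up stand-ins, and the only assumption is a cache keyed by reader identity
as described above. It counts how many field values have to be loaded when the
cache hangs off the combined reader versus off each sub-reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-in for a reader over one sub-index (not a Lucene class). */
final class SubReader {
    final long[] dates;
    SubReader(long[] dates) { this.dates = dates; }
}

/** Hypothetical stand-in for a combined reader such as CachingMultiReader. */
final class CombinedReader {
    final List<SubReader> subReaders;
    CombinedReader(List<SubReader> subReaders) {
        this.subReaders = new ArrayList<>(subReaders);
    }
}

/** Cache keyed by reader identity; counts how many values were loaded. */
final class FieldCacheSketch {
    private final Map<Object, long[]> cache = new WeakHashMap<>();
    int valuesLoaded = 0;

    /** One cache entry per combined reader: lost whenever a new one is created. */
    long[] forCombined(CombinedReader reader) {
        return cache.computeIfAbsent(reader, key -> {
            int total = 0;
            for (SubReader sub : reader.subReaders) total += sub.dates.length;
            long[] all = new long[total];
            int pos = 0;
            for (SubReader sub : reader.subReaders) {
                System.arraycopy(sub.dates, 0, all, pos, sub.dates.length);
                pos += sub.dates.length;
            }
            valuesLoaded += total;
            return all;
        });
    }

    /** One cache entry per sub-reader: entries of unchanged sub-readers survive. */
    List<long[]> forEachSubReader(CombinedReader reader) {
        List<long[]> result = new ArrayList<>();
        for (SubReader sub : reader.subReaders) {
            result.add(cache.computeIfAbsent(sub, key -> {
                valuesLoaded += sub.dates.length;
                return sub.dates.clone();
            }));
        }
        return result;
    }
}
```

For example, with three stable sub-indexes of 1000 values each and one small
volatile index that is replaced between two queries (1 value, then 2 values),
the per-combined-reader variant loads 3001 + 3002 = 6003 values, whereas the
per-sub-reader variant loads 3001 + 2 = 3003.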
I just tried to quickly modify SearchIndex to use a MultiSearcher with multiple
IndexReaders wrapped by IndexSearchers but wasn't successful because somewhere
in DescendantSelfAxisWeight the index readers are required to implement
HierarchyResolver which ReadOnlyIndexReader doesn't.
So I thought I might ask what you think about those two ideas before spending
too much time walking down the wrong path ;)
Cheers,
Christoph
- Optimize search performance Christoph Kiehl