[
https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084283#comment-13084283
]
Jukka Zitting commented on JCR-2959:
------------------------------------
If I understand correctly, the implementation in the patch still doesn't use
the actual fields in Lucene for searching; it just loads the node using
Session.getNodeByIdentifier() and uses the OperandEvaluator to get the values
by which documents are to be compared. That seems pretty much equivalent, in
terms of disk accesses, to what the existing sorting mechanism does, and the
speedup you're seeing is probably caused simply by the sorting mechanism
memoizing the values returned by the OperandEvaluator. I'd expect that
difference to fade away once the test set grows larger than what fits into
memory.
Some math to give us more background: the execution of a single-selector query
currently consists of the following main steps:
1. Constructing the Lucene query
2. Executing the Lucene query
3. Fetching all matching nodes to memory
4. Sorting the nodes
Step 1 is just an in-memory mapping, so not too time-consuming. Steps 2 and 3
are probably the most time-consuming bits here, but both are outside the scope
of this issue. Finally, step 4 consists of O(n log n) in-memory operations,
where n is the number of returned nodes. That takes some time, but since these
are in-memory operations the overhead should be pretty small compared to the
previous steps, whose worst-case behavior is O(n) disk accesses.
The proposed change here is to switch steps 3 and 4 around and use Lucene for
the sorting. The time required to sort all the results is still O(n log n) and
may well require some extra disk accesses to fetch the sort fields from the
Lucene index. However, the time taken in this step should still be
significantly lower than in fetching all the matching nodes to memory. Thus the
big benefit that we can gain from this optimization comes in cases where we
*don't* want to fetch all the matching nodes to memory.
A good test case for such a scenario could be a data set of 1m nodes, of which
a query targets a subset of 100k nodes, and only the first 100 nodes of the
sorted results are actually being read. With a proper implementation of this
mechanism, the Lucene search would find all the matching 100k documents and
read the sort fields (in the Lucene index) of those documents for quickly
ordering the results. That should still be pretty fast compared to fetching all
the 100k nodes into memory, which is what the current implementation (and the
current patch) does. And then fetching only the first 100 results should take
almost no time at all. I'd expect up to an order-of-magnitude performance
improvement in such a scenario.
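The "first 100 of 100k" scenario above can be sketched in plain Java with a bounded heap, which is essentially how a search index keeps the top k ordered results without sorting everything. The numbers and helper names here are illustrative only, not Jackrabbit or Lucene code:

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Illustrative sketch: once the sort values are available cheaply (e.g. from
// index fields), selecting the top k of n results takes O(n log k)
// comparisons with a bounded heap, and only the k winners ever need to be
// fetched as full nodes.
public class TopK {

    // Returns the k smallest sort values out of the candidates, in ascending
    // order, without materializing a full sorted copy of all n entries.
    static long[] smallestK(long[] sortValues, int k) {
        // Max-heap of size k: the root is the worst of the current best k.
        PriorityQueue<Long> heap =
                new PriorityQueue<Long>(k, Collections.<Long>reverseOrder());
        for (long v : sortValues) {
            if (heap.size() < k) {
                heap.add(v);
            } else if (v < heap.peek()) {
                heap.poll();
                heap.add(v);
            }
        }
        long[] out = new long[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll(); // drain largest-first, fill from the back
        }
        return out;
    }

    public static void main(String[] args) {
        // 100k candidate sort values (a permutation of 0..99999).
        long[] values = new long[100000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (i * 7919L) % values.length;
        }
        long[] first100 = smallestK(values, 100);
        System.out.println(first100[0] + ".." + first100[99]); // prints 0..99
    }
}
```

Only the 100 winning entries would then need to be resolved into actual nodes, which is where the savings over fetching all 100k nodes come from.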
> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
> Key: JCR-2959
> URL: https://issues.apache.org/jira/browse/JCR-2959
> Project: Jackrabbit Content Repository
> Issue Type: Improvement
> Components: jackrabbit-core
> Affects Versions: 2.2.1
> Reporter: Robert Seidel
> Priority: Minor
> Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with
> Collections.sort - which is slow and very bad if there are lots of hits.
> Lucene should be used for sorting hits instead.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira