[ https://issues.apache.org/jira/browse/JCR-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084283#comment-13084283 ]

Jukka Zitting commented on JCR-2959:
------------------------------------

If I understand correctly, the implementation in the patch still doesn't use 
the actual fields in Lucene for searching, it just loads the node using 
Session.getNodeByIdentifier() and uses the OperandEvaluator to get the values 
by which documents are to be compared. It seems to me that this is pretty much 
equivalent in terms of disk accesses to what the existing sorting mechanism 
does and the speedup you're seeing is probably simply caused by the sorting 
mechanism memoizing (caching) the values returned by the OperandEvaluator. I would 
expect that difference to fade away if the test set becomes larger than what 
fits into memory.

Some math to give us more background: the execution of a single-selector query 
currently consists of the following main steps:

    1. Constructing the Lucene query
    2. Executing the Lucene query
    3. Fetching all matching nodes to memory
    4. Sorting the nodes

Step 1 is just an in-memory mapping, so not too time-consuming. Steps 2 and 3 
are probably the most time-consuming bits here, but both are outside the scope 
of this issue. Finally, step 4 consists of O(n log n) in-memory operations, 
where n is the number of returned nodes. That's some time, but as these are 
in-memory operations the overhead should be pretty small compared to the 
previous steps, whose worst-case behavior is O(n) disk accesses.
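The in-memory step 4 is essentially a plain Collections.sort over property 
values that step 3 already fetched. A minimal, self-contained sketch of that 
step (the Node record and its lastModified property are hypothetical stand-ins, 
not Jackrabbit classes):

```java
import java.util.*;

public class InMemorySort {
    // Hypothetical stand-in for a node already loaded into memory in step 3.
    record Node(String id, long lastModified) {}

    // Step 4: O(n log n) sort over values that are already in memory,
    // analogous to comparing the values an OperandEvaluator would produce
    // for an ORDER BY operand.
    static List<Node> sortByLastModified(List<Node> hits) {
        List<Node> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingLong(Node::lastModified));
        return sorted;
    }

    public static void main(String[] args) {
        List<Node> hits = List.of(
                new Node("b", 30), new Node("a", 10), new Node("c", 20));
        System.out.println(sortByLastModified(hits).get(0).id()); // prints "a"
    }
}
```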

The proposed change here is to switch steps 3 and 4 around and use Lucene for 
the sorting. The time required to sort all the results is still O(n log n) and 
may well require some extra disk accesses to fetch the sort fields from the 
Lucene index. However, the time taken in this step should still be 
significantly lower than in fetching all the matching nodes to memory. Thus the 
big benefit that we can gain from this optimization comes in cases where we 
*don't* want to fetch all the matching nodes to memory.
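The core of that reordering can be sketched in isolation: order the matching 
document ids using only sort values stored in the index, deferring node loading 
entirely. The map and method names below are illustrative stand-ins for the 
Lucene index and its stored sort fields, not actual Jackrabbit or Lucene APIs:

```java
import java.util.*;

public class IndexSideSort {
    // Sort matching document ids by a field value kept in the index,
    // without loading any node from the repository first. Node loading
    // is deferred until the caller actually reads the results.
    static List<Integer> sortByIndexedField(List<Integer> matchingDocs,
                                            Map<Integer, String> indexedSortField) {
        List<Integer> docs = new ArrayList<>(matchingDocs);
        docs.sort(Comparator.comparing(indexedSortField::get));
        return docs;
    }

    public static void main(String[] args) {
        // Docs 1..3 with indexed sort values "c", "a", "b".
        System.out.println(sortByIndexedField(
                List.of(1, 2, 3),
                Map.of(1, "c", 2, "a", 3, "b"))); // prints [2, 3, 1]
    }
}
```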

A good test case for such a scenario could be a data set of 1m nodes, of which 
a query targets a subset of 100k nodes, and only the first 100 nodes of the 
sorted results are actually being read. With a proper implementation of this 
mechanism, the Lucene search would find all the matching 100k documents and 
read the sort fields (in the Lucene index) of those documents for quickly 
ordering the results. That should still be pretty fast compared to fetching all 
the 100k nodes into memory, which is what the current implementation (and the 
current patch) does. Fetching only the first 100 results should then take 
almost no time at all. I'd expect up to an order of magnitude of performance 
improvement for such a scenario.
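When only the first 100 of 100k sorted matches are read, a bounded heap keeps 
the memory cost at O(k) rather than O(n), and only the k winners ever need to 
be fetched as full nodes. A rough sketch of that idea in plain Java (not the 
actual query engine):

```java
import java.util.*;

public class TopK {
    // Keep only the k smallest sort keys seen so far in a bounded max-heap:
    // each of the n keys costs O(log k), for O(n log k) total, and memory
    // stays O(k) no matter how many documents match.
    static List<Long> topK(Iterable<Long> sortKeys, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<>(Comparator.reverseOrder());
        for (long key : sortKeys) {
            heap.offer(key);
            if (heap.size() > k) {
                heap.poll(); // evict the current largest key
            }
        }
        List<Long> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topK(List.of(5L, 1L, 4L, 2L, 3L), 2)); // prints [1, 2]
    }
}
```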

> SQL2 QueryEngine should use Lucene for sort
> -------------------------------------------
>
>                 Key: JCR-2959
>                 URL: https://issues.apache.org/jira/browse/JCR-2959
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: jackrabbit-core
>    Affects Versions: 2.2.1
>            Reporter: Robert Seidel
>            Priority: Minor
>         Attachments: JCR-2959.patch
>
>
> Currently all SQL2 queries with "order by" expression are sorted with 
> Collections.sort - which is slow and very bad if there are lots of hits. 
> Lucene should be used for sorting hits instead.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
