Toke Eskildsen created LUCENE-6828:
--------------------------------------

             Summary: Speed up requests for many rows
                 Key: LUCENE-6828
                 URL: https://issues.apache.org/jira/browse/LUCENE-6828
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
    Affects Versions: 5.3, 4.10.4
            Reporter: Toke Eskildsen
            Priority: Minor


Standard relevance ranked searches for top-X results uses the HitQueue class to 
keep track of the highest scoring documents. The HitQueue is a binary heap of 
ScoreDocs and is pre-filled with sentinel objects upon creation.

Binary heaps of Objects in Java does not scale well: The HitQueue uses 28 
bytes/element and memory access is scattered due to the binary heap algorithm 
and the use of Objects. To make matters worse, the use of sentinel objects 
means that even if only a tiny number of documents matches, the full amount of 
Objects is still allocated.

As long as the HitQueue is small (< 1000), it performs very well. If top-1M 
results are requested, it performs poorly and leaves 1M ScoreDocs to be garbage 
collected.

An alternative is to replace the ScoreDocs with a single array of packed longs, 
each long holding the score and the document ID. This strategy requires only 8 
bytes/element and is a lot lighter on the GC.

Some preliminary tests has been done and published at 
https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/
These indicate that a long[]-backed implementation is at least 3x faster than 
vanilla HitDocs for top-1M requests.

For smaller requests, such as top-10, the packed version also seems 
competitive, when the amount of matched documents exceeds 1M. This needs to be 
investigated further.

Going forward with this idea requires some refactoring as Lucene is currently 
hardwired to the abstract PriorityQueue. Before attempting this, it seems 
prudent to discuss whether speeding up large top-X requests has any value? 
Paging seems an obvious contender for requesting large result sets, but I guess 
the two could work in tandem, opening up for efficient large pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to