Jamie [ja...@mailarchiva.com] wrote:
> It would be nice if, in future, the Lucene API could provide a
> searchAfter that takes a position (int).

It would not really help with large result sets. At least not with the current 
underlying implementations. This is tied into your current performance problem, 
if I understand it correctly.

We seem to have isolated your performance problems to large (10M+) result sets, 
right?

Requesting the top X results in Lucene works internally by adding to a Priority 
Queue. The problem with PQs is that they work really well for small result sets 
and really badly for large ones (note that result set refers to the collected 
documents, not to the number of matching documents). A PQ rearranges its 
internal structure each time a hit is entered with a score >= the lowest known 
score. With millions of documents in the result set, this happens all the time. 
In the abstract there is little difference between small result sets and large: 
O(n * log n) scales fine. In practice, rearranging the internal heap structure 
is only fast while the heap fits in the CPU cache.
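To make the mechanism concrete, here is a minimal sketch of bounded top-K collection in plain Java. It uses java.util.PriorityQueue rather than Lucene's internal PQ, and the class and method names are made up for illustration, but the access pattern is the same: once the heap is full, every hit that beats the current minimum evicts the root and sifts, which is exactly the constant rearranging described above.

```java
import java.util.PriorityQueue;

public class TopKSketch {
    // Keep the top k hits by score. Each entry is {score, docID}; storing the
    // docID as a float is acceptable for this sketch (exact up to 2^24).
    static int[] topK(int k, float[] scores) {
        // Min-heap on score: the root is the lowest score currently kept.
        PriorityQueue<float[]> pq =
            new PriorityQueue<>(k, (a, b) -> Float.compare(a[0], b[0]));
        for (int doc = 0; doc < scores.length; doc++) {
            if (pq.size() < k) {
                pq.add(new float[]{scores[doc], doc});
            } else if (scores[doc] > pq.peek()[0]) {
                pq.poll();                              // evict lowest score
                pq.add(new float[]{scores[doc], doc});  // O(log k) sift
            }
        }
        // Drain ascending, fill from the back to get best-score-first docIDs.
        int[] out = new int[pq.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = (int) pq.poll()[1];
        return out;
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.5f, 0.7f, 0.3f};
        // Top 3 docIDs, best first: doc 1 (0.9), doc 3 (0.7), doc 2 (0.5)
        System.out.println(java.util.Arrays.toString(topK(3, scores)));
    }
}
```

With k in the millions, the heap no longer fits in cache and each of those sifts touches cold memory, which is where the numbers below fall off.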


To test this, I created the tiny project https://github.com/tokee/luso 

It simulates the workflow (for an extremely loose value of 'simulates') you 
described with extraction of a large result set by filling a PQ of a given size 
with docIDs (ints) and scores (floats) and then extracting the ordered docIDs. 
Running it with different sizes shows how the PQ deteriorates on a 4 core i7 
with 8MB level 2 cache:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 1 1000 10000 100000 500000 
1000000 5000000 10000000 20000000 30000000 40000000"

Starting 1 threads with extraction method pq
       1,000 docs in mean      15 ms,    66 docs/ms.
      10,000 docs in mean      47 ms,   212 docs/ms.
     100,000 docs in mean      65 ms, 1,538 docs/ms.
     500,000 docs in mean     385 ms, 1,298 docs/ms.
   1,000,000 docs in mean     832 ms, 1,201 docs/ms.
   5,000,000 docs in mean   7,566 ms,   660 docs/ms.
  10,000,000 docs in mean  16,482 ms,   606 docs/ms.
  20,000,000 docs in mean  39,481 ms,   506 docs/ms.
  30,000,000 docs in mean  80,293 ms,   373 docs/ms.
  40,000,000 docs in mean 109,537 ms,   365 docs/ms.

As can be seen, relative performance (docs/ms) drops significantly as the 
document count increases. To add insult to injury, this deterioration pattern 
is optimistic, as the test was the only heavy job on my computer. Running 4 of 
these tests in parallel (1 per core), we would ideally expect about the same 
speed, but instead we get

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 4 1000 10000 100000 500000 
1000000 5000000 10000000 20000000 30000000 40000000"

Starting 4 threads with extraction method pq
       1,000 docs in mean      34 ms,    29 docs/ms.
      10,000 docs in mean      70 ms,   142 docs/ms.
     100,000 docs in mean     102 ms,   980 docs/ms.
     500,000 docs in mean   1,340 ms,   373 docs/ms.
   1,000,000 docs in mean   2,564 ms,   390 docs/ms.
   5,000,000 docs in mean  19,464 ms,   256 docs/ms.
  10,000,000 docs in mean  49,985 ms,   200 docs/ms.
  20,000,000 docs in mean 112,321 ms,   178 docs/ms.
(I got tired of waiting and stopped after 20M docs)

The conclusion seems clear enough: using a PQ for millions of results will take 
a long time.


So what can be done? I added an alternative implementation where all the docIDs 
and scores are collected in two parallel arrays, then merge sorted after 
collection. That gave the results

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 1 1000 10000 100000 500000 
1000000 5000000 10000000 20000000 30000000 40000000"
Starting 1 threads with extraction method ip
       1,000 docs in mean      15 ms,    66 docs/ms.
      10,000 docs in mean      52 ms,   192 docs/ms.
     100,000 docs in mean      73 ms, 1,369 docs/ms.
     500,000 docs in mean     363 ms, 1,377 docs/ms.
   1,000,000 docs in mean     780 ms, 1,282 docs/ms.
   5,000,000 docs in mean   4,634 ms, 1,078 docs/ms.
  10,000,000 docs in mean   9,708 ms, 1,030 docs/ms.
  20,000,000 docs in mean  20,818 ms,   960 docs/ms.
  30,000,000 docs in mean  32,413 ms,   925 docs/ms.
  40,000,000 docs in mean  44,235 ms,   904 docs/ms.

Notice how the deterioration of relative speed is a lot less than for PQ. 
Running this with 4 threads gets us

 MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 4 1000 10000 100000 500000 
1000000 5000000 10000000 20000000 30000000 40000000"
Starting 4 threads with extraction method ip
       1,000 docs in mean      35 ms,    28 docs/ms.
      10,000 docs in mean     221 ms,    45 docs/ms.
     100,000 docs in mean     162 ms,   617 docs/ms.
     500,000 docs in mean     639 ms,   782 docs/ms.
   1,000,000 docs in mean   1,388 ms,   720 docs/ms.
   5,000,000 docs in mean   8,372 ms,   597 docs/ms.
  10,000,000 docs in mean  17,933 ms,   557 docs/ms.
  20,000,000 docs in mean  36,031 ms,   555 docs/ms.
  30,000,000 docs in mean  58,257 ms,   514 docs/ms.
  40,000,000 docs in mean  76,763 ms,   521 docs/ms.
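The luso project does the collect-then-sort step with a merge sort over two parallel arrays. The sketch below (hypothetical code, not the luso implementation) gets the same effect in plain Java by packing each (score, docID) pair into a single long and doing one Arrays.sort pass; the point is that all the ordering work happens once, over a flat sequentially-scanned array, instead of as millions of heap rearrangements during collection.

```java
import java.util.Arrays;

public class CollectThenSort {
    // Collect every hit, then sort once afterwards. Each (score, docID) pair
    // is packed into one long: the raw float bits (order-preserving for
    // non-negative scores) in the high half, the docID in the low half.
    static int[] sortedDocIds(float[] scores) {
        long[] packed = new long[scores.length];
        for (int doc = 0; doc < scores.length; doc++) {
            packed[doc] = ((long) Float.floatToIntBits(scores[doc]) << 32) | doc;
        }
        Arrays.sort(packed);  // single O(n log n) pass over a flat array
        int[] out = new int[packed.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (int) packed[packed.length - 1 - i];  // best score first
        }
        return out;
    }

    public static void main(String[] args) {
        int[] order = sortedDocIds(new float[]{0.1f, 0.9f, 0.5f, 0.7f, 0.3f});
        System.out.println(Arrays.toString(order));  // best-first docIDs
    }
}
```

Note the bit-packing trick only works as-is for non-negative scores (negative floats would need their bits remapped before packing), which is fine for Lucene-style similarity scores.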

The speedup of the merge sorter relative to PQ increases with the size of the 
collected result. Unfortunately we're still talking minute-class with 60M 
documents. It all points to the conclusion that collecting millions of sorted 
document IDs should be avoided if at all possible. A searchAfter that takes a 
position would either need to use some clever caching or perform the giant 
sorted collection when called.

- Toke Eskildsen
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org