Jamie [ja...@mailarchiva.com] wrote:
> It would be nice if, in future, the Lucene API could provide a
> searchAfter that takes a position (int).
It would not really help with large result sets. At least not with the current underlying implementations. This ties into your current performance problem, if I understand it correctly: we seem to have isolated it to large (10M+) result sets, right?

Requesting the top X results in Lucene works internally by adding hits to a Priority Queue (PQ). The problem with PQs is that they work really well for small result sets and really badly for large ones (note that 'result set' refers to the collected documents, not to the number of matching documents). A PQ rearranges its internal structure each time a hit is entered with a score >= the lowest known score. With millions of documents in the result set, this happens all the time. Abstractly there is little difference between small result sets and large ones: O(n * log n) is fine scaling. In reality, the rearrangements of the internal heap structure only work well while the heap fits in the CPU cache.
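For concreteness, a minimal sketch of PQ-based top-X collection could look like the code below, using java.util.PriorityQueue instead of Lucene's specialized HitQueue. Class names, variable names and sizes are mine, purely for illustration:

  import java.util.PriorityQueue;
  import java.util.Random;

  // Hypothetical stand-in for Lucene's ScoreDoc.
  final class Hit {
      final int docID;
      final float score;
      Hit(int docID, float score) { this.docID = docID; this.score = score; }
  }

  public class PQCollect {
      public static void main(String[] args) {
          final int topX = 1_000_000;      // size of the requested result set
          final int matches = 10_000_000;  // number of matching documents
          // Min-heap on score: the root is the lowest-scoring hit kept so far.
          PriorityQueue<Hit> pq = new PriorityQueue<>(topX,
                  (a, b) -> Float.compare(a.score, b.score));
          Random random = new Random(87);
          for (int docID = 0; docID < matches; docID++) {
              float score = random.nextFloat();
              if (pq.size() < topX) {
                  pq.add(new Hit(docID, score));   // sift-up into the heap
              } else if (score >= pq.peek().score) {
                  pq.poll();                       // evict the lowest hit
                  pq.add(new Hit(docID, score));   // sift the new one in
              }
              // Each sift touches ~log2(topX) entries scattered across the
              // backing array; once the heap outgrows the CPU cache, most of
              // those touches become cache misses.
          }
          // Extraction then polls the hits out in ascending score order.
      }
  }

The per-hit work is only a handful of comparisons, but they land all over the heap's backing array, which is exactly what hurts once that array no longer fits in the cache.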
To test this, I created the tiny project https://github.com/tokee/luso. It simulates the workflow you described (for an extremely loose value of 'simulates'): a large result set is extracted by filling a PQ of a given size with docIDs (ints) and scores (floats) and then pulling out the ordered docIDs. Running it with different sizes shows how the PQ deteriorates on a 4-core i7 with 8MB of level 2 cache:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 1 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"

Starting 1 threads with extraction method pq
     1,000 docs in mean      15 ms,    66 docs/ms.
    10,000 docs in mean      47 ms,   212 docs/ms.
   100,000 docs in mean      65 ms, 1,538 docs/ms.
   500,000 docs in mean     385 ms, 1,298 docs/ms.
 1,000,000 docs in mean     832 ms, 1,201 docs/ms.
 5,000,000 docs in mean   7,566 ms,   660 docs/ms.
10,000,000 docs in mean  16,482 ms,   606 docs/ms.
20,000,000 docs in mean  39,481 ms,   506 docs/ms.
30,000,000 docs in mean  80,293 ms,   373 docs/ms.
40,000,000 docs in mean 109,537 ms,   365 docs/ms.

As can be seen, relative performance (docs/ms) drops significantly as the document count increases. To add insult to injury, this deterioration pattern is optimistic, as the test was the only heavy job on my computer. Running 4 of these tests in parallel (1 per core) we would ideally expect about the same speed, but instead we get:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 4 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"

Starting 4 threads with extraction method pq
     1,000 docs in mean      34 ms,    29 docs/ms.
    10,000 docs in mean      70 ms,   142 docs/ms.
   100,000 docs in mean     102 ms,   980 docs/ms.
   500,000 docs in mean   1,340 ms,   373 docs/ms.
 1,000,000 docs in mean   2,564 ms,   390 docs/ms.
 5,000,000 docs in mean  19,464 ms,   256 docs/ms.
10,000,000 docs in mean  49,985 ms,   200 docs/ms.
20,000,000 docs in mean 112,321 ms,   178 docs/ms.
(I got tired of waiting and stopped after 20M docs)

The conclusion seems clear enough: using a PQ for millions of results takes a long time. So what can be done? I added an alternative implementation where all the docIDs and scores are collected in two parallel arrays, which are then merge sorted after collection (a stripped-down sketch is attached as a PS below). That gave these results:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 1 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"

Starting 1 threads with extraction method ip
     1,000 docs in mean     15 ms,    66 docs/ms.
    10,000 docs in mean     52 ms,   192 docs/ms.
   100,000 docs in mean     73 ms, 1,369 docs/ms.
   500,000 docs in mean    363 ms, 1,377 docs/ms.
 1,000,000 docs in mean    780 ms, 1,282 docs/ms.
 5,000,000 docs in mean  4,634 ms, 1,078 docs/ms.
10,000,000 docs in mean  9,708 ms, 1,030 docs/ms.
20,000,000 docs in mean 20,818 ms,   960 docs/ms.
30,000,000 docs in mean 32,413 ms,   925 docs/ms.
40,000,000 docs in mean 44,235 ms,   904 docs/ms.

Notice how the relative speed deteriorates a lot less than for the PQ. Running this with 4 threads gets us:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 4 1000 10000 100000 500000 1000000 5000000 10000000 20000000 30000000 40000000"

Starting 4 threads with extraction method ip
     1,000 docs in mean     35 ms,    28 docs/ms.
    10,000 docs in mean    221 ms,    45 docs/ms.
   100,000 docs in mean    162 ms,   617 docs/ms.
   500,000 docs in mean    639 ms,   782 docs/ms.
 1,000,000 docs in mean  1,388 ms,   720 docs/ms.
 5,000,000 docs in mean  8,372 ms,   597 docs/ms.
10,000,000 docs in mean 17,933 ms,   557 docs/ms.
20,000,000 docs in mean 36,031 ms,   555 docs/ms.
30,000,000 docs in mean 58,257 ms,   514 docs/ms.
40,000,000 docs in mean 76,763 ms,   521 docs/ms.

The speedup of the merge sorter relative to the PQ increases with the size of the collected result. Unfortunately we are still talking minute-class with 60M documents.

It all points to the conclusion that collecting millions of sorted document IDs should be avoided if at all possible. A searchAfter that takes a position would either need some clever caching or would have to perform the giant sorted collection when called.

- Toke Eskildsen
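PS: For anyone who wants the gist of the 'ip' collector without reading the project source, it boils down to something like the sketch below. This is a stripped-down illustration with my own names, not the actual luso code:

  public class IPCollect {
      // Merge sort docs and scores in parallel, descending by score.
      static void mergeSort(int[] docs, float[] scores,
                            int[] docsTmp, float[] scoresTmp, int from, int to) {
          if (to - from <= 1) return;
          int mid = (from + to) >>> 1;
          mergeSort(docs, scores, docsTmp, scoresTmp, from, mid);
          mergeSort(docs, scores, docsTmp, scoresTmp, mid, to);
          int left = from, right = mid;
          for (int i = from; i < to; i++) {  // merge the two sorted halves
              if (left < mid && (right == to || scores[left] >= scores[right])) {
                  docsTmp[i] = docs[left]; scoresTmp[i] = scores[left++];
              } else {
                  docsTmp[i] = docs[right]; scoresTmp[i] = scores[right++];
              }
          }
          System.arraycopy(docsTmp, from, docs, from, to - from);
          System.arraycopy(scoresTmp, from, scores, from, to - from);
      }

      public static void main(String[] args) {
          final int hits = 10_000_000;
          int[] docs = new int[hits];
          float[] scores = new float[hits];
          java.util.Random random = new java.util.Random(87);
          // Collection is just two sequential array writes per hit:
          // no per-hit reordering at all.
          for (int i = 0; i < hits; i++) {
              docs[i] = i;
              scores[i] = random.nextFloat();
          }
          // One big sort at the end; the merges stream through memory mostly
          // sequentially, which is much kinder to the cache than heap sifts.
          mergeSort(docs, scores, new int[hits], new float[hits], 0, hits);
          System.out.println("Top docID: " + docs[0] + ", score: " + scores[0]);
      }
  }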