abiesps commented on issue #15197:
URL: https://github.com/apache/lucene/issues/15197#issuecomment-3339845848
OK, I have a working solution. The code is still at a very early stage (it
needs a lot of cleanup).
**Approach**
At a high level, this is what I am doing:
1. Start traversing the BKD tree as usual.
2. If a cell is inside the query:
   a) The call goes to a variation of the visitDocIDs method that checks
   isLeaf.
   b) If isLeaf is true, do not read the leaf node yet. Instead, get the
   leafOrdinal. If this leaf ordinal immediately follows the last matching
   leaf ordinal (which I store in the visitor), i.e. leafOrdinal ==
   visitor.lastMatchingLeafOrdinal() + 1, do not call prefetch, since kernel
   readahead should hopefully take care of it. Otherwise, if the match is not
   contiguous, i.e. leafOrdinal != visitor.lastMatchingLeafOrdinal() + 1, call
   prefetch on the first page at this leaf node's file pointer. Also store
   this leaf node's fp in the visitor so that the matching doc IDs can be
   visited later. For early termination, also pass the number of matching
   points to the visitor, taken from "int count = isLastLeaf() ?
   lastLeafNodePointCount : config.maxPointsInLeafNode();"
   c) If it is not a leaf node, continue the recursion as is.
3. The rest of the traversal code remains unchanged.
4. Once the traversal is complete, I call visitDocIDs on the leaf file
pointers that I stored for visiting later.
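The bookkeeping in step 2b can be sketched roughly as below. This is only an illustration of the idea, not the actual patch: the class `DeferredLeaves`, the `Prefetcher` interface, the 4096-byte page size, and the choice to always prefetch the first matching leaf are all assumptions made for the example. In Lucene the hint would presumably be issued through `IndexInput#prefetch(long offset, long length)`, but the sketch avoids any Lucene dependency so it can run on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of deferred leaf visiting; none of these names are Lucene APIs. */
public class DeferredLeafVisitSketch {

  /** Stand-in for whatever issues the prefetch hint (assumed interface). */
  interface Prefetcher {
    void prefetch(long filePointer, long length);
  }

  /** Records matching leaves during traversal so they can be visited afterwards. */
  static final class DeferredLeaves {
    private final List<Long> leafFilePointers = new ArrayList<>();
    // -2 means "no leaf matched yet": no real ordinal equals -2 + 1,
    // so the first matching leaf is always prefetched (an assumption of this sketch).
    private int lastMatchingLeafOrdinal = -2;
    private long matchingPointCount = 0;

    void onMatchingLeaf(
        int leafOrdinal, long leafFp, int pointCount, Prefetcher prefetcher, long pageSize) {
      // Contiguous leaves are left to kernel readahead; only a new run gets a hint.
      if (leafOrdinal != lastMatchingLeafOrdinal + 1) {
        prefetcher.prefetch(leafFp, pageSize);
      }
      lastMatchingLeafOrdinal = leafOrdinal;
      leafFilePointers.add(leafFp); // visited after the traversal completes
      matchingPointCount += pointCount; // running total, usable for early termination
    }

    List<Long> leafFilePointers() {
      return Collections.unmodifiableList(leafFilePointers);
    }

    long matchingPointCount() {
      return matchingPointCount;
    }
  }

  public static void main(String[] args) {
    List<Long> prefetched = new ArrayList<>();
    Prefetcher recorder = (fp, len) -> prefetched.add(fp);
    DeferredLeaves leaves = new DeferredLeaves();

    // Leaves 3, 4, 5 form one contiguous run; leaf 9 starts a new run.
    int[] ordinals = {3, 4, 5, 9};
    for (int ordinal : ordinals) {
      leaves.onMatchingLeaf(ordinal, ordinal * 4096L, 512, recorder, 4096L);
    }

    System.out.println("prefetched = " + prefetched);
    System.out.println("deferred   = " + leaves.leafFilePointers());
    System.out.println("points     = " + leaves.matchingPointCount());
  }
}
```

With the made-up ordinals in `main`, only the leaves that start a run (3 and 9) trigger a prefetch, while all four file pointers are kept for the later visitDocIDs pass.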
**Benchmark Results**
I was able to run a cold-index test. This is the benchmark I wrote
(https://github.com/abiesps/lucene-learnings/blob/main/src/main/java/com/sps/lucene/learnings/LuceneBKDTraversalPrefetchBenchmark.java),
and I am seeing the following results:
Iteration | p50 - with prefetching (nanos) | p50 - without prefetching (nanos) | p90 - with prefetching (nanos) | p90 - without prefetching (nanos) | p99 - with prefetching (nanos) | p99 - without prefetching (nanos)
-- | -- | -- | -- | -- | -- | --
1 | 1428747 | 1873421 | 3204614 | 4011134 | 6223980 | 7440390
2 | 1423662 | 1868136 | 3257577 | 3999692 | 6045371 | 7452711
3 | 1431170 | 1881138 | 3286315 | 4026663 | 6614156 | 7655345
4 | 1463966 | 1882202 | 3351880 | 4031743 | 6434328 | 7826272
5 | 1448586 | 1860455 | 3253163 | 3993159 | 5995588 | 7826855
6 | 1451429 | 1882752 | 3235150 | 4013533 | 5681142 | 7661777
7 | 1447876 | 1840063 | 3301089 | 4025598 | 6419166 | 7884995
8 | 1427931 | 1835098 | 3188111 | 4007786 | 6166556 | 7742863
9 | 1473577 | 1868620 | 3358062 | 4106067 | 7045409 | 8093894
10 | 1444055 | 1863376 | 3342137 | 3995365 | 6286142 | 7813705
Avg across all iterations | 1444099.9 | 1865526.1 | 3277809.8 | 4021074 | 6291183.8 | 7739880.7
For each iteration I ran 1000 identical queries, explicitly clearing the
page cache between the runs with and without prefetching.
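To put the averages in relative terms, the small sketch below computes the percentage reduction in latency at each percentile. The class and method names are made up for this example; the input numbers are the "Avg across all iterations" row from the table above.

```java
import java.util.Locale;

/** Illustrative helper: relative latency reduction from the averaged table row. */
public class PrefetchSpeedup {

  /** Percentage by which latency drops when prefetching is enabled. */
  static double reductionPercent(double withoutPrefetch, double withPrefetch) {
    return 100.0 * (withoutPrefetch - withPrefetch) / withoutPrefetch;
  }

  public static void main(String[] args) {
    // Averages across all iterations, in nanoseconds, taken from the table.
    System.out.printf(Locale.ROOT, "p50: %.1f%%%n", reductionPercent(1865526.1, 1444099.9));
    System.out.printf(Locale.ROOT, "p90: %.1f%%%n", reductionPercent(4021074.0, 3277809.8));
    System.out.printf(Locale.ROOT, "p99: %.1f%%%n", reductionPercent(7739880.7, 6291183.8));
  }
}
```

On these averages the reduction works out to roughly 22.6% at p50, 18.5% at p90, and 18.7% at p99.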
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]