[ https://issues.apache.org/jira/browse/LUCENE-10404 ]


    Michael Sokolov deleted comment on LUCENE-10404:
    ------------------------------------------

was (Author: sokolov):
I tried using IntIntHashMap (mapping to 1 for visited nodes) and indeed does 
seem to be a small speedup. I haven't had a chance to run luceneutil nor look 
at profiler output, but here are some numbers from KnnGraphTester for an 
internal dataset. The numbers can be a bit noisy, but are consistently better 
for the hash map version.
h3. IntIntHashMap

{{{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}}}
{{{{0.935    0.37   10000   0       16      32      100     1566}}}}
{{{{0.965    0.49   10000   50      16      32      150     0}}}}
{{{{0.962    0.41   10000   0       16      64      100     2655}}}}
{{{{0.982    0.57   10000   50      16      64      150     0}}}}
{{{{0.941    0.38   10000   0       32      32      100     1473}}}}
{{{{0.969    0.51   10000   50      32      32      150     0}}}}
{{{{0.966    0.45   10000   0       32      64      100     2611}}}}
{{{{0.985    0.59   10000   50      32      64      150     0}}}}
{{{{0.907    0.52   100000  0       16      32      100     19850}}}}
{{{{0.940    0.72   100000  50      16      32      150     0}}}}
{{{{0.941    0.60   100000  0       16      64      100     38614}}}}
{{{{0.966    0.84   100000  50      16      64      150     0}}}}
{{{{0.916    0.55   100000  0       32      32      100     19243}}}}
{{{{0.949    0.75   100000  50      32      32      150     0}}}}
{{{{0.952    0.66   100000  0       32      64      100     38205}}}}
{{{{0.973    0.93   100000  50      32      64      150     0}}}}
{{{{0.859    0.66   1000000 0       16      32      100     273112}}}}
{{{{0.897    0.92   1000000 50      16      32      150     0}}}}

{{0.917    0.85   1000000 0       16      64      100     523325}}
{{0.946    1.06   1000000 50      16      64      150     0}}



more to come – pushed ctrl-enter instead of enter ...
h3. baseline

{{recall  latency nDoc    fanout  maxConn beamWidth       visited index ms}}
{{0.935    0.38   10000   0       16      32      100     1614}}
{{0.965    0.50   10000   50      16      32      150     0}}
{{0.962    0.45   10000   0       16      64      100     2687}}
{{0.982    0.57   10000   50      16      64      150     0}}
{{0.941    0.40   10000   0       32      32      100     1504}}
{{0.969    0.51   10000   50      32      32      150     0}}
{{0.966    0.44   10000   0       32      64      100     2652}}
{{0.985    0.58   10000   50      32      64      150     0}}
{{0.907    0.54   100000  0       16      32      100     21449}}
{{0.940    0.74   100000  50      16      32      150     0}}
{{0.941    0.64   100000  0       16      64      100     39962}}
{{0.966    0.88   100000  50      16      64      150     0}}
{{0.916    0.59   100000  0       32      32      100     20554}}
{{0.949    0.80   100000  50      32      32      150     0}}
{{0.952    0.72   100000  0       32      64      100     40980}}
{{0.973    1.04   100000  50      32      64      150     0}}
{{0.859    0.75   1000000 0       16      32      100     300514}}
{{0.897    0.96   1000000 50      16      32      150     0}}
{{0.917    0.84   1000000 0       16      64      100     563259}}
{{0.946    1.12   1000000 50      16      64      150     0}}
{{0.874    0.86   1000000 0       32      32      100     303186}}
{{0.913    1.09   1000000 50      32      32      150     0}}
{{0.929    1.04   1000000 0       32      64      100     580725}}
{{0.958    1.38   1000000 50      32      64      150     0}}

> Use hash set for visited nodes in HNSW search?
> ----------------------------------------------
>
>                 Key: LUCENE-10404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10404
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Julie Tibshirani
>            Priority: Minor
>
> While searching each layer, HNSW tracks the nodes it has already visited 
> using a BitSet. We could look into using something like IntHashSet instead. I 
> tried out the idea quickly by switching to IntIntHashMap (which has already 
> been copied from hppc) and saw an improvement in index performance. 
> *Baseline:* 760896 msec to write vectors
> *Using IntIntHashMap:* 733017 msec to write vectors
> I noticed search performance actually got a little bit worse with the change 
> -- that is something to look into.
> For background, it's good to be aware that HNSW can visit a lot of nodes. For 
> example, on the glove-100-angular dataset with ~1.2 million docs, HNSW search 
> visits ~1000 - 15,000 docs depending on the recall. This number can increase 
> when searching with deleted docs, especially if you hit a "pathological" case 
> where the deleted docs happen to be closest to the query vector.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to