venkateshwaracholan commented on issue #16251:
URL: https://github.com/apache/lucene/issues/16251#issuecomment-4697679092

   @vsop-479  dug into this a bit further and was able to reproduce the 
behavior outside of the failing HNSW seed.
   
   Using the reported seed (`8B08865AA8BAB759:E27CE1241E7C9A82`), the generated 
graph has 111 nodes, with most nodes having degree 1 and two high-degree hubs 
(58 and 53). `computeJoinSet()` returns all 111 nodes and stops because `gTot 
>= gExit`.
   
   To understand whether this was specific to the seed or HNSW construction, I 
built a few graphs manually with `OnHeapHnswGraph` and traced 
`computeJoinSet()`. In particular, simple hub-and-spoke graphs (one hub with 
many degree-1 leaves) also produce `joinSet == graph`, even without random 
vectors or HNSW graph construction.
   
   Smallest handcrafted case I found is a 3-node hub-and-spoke graph (hub + 2 
leaves). The trace is attached, but the short version is:
   
   * `gExit = 6`
   * hub contributes `gain = 4`
   * both leaves become stale and are re-evaluated with `gain = 1`
   * the algorithm reaches `gTot >= gExit` only after selecting all three nodes
   
   Result: `joinSet = 3/3`.
   
   By contrast, graphs with more overlap (e.g. two hubs with shared leaves) 
produce much smaller join sets.
   
   From the traces, it looks like `computeJoinSet()` is optimizing for reaching 
the coverage budget (`gExit`) rather than minimizing join-set size. On 
hub-and-spoke topologies, the hubs contribute a large initial gain, but after 
stale recalculation the leaves contribute only 1 each, so the algorithm ends up 
selecting nearly every node in order to reach `gExit`.
   
   This makes me wonder whether the failing seed is exposing a graph topology 
where `joinSet == graph` is expected behavior of the current heuristic, rather 
than a merge correctness issue. The test currently asserts `j.size() < 
graph.size()`, but I couldn't find a corresponding guarantee in 
`computeJoinSet()` itself.
   
   Curious what the intended behavior is here. Should the test be relaxed, or 
is a full join set considered a bug in the heuristic?
   
   
[joinset-handcrafted-3node.log](https://github.com/user-attachments/files/28907986/joinset-handcrafted-3node.log)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to