krickert commented on PR #15676:
URL: https://github.com/apache/lucene/pull/15676#issuecomment-3937635485

   First, these are all great questions, and they get to the heart of why I 
should increase the collection size to demonstrate the gains I saw with my 
home-lab distributed setup.
   
   tl;dr - I suspect pruning can only help when there is enough latency 
between shards and searches take over 10 ms.  My home lab, built from cheap 
machines, was a good candidate to show this, and it did show significant 
improvement. 
   
   I can rerun the same tests I showed you in the slow environment because the 
testing harness that runs on localhost is a real streaming distributed search 
harness.  But I also need to show this on a localhost multi-shard setup, where 
latency is low but calculations are high.
   
   > I don't see how this will show anything different? What is the size of 
your data set now?
   
   Small.  Too small.  From the shard dirs:
   
   - **1 shard (full index, ~73K docs):** **~290 MB** on disk  
   - **16 shards (all dirs together):** **~289 MB** total == **~18 MB per 
shard**
   
   So total index size is in the **hundreds of MB** (~0.3 GB), not 10M or 100M. 
That’s for 73K vectors × 1024 dims (float32) plus the HNSW graph.
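
   As a sanity check on those numbers (my arithmetic, not from the shard 
dirs), the raw float32 vectors alone account for nearly all of that footprint; 
the class and method names below are illustrative, not from the PR:

   ```java
   // Back-of-envelope size check: 73K vectors × 1024 dims × 4 bytes (float32).
   public class IndexSizeEstimate {
       static long rawVectorBytes(long docs, int dims) {
           return docs * dims * Float.BYTES; // 4 bytes per float32 component
       }

       public static void main(String[] args) {
           long bytes = rawVectorBytes(73_000, 1024);
           // ~285 MiB of raw vectors; the remainder of the ~290 MB on disk is
           // the HNSW graph and index metadata.
           System.out.printf("raw vectors: %.1f MiB%n", bytes / (1024.0 * 1024.0));
       }
   }
   ```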
   
   My machine has 128 GB of RAM, and the drive operates at 20 GB/s, so the 
entire index is certainly in OS disk cache - the drive is faster than the 
Raspberry Pi's memory.  That's why I need to add latency to the setup: the two 
environments test two different extremes.  The larger machine doesn't break a 
sweat with these tests (the entire run finishes in the low-ms range).  We want 
to challenge the machine with at least 250 ms queries, like I did with the 
Raspberry Pi.
   
   We’re well below 1M per shard. The idea of going 10–20x larger isn't that a 
bigger index by itself proves anything; it’s that with more work per shard per 
query (more graph to traverse), there’s more for pruning to actually cut. 
   
   The problem is that with the entire corpus in disk cache, the work we do is 
nearly instant.  That's why you're not seeing it kick in.  Collaboration 
doesn't seem to affect the overall search speed because there's a dedicated 
HTTP2 streaming connection that is always on during the search; its 
notification latency is under 1 ms.  But with collaboration turned on, I can 
show you a 50% boost in speed and a 50% savings in CPU in some situations - 
which is why I tested this on a Raspberry Pi; it forced latency and 
"simulated" a larger corpus.
   
   But when I brought the index size up and ran the searches concurrently, I 
saw it outperform the traditional search - because traditional search waits 
for the entire result set to yield when it shouldn't have to.  That's where it 
shines.
   
   I can measure the overhead to some degree.  I logged the events in the 
coordination layer and you can see the trimming live in the logs.  But don't 
believe me right now - if I just give it a large corpus, I'm convinced you'll 
see it for yourself.
   
   With 73K and a fast machine, each shard finishes so quickly that by the time 
a useful `min` is shared, others are often done - so we don’t see a difference. 
We’d expect **~1M per shard** to be enough to see whether sharing the 
`min_competitive` score helps under this setup; we simply haven't run at that 
scale yet.  I was thinking of using the court records as the data set - 11M 
court documents that, once chunked, will certainly exceed 100M chunks. 
   
   > How would [higher-latency setting] show any improvement? If sharing 
doesn’t help when latency is near zero, how would it improve when latency 
increases? That just means sharing is now more expensive.
   
   You’re right that **higher latency doesn’t make sharing better** - it makes 
it more expensive. The point wasn't that adding latency produces the 
improvement; it was that running all shards on localhost is an unrealistic 
test for a distributed search, because that near-zero latency is something you 
wouldn't see in most setups.
   
   In my tested lab environment (2.5 Gbit, more latency), we did see larger 
savings (lookups_saved) AND a large reduction in latency - but just enough to 
make me realize I need more testing (much as a zero-latency connection is 
unrealistic, so too is assuming the world will run Lucene entirely on 
Raspberry Pi machines). 
   
   So I have two setups:
   
   * one setup where the benefit was visible but which was too slow to be 
realistic - though great for simulating a high-latency environment
   * this local one where it wasn't visible (but showed no regression), 
because the machine is too fast 
    
   The interpretation I'm suggesting is that in that other setup there was 
**more work per query** (and/or slower shards), so pruning had something to 
cut; here, work per query is so small that pruning barely shows up. So 
"higher-latency setting" was shorthand for "the environment where we already 
saw the gain," not a claim that increasing latency causes the gain.
   
   So I'm trying to reproduce that kind of "more work per query" locally (e.g. 
with a much larger index) to see if the same benefit appears.
   
   
   **3. Naive sharding / orchestration**
   
   We’re testing **naive sharding** (no cluster-based routing; Lucene is the 
shard) on purpose: the question is whether **only** sharing the 
`min_competitive` score across shards helps in that setting. 
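
   For concreteness, what's being shared is just a monotonically rising score 
floor. A minimal sketch of such an accumulator (similar in spirit to the role 
Lucene's in-process `MaxScoreAccumulator` plays for collectors; the 
`GlobalFloor` name and its methods are illustrative, not the PR's actual 
classes):

   ```java
   import java.util.concurrent.atomic.AtomicLong;

   // Illustrative sketch (not the PR's code): a lock-free score floor that
   // shards can only raise, never lower.
   public class GlobalFloor {
       // Store the float's bit pattern in an AtomicLong so we can CAS on it.
       private final AtomicLong bits =
           new AtomicLong(Float.floatToIntBits(Float.NEGATIVE_INFINITY));

       /** A shard publishes its local floor; the global floor only ever rises. */
       public void raise(float localMin) {
           long cur;
           while (Float.intBitsToFloat((int) (cur = bits.get())) < localMin) {
               if (bits.compareAndSet(cur, Float.floatToIntBits(localMin))) {
                   break;
               }
           }
       }

       /** Any shard reads the best floor published so far. */
       public float floor() {
           return Float.intBitsToFloat((int) bits.get());
       }
   }
   ```

   A lower value from a slow shard is simply ignored, which is why naive 
sharding can't make results worse - only the termination decision changes.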
   
   We’re not claiming Lucene will do routing or orchestration - it never 
should.  But to have a collaborative search, exposing the min-competitive 
score is needed.  And to test it, a distributed search harness had to be 
created.
   
   We agree that more orchestration (e.g. routing clusters to shards, or more 
optimistic/segment-like search) would be more work and more communication than 
just sharing the min score - that is something I'll also measure, and it'll be 
up to the orchestration writers to do.  gRPC works great for this, though - 
I was able to code it in a few hours. But HTTP2/3 direct, a simple UDP packet, 
and more could easily be used too.  REST would be too much overhead - I tried 
it at first.
   
   The tests suggested that this overhead was minimal.  I had also added a 
ticker that only allowed a shard to share within a threshold, to prevent 
flooding, but it made the code ugly and was a premature optimization - so far 
I don't see coordination being an issue even on the fast machine.  
   
   If you look at the code, there's an initial wait before we use the shared 
value to terminate, but not before we share: we do an initial wait of **100 
visits** before we **use** the shared min to terminate, and we do **not** wait 
before we **share** our min.
   
   **Using the shared min (early termination)**  
   In `CollaborativeKnnCollector.earlyTerminated()`:
   
   ```java
   if (visitedCount() < GLOBAL_BAR_MIN_VISITS) return false;
   ```
   
   `GLOBAL_BAR_MIN_VISITS` is `100`. So we do **not** allow early termination 
based on the global floor until this shard has done at least 100 node visits. 
Until then we keep searching regardless of what others have shared.
   
   **Sharing our min**  
   In `collect()`, we call `minScoreAcc.accumulate(...)` whenever we collect a 
new result and the local floor improves (above `lastSharedScore + 0.0001f`). 
There is no minimum visit count or delay before the first share; we share as 
soon as we have a floor to share.
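
   The epsilon gate on sharing can be isolated as a tiny sketch (only the 
`+ 0.0001f` threshold comes from the code described above; `ShareGate` and its 
method names are illustrative, not the PR's actual classes):

   ```java
   // Illustrative sketch of the share-side gate: publish only when the local
   // floor improves by more than epsilon, so we don't flood the accumulator.
   public class ShareGate {
       static final float EPSILON = 0.0001f; // the "+ 0.0001f" gate in collect()
       private float lastSharedScore = Float.NEGATIVE_INFINITY;

       /** True if this local floor is worth sharing; records it if so. */
       public boolean maybeShare(float localFloor) {
           if (localFloor > lastSharedScore + EPSILON) {
               lastSharedScore = localFloor;
               return true; // caller would now call minScoreAcc.accumulate(localFloor)
           }
           return false;
       }
   }
   ```

   Note the asymmetry: sharing is gated only by the epsilon, while *using* the 
shared value is gated by the 100-visit warm-up shown earlier.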


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

