Re: [PR] Improve BytesRefHash.add performance by optimize rehash operation [lucene]

via GitHub Wed, 04 Mar 2026 06:18:10 -0800


tyronecai commented on PR #15779:
URL: https://github.com/apache/lucene/pull/15779#issuecomment-3997836126


   > Sorry for all the ideas -- we can also pinch & ship what you already 
created -- it looks like an amazing win as is -- PNP! ("progress not 
perfection").
   > 
   > I'm curious if this is needle moving for indexing overall? We can wait and 
see what nightly benchy thinks, after we merge this.
   > 
   > I had another realization: we say the `ids` array stores the fingerprint 
(high bits) of the key's hash code. Since these ids are actually ordinals 
(assigned as 0, 1, 2, 3, ... as we see each unique `BytesRef` being added), we 
have many free 0 high bits to use for fingerprint.
   > 
   > But I think it's not actually a fingerprint? I think it is the entire hash 
code (when added to position in `ids` array, the lower hash bits)? We shouldn't 
ever need to hash() on something already in the table (just the incoming key 
being added/get'd)? There must always be enough free 0 high bits in the ids, 
since they are compact ordinals, and we grow the hash table (which stores lower 
k bits of hash, then k+1 its on rehash, ...) well before id value gets to hash 
table size?
   
   I encountered several issues while trying to write the code:
   
   1. BytesRefBlockPool lacks an API for iteration.
   
   2. BytesRefBlockPool doesn't provide enough information to tell you how many 
terms are in each byte[] block. It can only calculate using the `vint`'s 
length, but this can be inaccurate. For example, if there's unused space at the 
end of the byte[] block, the data will be all 0s, resulting in multiple terms 
of length 0 which may not actually exist.
   
   3. As I mentioned before, we don't actually have the low-order bits of the 
hashcode: The old slot `i` is often the position after the probe, not the 
original hash & `oldMask`, losing the "probe displacement" information, making 
it impossible to reliably deduce that 1 bit (or even more low-order bits).
   
   I also tried creating a `hashcodes[]` member variable similar to 
`bytesStart[]` in the `BytesRefHash` class to store the hashcodes of all added 
terms. This avoids recalculating hash values during rehashing and can be used 
for comparison during `findHash`. However, benchmarking revealed that the 
benefits were not significant compared to the current modifications, and it 
also incurred additional memory overhead.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Improve BytesRefHash.add performance by optimize rehash operation [lucene]

Reply via email to