tyronecai commented on PR #15779:
URL: https://github.com/apache/lucene/pull/15779#issuecomment-3997836126
> Sorry for all the ideas -- we can also pinch & ship what you already
created -- it looks like an amazing win as is -- PNP! ("progress not
perfection").
>
> I'm curious if this is needle moving for indexing overall? We can wait and
see what nightly benchy thinks, after we merge this.
>
> I had another realization: we say the `ids` array stores the fingerprint
(high bits) of the key's hash code. Since these ids are actually ordinals
(assigned as 0, 1, 2, 3, ... as we see each unique `BytesRef` being added), we
have many free 0 high bits to use for fingerprint.
>
> But I think it's not actually a fingerprint? I think it is the entire hash
code (when added to position in `ids` array, the lower hash bits)? We shouldn't
ever need to hash() on something already in the table (just the incoming key
being added/get'd)? There must always be enough free 0 high bits in the ids,
since they are compact ordinals, and we grow the hash table (which stores lower
k bits of hash, then k+1 its on rehash, ...) well before id value gets to hash
table size?
I encountered several issues while trying to write the code:
1. BytesRefBlockPool lacks an API for iteration.
2. BytesRefBlockPool doesn't provide enough information to tell you how many
terms are in each byte[] block. It can only calculate using the `vint`'s
length, but this can be inaccurate. For example, if there's unused space at the
end of the byte[] block, the data will be all 0s, resulting in multiple terms
of length 0 which may not actually exist.
3. As I mentioned before, we don't actually have the low-order bits of the
hashcode: The old slot `i` is often the position after the probe, not the
original hash & `oldMask`, losing the "probe displacement" information, making
it impossible to reliably deduce that 1 bit (or even more low-order bits).
I also tried creating a `hashcodes[]` member variable similar to
`bytesStart[]` in the `BytesRefHash` class to store the hashcodes of all added
terms. This avoids recalculating hash values during rehashing and can be used
for comparison during `findHash`. However, benchmarking revealed that the
benefits were not significant compared to the current modifications, and it
also incurred additional memory overhead.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]