[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566220#comment-17566220 ]

Robert Muir commented on LUCENE-10577:
--------------------------------------

{quote}
I tried looking at how DocValues are handling this since there is only one 
Codec and one DocValuesFormat, which to my mind means one codec, but it 
supports many different DocValues field types. I just don't understand what you 
mean by "scaling out horizontally with more codecs"? Is this about the actual 
file formats and not the Java classes that represent them? I mean, honestly, if I 
look at Lucene90DocValuesConsumer it's exactly the sort of 
"wonder-do-it-all" thing you are calling out. Do you think that should have 
been done differently too?
{quote}

What do you mean "many" different DocValues field types? There are five. 
Originally there were four, as that was the minimum number of types needed to 
implement FieldCache's functionality; SORTED_NUMERIC was added after the fact 
to provide a multi-valued numeric type. And yes, the number should be kept 
small for the same reasons.
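
For concreteness, the five types are the values of the {{DocValuesType}} enum 
(excluding {{NONE}}). A minimal sketch, assuming the 9.x package layout:

{code:java}
import org.apache.lucene.index.DocValuesType;

public class ListDocValuesTypes {
  public static void main(String[] args) {
    // NUMERIC        - one numeric value per document
    // BINARY         - one byte[] per document
    // SORTED         - one term per document, deduplicated/ordered per segment
    // SORTED_SET     - multi-valued SORTED
    // SORTED_NUMERIC - multi-valued NUMERIC (the type added after the original four)
    for (DocValuesType type : DocValuesType.values()) {
      if (type != DocValuesType.NONE) {
        System.out.println(type);
      }
    }
  }
}
{code}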

While there is currently only "one" DocValuesFormat, that's just looking at 
the main branch and ignoring history and how we got there. Dig a little deeper: go 
back to the 8.x codebase and you see 'DirectDocValuesFormat'; go back to 7.x and 
you also see 'MemoryDocValuesFormat'; go back to 5.x and you also see three more 
spatial-related DV formats in the sandbox.

Personally, I'm glad these trappy FieldCache-like formats that load stuff up on 
the heap are gone, but it took many major releases to evolve to that point. And 
at one time the Lucene sources (not tests) had five additional implementations, 
not counting SimpleText.

So I think the docvalues case demonstrates a reasonable evolution toward maturity. 
Start out with the FieldInfo-level stuff as simple as you can, since it's *really* 
difficult to deal with back compat here, and implement experiments and the like as 
alternative codecs, so that different paths can be explored. Sure, maybe in 
Lucene 14 the vectors situation will resemble the docvalues situation from a 
maturity perspective, but I don't think it's anywhere close to that right now, so 
it's a completely wrong comparison.
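
To make the "alternative codecs" point concrete, here's a rough sketch of how an 
experiment could be plugged in without touching the default format. It assumes the 
9.x {{FilterCodec}}/{{Codec.forName}} APIs; {{QuantizingKnnVectorsFormat}} is a 
hypothetical format a contributor would supply, and "Lucene92" stands in for 
whatever the current default codec name happens to be:

{code:java}
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;

// Sketch only: wrap the current default codec and swap in an experimental
// vectors format, leaving postings, doc values, etc. untouched.
// (A real codec also needs SPI registration so segments can be read back.)
public class ExperimentalVectorsCodec extends FilterCodec {

  // Hypothetical experimental format (e.g. quantized storage); not an existing class.
  private final KnnVectorsFormat vectorsFormat = new QuantizingKnnVectorsFormat();

  public ExperimentalVectorsCodec() {
    // "Lucene92" is a placeholder for the current default codec name.
    super("ExperimentalVectorsCodec", Codec.forName("Lucene92"));
  }

  @Override
  public KnnVectorsFormat knnVectorsFormat() {
    return vectorsFormat;
  }
}

// Usage: the experiment stays opt-in per index, defaults are unchanged.
//   IndexWriterConfig config = new IndexWriterConfig(analyzer);
//   config.setCodec(new ExperimentalVectorsCodec());
{code}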

> Quantize vector values
> ----------------------
>
>                 Key: LUCENE-10577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10577
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} API handles vectors with 4-byte floating-point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that, it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
> Even so, there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte to float when computing the dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
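
To ground the discussion, here is a rough sketch of the scale-and-quantize idea 
described above (option 2): choose the scale as max(abs(min-value), abs(max-value)), 
linearly scale into signed bytes, and score with an integer dot product that is 
rescaled afterwards. It assumes unit-length input vectors (as the DOT_PRODUCT 
similarity requires) and a [-127, 127] byte range; the class and method names are 
made up for illustration and are not from the attached patch:

{code:java}
// Illustrative only; not the patch attached to this issue.
public class ScalarQuantizationSketch {

  /** Scale chosen as max(abs(min-value), abs(max-value)) over the vector. */
  static float chooseScale(float[] vector) {
    float maxAbs = 0f;
    for (float v : vector) {
      maxAbs = Math.max(maxAbs, Math.abs(v));
    }
    return maxAbs;
  }

  /** Linear scaling + rounding into signed bytes: 4:1 smaller than float storage. */
  static byte[] quantize(float[] vector, float scale) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      out[i] = (byte) Math.round(vector[i] / scale * 127f);
    }
    return out;
  }

  /** Dot product computed directly on bytes (drawback 2 above), rescaled at the end. */
  static float dotProduct(byte[] a, byte[] b, float scaleA, float scaleB) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i]; // bytes promote to int, so this stays in integer arithmetic
    }
    return sum * (scaleA / 127f) * (scaleB / 127f);
  }

  public static void main(String[] args) {
    float[] v1 = {0.6f, 0.8f}; // unit-length vectors
    float[] v2 = {0.8f, 0.6f};
    float s1 = chooseScale(v1), s2 = chooseScale(v2);
    float approx = dotProduct(quantize(v1, s1), quantize(v2, s2), s1, s2);
    System.out.println("exact dot product = 0.96, quantized approximation = " + approx);
  }
}
{code}

In a per-segment scheme the chosen scale would be stored alongside the quantized 
vectors; as the description notes, re-quantizing on merge is where numerical error 
could creep in.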


