I'm also in favor of raising this limit. We do see some datasets with more than 1024 dimensions. I also think we need to keep a limit: for example, we currently need to keep all the vectors in RAM while indexing, and we want to be able to support reasonable numbers of vectors in an index segment. Also, we don't know what innovations might come down the road. Maybe someday we want to do product quantization and enforce that (k, m) both fit in a byte -- we wouldn't be able to do that if a vector's dimension were to exceed 32K.
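
To make that last point concrete, here is a very rough sketch (none of this exists in Lucene today; the method name and the sub-vector size are made up purely for illustration):

// Illustration only, not Lucene code. Product quantization splits a d-dimensional
// vector into m sub-vectors and encodes each one as the index of its nearest
// centroid out of k. If both k and m must fit in a byte, then k <= 256 and
// m <= 256, which in turn bounds the vector dimension at 256 * subVectorDims.
static void checkProductQuantizationFitsInBytes(int dimension, int subVectorDims, int k) {
  int m = (dimension + subVectorDims - 1) / subVectorDims; // number of sub-quantizers
  if (k > 256 || m > 256) {
    throw new IllegalArgumentException(
        "PQ parameters k=" + k + ", m=" + m + " no longer fit in a byte");
  }
  // e.g. with 128-dimension sub-vectors, m <= 256 caps the dimension at 256 * 128 = 32K
}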
On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti
<a.benede...@sease.io> wrote:
I am also curious what the worst-case scenario would be if we removed the constant altogether (so the limit automatically becomes Java's Integer.MAX_VALUE).

i.e. right now, if you exceed the limit, you get:
if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
  throw new IllegalArgumentException(
      "cannot index vectors with dimension greater than " + ByteVectorValues.MAX_DIMENSIONS);
}
in relation to:

"These limits allow us to better tune our data structures, prevent overflows, help ensure we have good test coverage, etc."
I agree 100%, especially about typing things properly and avoiding resource waste here and there, but I am not entirely sure that is the case for the current implementation, i.e. do we have optimizations in place that assume the max dimension to be 1024?

If I missed them (and I likely have), then of course I suggest the contribution should not just blindly remove the limit, but do it appropriately.
I am not in favor of just doubling it as suggested by some people; I would ideally prefer a solution that holds up for a decent amount of time, rather than having to modify it every time someone requires a higher limit.
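
For illustration only, something along these lines is what I have in mind (a minimal sketch; the property name and default are made up and not an actual API proposal):

// Hypothetical sketch, not Lucene code: keep a conservative default, but let
// expert users raise the ceiling without recompiling, instead of hard-coding it.
private static final int DEFAULT_MAX_DIMENSIONS = 1024;
private static final int MAX_DIMENSIONS =
    Integer.getInteger("org.apache.lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

static void checkDimension(int dimension) {
  if (dimension > MAX_DIMENSIONS) {
    throw new IllegalArgumentException(
        "cannot index vectors with dimension " + dimension
            + ": the configured maximum is " + MAX_DIMENSIONS);
  }
}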
Cheers
--------------------------
Alessandro Benedetti
Director @ Sease Ltd.
Apache Lucene/Solr Committer
Apache Solr PMC Member

e-mail: a.benede...@sease.io

Sease - Information Retrieval Applied
Consulting | Training | Open Source

Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>
On Fri, 31 Mar 2023 at 16:12, Michael Wechner
<michael.wech...@wyona.com> wrote:
OpenAI reduced their size to 1536 dimensions (https://openai.com/blog/new-and-improved-embedding-model), so 2048 would work :-)

But other services do also provide higher dimensions, sometimes with slightly better accuracy.

Thanks

Michael
On 31.03.23 at 14:45, Adrien Grand wrote:
> I'm supportive of bumping the limit on the maximum dimension for vectors to something that is above what the majority of users need, but I'd like to keep a limit. We have limits for other things like the max number of docs per index, the max term length, the max number of dimensions of points, etc. and there are a few things that we don't have limits on that I wish we had limits on. These limits allow us to better tune our data structures, prevent overflows, help ensure we have good test coverage, etc.
>
> That said, these other limits we have in place are quite high. E.g. the 32kB term limit: nobody would ever type a 32kB term in a text box. Likewise for the max of 8 dimensions for points: a segment cannot possibly have 2 splits per dimension on average if it doesn't have 512*2^(8*2)=34M docs, a sizable dataset already, so more dimensions than 8 would likely defeat the point of indexing. In contrast, our limit on the number of dimensions of vectors seems to be under what some users would like, and while I understand the performance argument against bumping the limit, it doesn't feel to me like something that would be so bad that we need to prevent users from using numbers of dimensions in the low thousands, e.g. top-k KNN searches would still look at a very small subset of the full dataset.
>
> So overall, my vote would be to bump the limit to 2048 as suggested by Mayya on the issue that you linked.
>
> On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>> Thanks Alessandro for summarizing the discussion below!
>>
>> I understand that there is no clear reasoning re what the best embedding size is, but I think heuristic approaches like the one described at the following link can be helpful
>>
>> https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
>>
>> Having said this, we see various embedding services providing higher dimensions than 1024, like for example OpenAI, Cohere and Aleph Alpha.
>>
>> And it would be great if we could run benchmarks without having to recompile Lucene ourselves.
>>
>> Therefore I would suggest either increasing the limit or, even better, removing the limit and adding a disclaimer that people should be aware of possible crashes etc.
>>
>> Thanks
>>
>> Michael
>>
>> On 31.03.23 at 11:43, Alessandro Benedetti wrote:
>>
>> I've been monitoring various discussions on pull requests about changing the max number of dimensions allowed for Lucene HNSW vectors:
>>
>> https://github.com/apache/lucene/pull/12191
>>
>> https://github.com/apache/lucene/issues/11507
>>
>> I would like to set up a discussion and potentially a vote about this.
>>
>> I have seen some strong opposition from a few people, but a majority in favor of this direction.
>>
>>
>> Motivation
>>
>> We were discussing some neural search integrations in Solr with Ishan Chattopadhyaya, Marcus Eagan, and David Smiley in the Solr Slack channel: https://github.com/openai/chatgpt-retrieval-plugin
>>
>>
>> Proposal
>>
>> No hard limit at all.
>>
>> As in many other Lucene areas, users will be allowed to push the system to the limit of their resources and get terrible performance or crashes if they want.
>>
>>
>> What we are NOT discussing
>>
>> - Quality and scalability of the HNSW algorithm
>>
>> - dimensionality reduction
>>
>> - strategies to fit in an arbitrary self-imposed limit
>>
>>
>> Benefits
>>
>> - users can use the models they want to generate vectors
>>
>> - removal of an arbitrary limit that blocks some integrations
>>
>>
>> Cons
>>
>> - if you go for vectors with high dimensions, there's no guarantee you get acceptable performance for your use case
>>
>>
>>
>> I want to keep it simple: right now, in many Lucene areas, you can push the system to unacceptable performance or crashes.
>>
>> For example, we don't limit the number of docs per index to an arbitrary maximum of N; you push in as many docs as you like, and if they are too many for your system, you get terrible performance/crashes/whatever.
>>
>>
>> Limits caused by primitive Java types will stay there behind the scenes, and that's acceptable, but I would prefer not to have arbitrary hard-coded ones that may limit the software's usability and integration, which is extremely important for a library.
>>
>>
>> I strongly encourage people to add benefits and cons that I missed (I am sure I missed some of them, but I wanted to keep it simple).
>>
>>
>> Cheers
>>
>> --------------------------
>> Alessandro Benedetti
>> Director @ Sease Ltd.
>> Apache Lucene/Solr Committer
>> Apache Solr PMC Member
>>
>> e-mail: a.benede...@sease.io
>>
>>
>> Sease - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io
>> LinkedIn | Twitter | Youtube | Github
>>
>>
>