Hi,
I don't understand the whole discussion here and I fully agree with
Robert. As of now it IS possible to change the maximum vector dimensions
by defining your own codec with a few lines of Java code. Solr is doing
that today. This approach is IMHO perfectly ok for backwards
compatibility, easy to do and allows people to kill their CPU and
hardware as they like:
You just need a wrapper for the vectors format and glue that into the codec:
* https://github.com/apache/solr/blob/3aa6aa2085ac3ec5b90d181a7db7577c57318d4a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L149-L178
* https://github.com/apache/solr/blob/3aa6aa2085ac3ec5b90d181a7db7577c57318d4a/solr/core/src/java/org/apache/solr/core/SchemaCodecFactory.java#L125-L139
This has several nice features:
* The default codec isn't changed, so there is no backwards compatibility issue.
* The user must make sure the codec is constructed correctly. In case
of Lucene codec updates they must apply the correct version numbers.
This ensures people know what they are doing!
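A minimal sketch of the "wrapper glued into a codec" idea, written against plain Java only. Every type here (VectorsFormat, DefaultFormat, RaisedLimitFormat, CustomCodec) is a hypothetical stand-in invented for illustration and is not the Lucene or Solr API; the real wiring is in the Solr links above. It only shows the shape: the default format stays untouched, and a user-defined codec opts into a different limit.

```java
import java.util.Objects;

public class DelegatingFormatSketch {

    /** Hypothetical stand-in for a vectors format; NOT a Lucene type. */
    interface VectorsFormat {
        String name();
        int maxDimensions();
    }

    /** Hypothetical default format carrying the hardcoded limit of 1024. */
    static final class DefaultFormat implements VectorsFormat {
        @Override public String name() { return "DefaultHnsw"; }
        @Override public int maxDimensions() { return 1024; }
    }

    /** Wrapper: delegates to the wrapped format but reports its own limit. */
    static final class RaisedLimitFormat implements VectorsFormat {
        private final VectorsFormat delegate;
        private final int maxDimensions;

        RaisedLimitFormat(VectorsFormat delegate, int maxDimensions) {
            if (maxDimensions <= 0) {
                throw new IllegalArgumentException("maxDimensions must be positive");
            }
            this.delegate = Objects.requireNonNull(delegate, "delegate");
            this.maxDimensions = maxDimensions;
        }

        @Override public String name() { return delegate.name(); }
        @Override public int maxDimensions() { return maxDimensions; }
    }

    /** Hypothetical codec: a user-chosen name glued to one vectors format. */
    static final class CustomCodec {
        private final String name;
        private final VectorsFormat vectorsFormat;

        CustomCodec(String name, VectorsFormat vectorsFormat) {
            this.name = Objects.requireNonNull(name, "name");
            this.vectorsFormat = Objects.requireNonNull(vectorsFormat, "vectorsFormat");
        }

        String name() { return name; }

        /** Accepts a vector only if it fits the limit of this codec's format. */
        boolean accepts(int dimension) {
            return dimension > 0 && dimension <= vectorsFormat.maxDimensions();
        }
    }

    public static void main(String[] args) {
        CustomCodec stock = new CustomCodec("Stock", new DefaultFormat());
        CustomCodec custom =
            new CustomCodec("MyCodec", new RaisedLimitFormat(new DefaultFormat(), 2048));
        System.out.println(stock.accepts(1536));  // false: default limit is 1024
        System.out.println(custom.accepts(1536)); // true: wrapper raised the limit
    }
}
```

The point of the design is that the higher limit is tied to a codec the user named and constructed, not to the default one.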
I agree this is a bit of overhead for the implementor, but it still allows
advanced users to change it. Basically they define their own codec,
that's what Robert wants. As a simplification we may optionally add some
easier ways to define a codec with less boilerplate code, but that's
unrelated to the current discussion. I'd like to see some builder pattern
to create a codec with a custom name.
Please stop arguing about all these limits!
Uwe
On 17.05.2023 at 04:58, Robert Muir wrote:
My problem is that it impacts the default codec which is supported by
our backwards compatibility policy for many years. We can't just let
the user determine backwards compatibility with a sysprop. How will
checkindex work? We have to have bounds and also allow for more
performant implementations that might have different limitations. And
I'm pretty sure we want a faster implementation than what we have in
the future, and it will probably have different limits.
For other codecs, it is fine to have a different limit as I already
said, as it is implementation dependent. And honestly the stuff in
lucene/codecs can be more "Fast and loose" because it doesn't require
the extensive index back compat guarantee.
Again, penultimate concern is that index back compat guarantee. When
it comes to limits, the proper way is not to just keep bumping them
without technical reasons, instead the correct approach is to fix the
technical problems and make them irrelevant. Great example here
(merged this morning):
https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645
On Tue, May 16, 2023 at 10:49 PM David Smiley <[email protected]> wrote:
Robert, I have not heard from you (or anyone) an argument against
System property based configurability (as I described in Option 4
via a System property). Uwe notes wisely some care must be taken
to ensure it actually works. Sure, of course. What concerns do
you have with this?
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
On Tue, May 16, 2023 at 9:50 PM Robert Muir <[email protected]> wrote:
By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED
to the HNSW-specific code.
This way, someone can write an alternative codec with vectors
using some other completely different approach that
incorporates a different more appropriate limit (maybe lower,
maybe higher) depending upon their tradeoffs. We should
encourage this as I think it is the "only true fix" to the
scalability issues: use a scalable algorithm! Also,
alternative codecs don't force the project into many years of
index backwards compatibility, which is really my penultimate
concern. We can lock ourselves into a truly bad place and
become irrelevant (especially with scalar code implementing
all this vector stuff, it is really senseless). In the
meantime I suggest we try to reduce pain for the default codec
with the current implementation if possible. If it is not
possible, we need a new codec that performs.
On Tue, May 16, 2023 at 8:53 PM Robert Muir <[email protected]>
wrote:
Gus, I think I explained myself multiple times on issues
and in this thread. The performance is unacceptable,
everyone knows it, but nobody is talking about it.
I don't need to explain myself time and time again here.
You don't seem to understand the technical issues (at
least you sure as fuck don't know how service loading
works or you wouldn't have opened
https://github.com/apache/lucene/issues/12300 😂)
I'm just the only one here completely unconstrained by any
of silicon valley's influences to speak my true mind,
without any repercussions, so I do it. Don't give any
fucks about ChatGPT.
I'm standing by my technical veto. If you bypass it, I'll
revert the offending commit.
As far as fixing the technical performance, I just opened
an issue with some ideas to at least improve cpu usage by
a factor of N. It does not help with the crazy heap memory
usage or other issues of KNN implementation causing shit
like OOM on merge. But it is one step:
https://github.com/apache/lucene/issues/12302
On Tue, May 16, 2023 at 7:45 AM Gus Heck
<[email protected]> wrote:
Robert,
Can you explain in clear technical terms the standard
that must be met for performance? A benchmark that
must run in X time on Y hardware for example (and why
that test is suitable)? Or some other reproducible
criteria? So far I've heard you give an *opinion* that
it's unusable, but that's not a technical criterion;
others may have a different concept of what is usable
to them.
Forgive me if I misunderstand, but the essence of your
argument has seemed to be
"Performance isn't good enough, therefore we should
force anyone who wants to experiment with something
bigger to fork the code base to do it"
Thus, it is necessary to have a clear
unambiguous standard that anyone can verify for "good
enough". A clear standard would also focus efforts at
improvement.
Where are the goal posts?
FWIW I'm +1 on any of 2-4 since I believe the
existence of a hard limit is fundamentally
counterproductive in an open source setting, as it
will lead to *fewer people* pushing the limits.
Extremely few people are going to get into the
nitty-gritty of optimizing things unless they are
staring at code that they can prove does something
interesting, but doesn't run fast enough for their
purposes. If people hit a hard limit, more of them
give up and never develop the code that will motivate
them to look for optimizations.
-Gus
On Tue, May 16, 2023 at 6:04 AM Robert Muir
<[email protected]> wrote:
I still feel -1 (veto) on increasing this limit.
sending more emails does not change the technical
facts or make the veto go away.
On Tue, May 16, 2023 at 4:50 AM Alessandro
Benedetti <[email protected]> wrote:
Hi all,
we have finalized all the options proposed by
the community and we are ready to vote for the
preferred one and then proceed with the
implementation.
*Option 1*
Keep it as it is (dimension limit hardcoded to
1024)
*Motivation*:
We are close to improving on many fronts.
Given the criticality of Lucene in computing
infrastructure and the concerns raised by one
of the most active stewards of the project, I
think we should keep working toward improving
the feature as is and move to raise the limit
after we can demonstrate improvement
unambiguously.
*Option 2*
make the limit configurable, for example
through a system property
*Motivation*:
The system administrator can enforce a limit
that its users need to respect, in line
with whatever the admin decided to be
acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr,
Elasticsearch, OpenSearch, and any sort of
plugin development
*Option 3*
Move the max dimension limit to a lower level, into an
HNSW-specific implementation. Once there, this
limit would not bind any other potential
vector engine alternative/evolution.
*Motivation*: There seem to be contradictory
performance interpretations about the current
HNSW implementation. Some consider its
performance ok, some not, and it depends on
the target data set and use case. Increasing
the max dimension limit where it is currently
(in top level FloatVectorValues) would not
allow potential alternatives (e.g. for other
use-cases) to be based on a lower limit.
*Option 4*
Make it configurable and move it to an
appropriate place.
In particular, a
simple Integer.getInteger("lucene.hnsw.maxDimensions",
1024) should be enough.
*Motivation*:
Both changes (making the limit configurable and moving it)
are good, not mutually exclusive, and could happen in any order.
Someone suggested to perfect what the
_default_ limit should be, but I've not seen
an argument _against_ configurability.
Especially in this way -- a toggle that
doesn't bind Lucene's APIs in any way.
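As a rough sketch of what the Option 4 toggle could look like in isolation: the property name and the default of 1024 come from the option text above, while the class name, the helper methods, and the rejection of non-positive values are assumptions made for illustration only. Integer.getInteger falls back to the default when the property is unset or cannot be parsed as an integer.

```java
public class MaxDimensionsProperty {

    /** Property name as proposed in Option 4. */
    static final String PROPERTY = "lucene.hnsw.maxDimensions";

    /** Current hardcoded limit, used as the default. */
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    /**
     * Reads the limit from the system property. Unset or unparseable values
     * yield the default. Rejecting non-positive values is an assumption of
     * this sketch, not something the proposal specifies.
     */
    static int resolveMaxDimensions() {
        int value = Integer.getInteger(PROPERTY, DEFAULT_MAX_DIMENSIONS);
        if (value <= 0) {
            throw new IllegalArgumentException(
                PROPERTY + " must be positive, got " + value);
        }
        return value;
    }

    /** Hypothetical check a vectors implementation could run per field. */
    static void checkDimension(int dimension, int maxDimensions) {
        if (dimension <= 0 || dimension > maxDimensions) {
            throw new IllegalArgumentException(
                "vector dimension " + dimension + " not in [1, " + maxDimensions + "]");
        }
    }

    public static void main(String[] args) {
        int max = resolveMaxDimensions();
        System.out.println("max dimensions = " + max);
        checkDimension(768, max); // passes with the default limit
    }
}
```

It could be started with, for example, `java -Dlucene.hnsw.maxDimensions=2048 MaxDimensionsProperty`. Note that a value read this way applies to the whole JVM and is not recorded anywhere in the index itself.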
I'll keep this [VOTE] open for a week and then
proceed to the implementation.
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/
e-mail: [email protected]
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn
<https://linkedin.com/company/sease-ltd> |
Twitter <https://twitter.com/seaseltd> |
Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>
--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:[email protected]