I agree with Dawid,
I am +1 for those two options in combination:
* option 3 (make the limit an HNSW-specific thing). New formats may use
other limits (lower or higher).
* option 4 (make a system property with HNSW prefix). Adding the
system property must be done in the same way as the new properties for
MMapDirectory (including the access controller), so that a system
admin can deny setting it from code (see
https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
for an example). Care has to be taken that the static initializers
won't fail if system properties cannot be read/set (the system
administrator enforces the default -> see the mmap code). It also has
to be made sure that an index written with a raised limit can still be
read without the raised limit, so the limit should not be glued into
the file format. Otherwise I disagree with option 4.
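To make the static-initializer concern concrete, here is a minimal sketch of a defensive property read. The class name HnswLimits and the method readLimit are hypothetical and not Lucene API; the property name and the default of 1024 are taken from option 4. It is a simplification: it only catches SecurityException and does not wrap the read in a privileged block.

```java
// Hypothetical helper, not part of Lucene: resolves the HNSW dimension limit once,
// at class-initialization time, without ever letting the initializer fail.
final class HnswLimits {
  /** Property name as proposed in option 4. */
  static final String MAX_DIMENSIONS_SYSPROP = "lucene.hnsw.maxDimensions";

  /** Current hardcoded limit, kept as the default. */
  static final int DEFAULT_MAX_DIMENSIONS = 1024;

  /** Effective limit for this JVM. */
  static final int MAX_DIMENSIONS = readLimit(MAX_DIMENSIONS_SYSPROP, DEFAULT_MAX_DIMENSIONS);

  private HnswLimits() {}

  /**
   * Reads an integer system property and falls back to the default if the
   * property is unset, unparsable, non-positive, or not readable at all.
   */
  public static int readLimit(String property, int defaultValue) {
    try {
      // Integer.getInteger already returns the default for unset or unparsable values.
      Integer value = Integer.getInteger(property, defaultValue);
      return value > 0 ? value : defaultValue;
    } catch (SecurityException e) {
      // The administrator denied property access: enforce the default instead of failing.
      return defaultValue;
    }
  }
}
```

With this shape, starting the JVM with -Dlucene.hnsw.maxDimensions=2048 raises the limit, while a denied or malformed property silently keeps 1024.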
In short: I am fine with making it configurable only for HNSW if the
limit is not glued into the index format. The default should only be
there to prevent people from doing wrong things by default, but
changing the default should not break reading/modifying those indexes.
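As a rough illustration of keeping the limit out of the index format: the configured limit would only be checked when new vectors are written, while the read path accepts whatever dimension the index records. The class and method names below are hypothetical, and this is a sketch of the idea rather than how Lucene's codecs are structured.

```java
// Hypothetical sketch, not Lucene code: the limit guards writes only.
final class DimensionCheckSketch {
  private DimensionCheckSketch() {}

  /** Write path: reject vectors whose dimension exceeds the limit configured for this JVM. */
  public static void checkOnWrite(int dimension, int configuredLimit) {
    if (dimension <= 0) {
      throw new IllegalArgumentException("vector dimension must be positive: " + dimension);
    }
    if (dimension > configuredLimit) {
      throw new IllegalArgumentException(
          "vector dimension " + dimension + " exceeds the configured limit " + configuredLimit);
    }
  }

  /**
   * Read path: the configured limit is deliberately not consulted, so an index
   * written with a raised limit stays readable under the default.
   */
  public static int checkOnRead(int storedDimension) {
    if (storedDimension <= 0) {
      throw new IllegalStateException("invalid stored vector dimension: " + storedDimension);
    }
    return storedDimension;
  }
}
```

An index written with 2048 dimensions under a raised limit would then still open in a JVM that runs with the default of 1024; only adding further oversized vectors there would be rejected.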
Uwe
On 16.05.2023 at 15:37, Dawid Weiss wrote:
I'm for option 3 (limit at algorithm level), with the default there
tunable via property (option 4).
I understand Robert's concerns and I'd love to contribute a faster
implementation but the reality is - I can't do it at the moment. I
feel like experiments are good though and we shouldn't just ban people
from trying - if somebody changes the (sane) default and gets burned
by performance, perhaps it'll be an itch to work on speeding things up
(much like it's already happening with Jonathan's patch).
Dawid
On Tue, May 16, 2023 at 10:50 AM Alessandro Benedetti
<a.benede...@sease.io> wrote:
Hi all,
we have finalized all the options proposed by the community and we
are ready to vote for the preferred one and then proceed with the
implementation.
*Option 1*
Keep it as it is (dimension limit hardcoded to 1024)
*Motivation*:
We are close to improving on many fronts. Given the criticality of
Lucene in computing infrastructure and the concerns raised by one
of the most active stewards of the project, I think we should keep
working toward improving the feature as is and move to raise the
limit after we can demonstrate improvement unambiguously.
*Option 2*
make the limit configurable, for example through a system property
*Motivation*:
The system administrator can enforce a limit that users need to
respect and that is in line with whatever the admin decided to be
acceptable for them.
The default can stay the current one.
This should open the doors for Apache Solr, Elasticsearch,
OpenSearch, and any sort of plugin development
*Option 3*
Move the max dimension limit to a lower level, into an HNSW-specific
implementation. Once there, this limit would not bind any other
potential vector engine alternative/evolution.
*Motivation*:
There seem to be contradictory performance
interpretations about the current HNSW implementation. Some
consider its performance ok, some not, and it depends on the
target data set and use case. Increasing the max dimension limit
where it is currently (in top level FloatVectorValues) would not
allow potential alternatives (e.g. for other use-cases) to be
based on a lower limit.
*Option 4*
Make it configurable and move it to an appropriate place.
In particular, a
simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024)
should be enough.
*Motivation*:
Both are good and not mutually exclusive and could happen in any
order.
Someone suggested to perfect what the _default_ limit should be,
but I've not seen an argument _against_ configurability.
Especially in this way -- a toggle that doesn't bind Lucene's APIs
in any way.
I'll keep this [VOTE] open for a week and then proceed to the
implementation.
--------------------------
*Alessandro Benedetti*
Director @ Sease Ltd.
/Apache Lucene/Solr Committer/
/Apache Solr PMC Member/
e-mail: a.benede...@sease.io
*Sease* - Information Retrieval Applied
Consulting | Training | Open Source
Website: Sease.io <http://sease.io/>
LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
<https://twitter.com/seaseltd> | Youtube
<https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
Github <https://github.com/seaseltd>
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:u...@thetaphi.de