Hi Lucene Team,

In general, I have advised very strongly against our team at MongoDB modifying the Lucene source, except in scenarios where we have a strong need for a particular customization. Ultimately, people can do what they would like to do.
That being said, we have a number of customers preparing to use Lucene for dense vector search, and many language models produce embeddings with more than 1024 dimensions. I remember Michael Wechner's email <https://www.mail-archive.com/dev@lucene.apache.org/msg314281.html> about one such instance with OpenAI. I just tried to test the OpenAI model "text-similarity-davinci-001", which produces 12288-dimension vectors.

It seems that customers who attempt to use these models should not be turned away; it could be sufficient to explain the issues. The only ones I have identified are two expected ones, very slow indexing throughput and high CPU usage, plus a less well-defined risk of increased numerical error. I opened an issue <https://github.com/apache/lucene/issues/1060> and a PR <https://github.com/apache/lucene/pull/1061> for the discussion as well.

I would appreciate guidance on where we think the warning should go. Burying it in a Javadoc seems like a less than ideal experience; a warning at startup would be better.

In the PR, I increased the max limit by a factor of twenty. We should let users use the system based on their needs, even if it was not designed or optimized for the models they bring, because we need the feedback and the data from the real world. Is there something I'm overlooking from a risk standpoint?
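For concreteness, the failure a user hits today looks roughly like the sketch below. It is a minimal sketch against a Lucene 9.x build; the field name "embedding" and the zero-filled array are placeholders for real model output, and it assumes the 1024 cap is the one enforced through VectorValues.MAX_DIMENSIONS when the field type is constructed:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class HighDimVectorRepro {
  public static void main(String[] args) throws Exception {
    // 12288 dimensions, the size of text-similarity-davinci-001 embeddings.
    // Zero-filled here purely as a placeholder for real model output.
    float[] embedding = new float[12288];

    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      Document doc = new Document();
      // On a stock build this line throws IllegalArgumentException:
      // the field type rejects any dimension above the 1024 cap
      // before the vector ever reaches the writer.
      doc.add(new KnnVectorField("embedding", embedding, VectorSimilarityFunction.EUCLIDEAN));
      writer.addDocument(doc);
    }
  }
}

With the cap raised as in the PR, the same program should index successfully, subject only to the slower throughput and higher CPU usage noted above.

Best,

-- Marcus Eagan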