That's an important data point, and it doesn't seem especially bad or good.
Shouldn't acceptable performance be decided by the user? What do you all think?

On Fri, Apr 7, 2023 at 8:20 AM Michael Sokolov <msoko...@gmail.com> wrote:

> one more data point:
>
> 32M 100dim (fp32) vectors indexed in 1h20m (M=16, IW cache=1994, heap=4GB)
>
> On Fri, Apr 7, 2023 at 8:52 AM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > I also want to add that we do impose some other limits on graph
> > construction to help ensure that HNSW-based vector fields remain
> > manageable; M is limited to <= 512, and maximum segment size also
> > helps limit merge costs
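As an aside on the second knob mentioned above: the maximum merged-segment size
is controlled through the merge policy. A minimal sketch, assuming
TieredMergePolicy (Lucene's default) and an example 5 GB ceiling:

    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    public class MergeLimits {
      // Cap how large a merged segment may grow, so that no single merge has to
      // rebuild an arbitrarily large HNSW graph. 5 GB is only an example value.
      static IndexWriterConfig capSegmentSize() {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setMaxMergedSegmentMB(5 * 1024);
        return new IndexWriterConfig().setMergePolicy(mergePolicy);
      }
    }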
> >
> > On Fri, Apr 7, 2023 at 7:45 AM Michael Sokolov <msoko...@gmail.com>
> wrote:
> > >
> > > Thanks Kent - I tried something similar to what you did I think. Took
> > > a set of 256d vectors I had and concatenated them to make bigger ones,
> > > then shifted the dimensions to make more of them. Here are a few
> > > single-threaded indexing test runs. I ran all tests with M=16.
> > >
> > >
> > > 8M 100d float vectors indexed in 20 minutes (16G heap, IndexWriter
> > > buffer size=1994)
> > > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
> > > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer
> size=1994)
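For anyone wanting to reproduce runs like the ones above, here is a rough
sketch of the setup, assuming Lucene 9.5+ and reading "IW buffer size=1994" as
setRAMBufferSizeMB(1994); the index path and the vector source are placeholders:

    import java.nio.file.Paths;
    import org.apache.lucene.codecs.KnnVectorsFormat;
    import org.apache.lucene.codecs.lucene95.Lucene95Codec;
    import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class IndexVectors {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig()
            // large RAM buffer (in MB) so flushes are infrequent
            .setRAMBufferSizeMB(1994)
            // per-field HNSW format: M=16 neighbors, beamWidth=100 (the default)
            .setCodec(new Lucene95Codec() {
              @Override
              public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                return new Lucene95HnswVectorsFormat(16, 100);
              }
            });
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vector-index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
          for (float[] vector : readVectors()) {  // placeholder vector source
            Document doc = new Document();
            doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.DOT_PRODUCT));
            writer.addDocument(doc);
          }
        }
      }

      private static Iterable<float[]> readVectors() {
        throw new UnsupportedOperationException("supply your own vectors here");
      }
    }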
> > >
> > > Increasing the vector dimension makes things take longer (scaling
> > > *linearly*) but doesn't lead to RAM issues. I think we could get to
> > > OOM while merging with a small heap and a large number of vectors, or
> > > by increasing M, but none of this has anything to do with vector
> > > dimensions. Also, if merge RAM usage is a problem I think we could
> > > address it by adding accounting to the merge process and simply not
> > > merging graphs when they exceed the buffer size (as we do with
> > > flushing).
> > >
> > > Robert, since you're the only on-the-record veto here, does this
> > > change your thinking at all, or if not could you share some test
> > > results that didn't go the way you expected? Maybe we can find some
> > > mitigation if we focus on a specific issue.
> > >
> > > On Fri, Apr 7, 2023 at 5:18 AM Kent Fitch <kent.fi...@gmail.com>
> wrote:
> > > >
> > > > Hi,
> > > > I have been testing Lucene with a custom vector similarity and
> loaded 192m vectors of dim 512 bytes. (Yes, segment merges use a lot of
> Java memory...).
> > > >
> > > > As this was a performance test, the 192m vectors were derived by
> dithering 47k original vectors in such a way as to allow realistic ANN
> evaluation of HNSW.  The original 47k vectors were generated by ada-002 on
> source newspaper article text.  After dithering, I used PQ to reduce their
> dimensionality from 1536 floats to 512 bytes: 3 source dimensions to a
> 1-byte code, 512 code tables, each learnt to reduce total encoding error
> using Lloyd's algorithm (hence the need for the custom similarity). BTW,
> HNSW retrieval was accurate and fast enough for the use case I was
> investigating, as long as a machine with 128GB memory was available, since
> the graph needs to be cached in memory for reasonable query rates.
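For readers unfamiliar with the encoding Kent describes, a product-quantization
encoder along those lines looks roughly like the sketch below (plain Java, no
Lucene dependency; the codebooks are assumed to have already been trained with
Lloyd's algorithm on the 3-dimensional sub-vectors):

    /**
     * 1536-dim float vector -> 512-byte code: split into 512 sub-vectors of
     * 3 dims each, and replace every sub-vector with the index (one byte) of
     * its nearest centroid in that sub-space's 256-entry codebook.
     */
    public class PqEncoder {
      private static final int SUBSPACES = 512; // 1536 / 3
      private static final int SUB_DIM = 3;     // dimensions per code
      private static final int CENTROIDS = 256; // fits in one unsigned byte

      // codebooks[s][c] is the c-th centroid (length SUB_DIM) of sub-space s
      private final float[][][] codebooks;

      public PqEncoder(float[][][] codebooks) {
        this.codebooks = codebooks;
      }

      public byte[] encode(float[] vector) {
        byte[] codes = new byte[SUBSPACES];
        for (int s = 0; s < SUBSPACES; s++) {
          int offset = s * SUB_DIM;
          int best = 0;
          float bestDist = Float.MAX_VALUE;
          for (int c = 0; c < CENTROIDS; c++) {
            float dist = 0;
            for (int d = 0; d < SUB_DIM; d++) {
              float diff = vector[offset + d] - codebooks[s][c][d];
              dist += diff * diff;
            }
            if (dist < bestDist) {
              bestDist = dist;
              best = c;
            }
          }
          codes[s] = (byte) best; // 512 bytes total per vector
        }
        return codes;
      }
    }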
> > > >
> > > > Anyway, if you want them, you are welcome to those 47k vectors of
> 1536 floats, which can be readily dithered to generate very large and
> realistic test vector sets.
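Dithering of the kind mentioned here can be as simple as adding small random
noise to each source vector. A hypothetical sketch; the noise scale is
arbitrary and would need tuning so that nearest-neighbour structure stays
realistic:

    import java.util.Random;

    public class VectorDither {
      // Derive a "new" test vector from a real one by adding small Gaussian
      // noise, so large synthetic sets keep realistic geometry.
      static float[] dither(float[] source, Random random) {
        float[] out = new float[source.length];
        for (int i = 0; i < source.length; i++) {
          out[i] = source[i] + (float) (random.nextGaussian() * 0.01);
        }
        return out;
      }
    }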
> > > >
> > > > Best regards,
> > > >
> > > > Kent Fitch
> > > >
> > > >
> > > > On Fri, 7 Apr 2023, 6:53 pm Michael Wechner, <
> michael.wech...@wyona.com> wrote:
> > > >>
> > > >> you might want to use SentenceBERT to generate vectors
> > > >>
> > > >> https://sbert.net
> > > >>
> > > >> For example, the model "all-mpnet-base-v2" generates vectors
> with dimension 768.
> > > >>
> > > >> We have SentenceBERT running as a web service, which we could open
> for these tests, but because of network latency it should be faster running
> locally.
> > > >>
> > > >> HTH
> > > >>
> > > >> Michael
> > > >>
> > > >>
> > > >> Am 07.04.23 um 10:11 schrieb Marcus Eagan:
> > > >>
> > > >> I've started to look on the internet, and surely someone will come
> along, but the challenge, I suspect, is that these vectors are expensive to
> generate, so people have not gone all in on generating such large vectors for
> large datasets. They certainly have not made them easy to find. Here is the
> most promising one, but it is probably too small:
> https://www.kaggle.com/datasets/stephanst/wikipedia-simple-openai-embeddings?resource=download
> > > >>
> > > >>  I'm still in and out of the office at the moment, but when I
> return, I can ask my employer if they will sponsor a 10 million document
> collection so that you can test with that. Or, maybe someone from work will
> see and ask them on my behalf.
> > > >>
> > > >> Alternatively, next week, I may get some time to set up a server
> with an open-source LLM to generate the vectors. It still won't be free,
> but it would be 99% cheaper than paying the LLM companies if we can tolerate
> it being slow.
> > > >>
> > > >>
> > > >>
> > > >> On Thu, Apr 6, 2023 at 9:42 PM Michael Wechner <
> michael.wech...@wyona.com> wrote:
> > > >>>
> > > >>> Great, thank you!
> > > >>>
> > > >>> How much RAM, etc., did you run this test on?
> > > >>>
> > > >>> Do the vectors really have to be based on real data for testing the
> > > >>> indexing?
> > > >>> I understand that it matters if you want to test the quality of the
> search results, but for testing the scalability itself it should not really
> matter,
> > > >>> right?
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>> Michael
> > > >>>
> > > >>> Am 07.04.23 um 01:19 schrieb Michael Sokolov:
> > > >>> > I'm trying to run a test. I indexed 8M 100d float32 vectors in
> ~20
> > > >>> > minutes with a single thread. I have some 256d vectors, but only
> about
> > > >>> > 2M of them. Can anybody point me to a large set (say 8M+) of
> 1024+ dim
> > > >>> > vectors I can use for testing? If all else fails I can test with
> > > >>> > noise, but that tends to lead to meaningless results
> > > >>> >
> > > >>> > On Thu, Apr 6, 2023 at 3:52 PM Michael Wechner
> > > >>> > <michael.wech...@wyona.com> wrote:
> > > >>> >>
> > > >>> >>
> > > >>> >> Am 06.04.23 um 17:47 schrieb Robert Muir:
> > > >>> >>> Well, I'm asking people to actually try testing with such high
> dimensions.
> > > >>> >>> Based on my own experience, I consider it unusable. It seems
> other
> > > >>> >>> folks may have run into trouble too. If the project committers
> can't
> > > >>> >>> even really use vectors with such high dimension counts, then
> it's not
> > > >>> >>> in an OK state for users, and we shouldn't bump the limit.
> > > >>> >>>
> > > >>> >>> I'm happy to discuss/compromise etc, but simply bumping the
> limit
> > > >>> >>> without addressing the underlying usability/scalability issues is
> a real
> > > >>> >>> no-go.
> > > >>> >> I agree that this needs to be addressed.
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>>    it is not really solving anything, nor is it giving users
> any
> > > >>> >>> freedom or allowing them to do something they couldn't do
> before.
> > > >>> >>> Because if it still doesn't work, it still doesn't work.
> > > >>> >> I disagree, because it *does work* with "smaller" document sets.
> > > >>> >>
> > > >>> >> Currently we have to compile Lucene ourselves to not get the
> exception
> > > >>> >> when using a model with vector dimension greater than 1024,
> > > >>> >> which is of course possible, but not really convenient.
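For context, the exception in question fires as soon as a vector field with
more than the allowed number of dimensions is created. A minimal illustration
against a stock Lucene 9.x build (the field name and 1536-dim vector are made
up for illustration, and the exact message varies by version):

    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.VectorSimilarityFunction;

    public class DimensionLimitDemo {
      public static void main(String[] args) {
        float[] embedding = new float[1536]; // e.g. the size of an ada-002 embedding
        try {
          // With the stock 1024-dimension limit this is rejected before a
          // single document ever reaches the IndexWriter.
          new KnnFloatVectorField("embedding", embedding, VectorSimilarityFunction.COSINE);
        } catch (IllegalArgumentException e) {
          System.out.println("rejected: " + e.getMessage());
        }
      }
    }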
> > > >>> >>
> > > >>> >> As I wrote before, to resolve this discussion, I think we
> should test
> > > >>> >> and address possible issues.
> > > >>> >>
> > > >>> >> I will try to stop discussing now :-) and instead try to
> understand
> > > >>> >> the actual issues better. It would be great if others could join in
> on this!
> > > >>> >>
> > > >>> >> Thanks
> > > >>> >>
> > > >>> >> Michael
> > > >>> >>
> > > >>> >>
> > > >>> >>
> > > >>> >>> We all need to be on the same page, grounded in reality, not
> fantasy:
> > > >>> >>> if we set a limit of 1024 or 2048, you should actually be able to
> index
> > > >>> >>> vectors with that many dimensions, and it should actually work and
> scale.
> > > >>> >>>
> > > >>> >>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
> > > >>> >>> <a.benede...@sease.io> wrote:
> > > >>> >>>> As I said earlier, a hard max limit restricts usability.
> > > >>> >>>> It's not forcing users with small vectors to pay the
> performance penalty of big vectors; it's literally preventing some users from
> using Lucene/Solr/Elasticsearch at all.
> > > >>> >>>> As far as I know, the max limit is used to raise an
> exception; it's not used to initialise or optimise data structures (please
> correct me if I'm wrong).
> > > >>> >>>>
> > > >>> >>>> Improving the algorithm performance is a separate discussion.
> > > >>> >>>> I don't see how the fact that indexing billions of vectors, of
> whatever dimension, is slow correlates with a usability parameter.
> > > >>> >>>>
> > > >>> >>>> What about potential users that need just a few high-dimensional
> vectors?
> > > >>> >>>>
> > > >>> >>>> As I said before, I am a big +1 for NOT just raising it
> blindly, but I believe we need to remove the limit or size it in a way that
> it's not a problem for either users or internal data structure
> optimizations, if any.
> > > >>> >>>>
> > > >>> >>>>
> > > >>> >>>> On Wed, 5 Apr 2023, 18:54 Robert Muir, <rcm...@gmail.com>
> wrote:
> > > >>> >>>>> I'd ask anyone voting +1 to raise this limit to at least try
> to index
> > > >>> >>>>> a few million vectors with 756 or 1024, which is allowed
> today.
> > > >>> >>>>>
> > > >>> >>>>> IMO, based on how painful it is, it seems the limit is
> already too
> > > >>> >>>>> high. I realize that will sound controversial, but please at
> least try
> > > >>> >>>>> it out!
> > > >>> >>>>>
> > > >>> >>>>> Voting +1 without at least doing this is really the
> > > >>> >>>>> "weak/unscientifically minded" approach.
> > > >>> >>>>>
> > > >>> >>>>> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
> > > >>> >>>>> <michael.wech...@wyona.com> wrote:
> > > >>> >>>>>> Thanks for your feedback!
> > > >>> >>>>>>
> > > >>> >>>>>> I agree that it should not crash.
> > > >>> >>>>>>
> > > >>> >>>>>> So far we have not experienced crashes ourselves, but we have
> not indexed
> > > >>> >>>>>> millions of vectors.
> > > >>> >>>>>>
> > > >>> >>>>>> I will try to reproduce the crash; maybe this will help us
> move forward.
> > > >>> >>>>>>
> > > >>> >>>>>> Thanks
> > > >>> >>>>>>
> > > >>> >>>>>> Michael
> > > >>> >>>>>>
> > > >>> >>>>>> Am 05.04.23 um 18:30 schrieb Dawid Weiss:
> > > >>> >>>>>>>> Can you describe your crash in more detail?
> > > >>> >>>>>>> I can't. That experiment was a while ago and a quick test
> to see if I
> > > >>> >>>>>>> could index rather large-ish USPTO (patent office) data as
> vectors.
> > > >>> >>>>>>> Couldn't do it then.
> > > >>> >>>>>>>
> > > >>> >>>>>>>> How much RAM?
> > > >>> >>>>>>> My indexing jobs run with rather smallish heaps to give
> space for I/O
> > > >>> >>>>>>> buffers. Think 4-8GB at most. So yes, it could have been
> the problem.
> > > >>> >>>>>>> I recall segment merging grew slower and slower and then
> simply
> > > >>> >>>>>>> crashed. Lucene should work with low heap requirements,
> even if it
> > > >>> >>>>>>> slows down. Throwing RAM at the indexing/segment-merging
> problem
> > > >>> >>>>>>> is... I don't know - not elegant?
> > > >>> >>>>>>>
> > > >>> >>>>>>> Anyway. My main point was to remind folks about how Apache
> works -
> > > >>> >>>>>>> code is merged in when there are no vetoes. If Rob (or
> anybody else)
> > > >>> >>>>>>> remains unconvinced, he or she can block the change. (I
> didn't invent
> > > >>> >>>>>>> those rules).
> > > >>> >>>>>>>
> > > >>> >>>>>>> D.
> > > >>
> > > >> --
> > > >> Marcus Eagan

-- 
Marcus Eagan
