I think that for testing performance and scalability one can also use
synthetic data; it does not have to be real-world data in the sense of
vectors generated from real-world text.
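
For the raw indexing/search throughput side, even random unit vectors would
exercise the machinery. A minimal sketch of what I mean (document count,
dimensionality and seed are arbitrary illustrative choices, and random vectors
of course won't reproduce the neighborhood structure of real embeddings):

import java.util.Random;

// Rough sketch only: random unit vectors as synthetic benchmark input.
public class SyntheticVectors {
  public static void main(String[] args) {
    int numDocs = 1_000_000; // assumed corpus size for a scalability run
    int dims = 2048;         // assumed target dimensionality
    Random random = new Random(42);
    for (int i = 0; i < numDocs; i++) {
      float[] v = new float[dims];
      double norm = 0;
      for (int d = 0; d < dims; d++) {
        v[d] = (float) random.nextGaussian();
        norm += v[d] * v[d];
      }
      float invNorm = (float) (1.0 / Math.sqrt(norm));
      for (int d = 0; d < dims; d++) {
        v[d] *= invNorm; // unit length, so dot-product similarity is well-behaved
      }
      // hand v to whatever indexing or benchmark harness is in use
    }
  }
}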
But the more people revisit performance and scalability testing the
better, and any help on this would be great!
Thanks
Michael W
On 09.04.23 at 20:43, Dawid Weiss wrote:
We do have a dataset built from Wikipedia in luceneutil. It comes in 100- and
300-dimensional varieties, and we can easily generate large numbers of
vector documents from the article data. To go higher we could concatenate
vectors from that, and I believe the performance numbers would be plausible.
Apologies - I wasn't clear - I was thinking of building 1k- or
2k-dimensional vectors that would be realistic. Perhaps using GloVe, or
perhaps some other software, but something that would accurately reflect
a true 2k-dimensional space with "real" data underneath. I am not
familiar enough with the field to tell whether a simple concatenation
is a good enough simulation - perhaps it is.
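
To make "simple concatenation" concrete, I assume it would mean something like
the sketch below: glue several of the existing 100/300-dim vectors together
and renormalize to unit length (the helper is purely illustrative, not
existing luceneutil code):

import java.util.List;

final class ConcatVectors {
  // Illustrative only: build a higher-dimensional test vector by
  // concatenating lower-dimensional ones and renormalizing.
  static float[] concat(List<float[]> parts) {
    int total = 0;
    for (float[] p : parts) {
      total += p.length;
    }
    float[] out = new float[total];
    int offset = 0;
    for (float[] p : parts) {
      System.arraycopy(p, 0, out, offset, p.length);
      offset += p.length;
    }
    double norm = 0;
    for (float v : out) {
      norm += v * v;
    }
    float invNorm = (float) (1.0 / Math.sqrt(norm));
    for (int i = 0; i < out.length; i++) {
      out[i] *= invNorm;
    }
    return out;
  }
}

So e.g. ConcatVectors.concat(List.of(v1, v2, v3, v4)) on four 300-dim vectors
gives a 1200-dim test vector; whether that space behaves like a genuine
1-2k-dimensional embedding space is exactly the open question.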
I would really prefer to focus on this kind of assessment of
feasibility/limitations rather than arguing back and forth. I did my
experiment a while ago and I can't really tell whether there have been
improvements in the indexing/merging part - your email contradicts my
experience, Mike, so I'm a bit intrigued and would like to revisit it.
But it'd be ideal to work with real vectors rather than a simulation.
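
For whoever picks this up, a minimal sketch of what such an indexing/merge
timing run could look like with Lucene's float vector fields. Field name,
paths, dimensionality and doc count are placeholders I made up, and if I
recall correctly stock Lucene 9.x still caps vector dimensions at 1024, so a
genuine 2k-dim run needs that limit raised locally:

import java.nio.file.Paths;
import java.util.Random;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class VectorIndexingBench {
  public static void main(String[] args) throws Exception {
    int dims = 1024;        // placeholder; the interesting runs are 1k-2k
    int numDocs = 100_000;  // placeholder corpus size
    Random random = new Random(0);

    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/knn-bench"));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      long start = System.nanoTime();
      for (int i = 0; i < numDocs; i++) {
        float[] v = new float[dims];
        double norm = 0;
        for (int d = 0; d < dims; d++) {
          v[d] = (float) random.nextGaussian();
          norm += v[d] * v[d];
        }
        float invNorm = (float) (1.0 / Math.sqrt(norm));
        for (int d = 0; d < dims; d++) {
          v[d] *= invNorm;
        }
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vec", v, VectorSimilarityFunction.DOT_PRODUCT));
        writer.addDocument(doc);
      }
      // forceMerge so the HNSW graph rebuild during merging is included in the timing
      writer.forceMerge(1);
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println("indexed + merged " + numDocs + " docs of dim " + dims
          + " in " + elapsedMs + " ms");
    }
  }
}

Feeding it real vectors instead of random ones is then just a matter of
replacing the generation loop with a file reader.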
Dawid
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org