> We do have a dataset built from Wikipedia in luceneutil. It comes in 100 and
> 300 dimensional varieties and can easily enough generate large numbers of
> vector documents from the articles data. To go higher we could concatenate
> vectors from that and I believe the performance numbers would be plausible.
Apologies - I wasn't clear. I was thinking of building 1k- or 2k-dimensional vectors that would be realistic, perhaps using GloVe or perhaps some other software, but something that would accurately reflect a true 2k-dimensional space with "real" data underneath. I am not familiar enough with the field to tell whether a simple concatenation is a good enough simulation - perhaps it is.

I would really prefer to focus on doing this kind of assessment of feasibility/limitations rather than arguing back and forth. I did my experiment a while ago and I can't really tell whether there have been improvements in the indexing/merging part - your email contradicts my experience, Mike, so I'm a bit intrigued and would like to revisit it. But it would be ideal to work with real vectors rather than a simulation.

Dawid
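P.S. For concreteness, here is a minimal sketch of what the concatenation approach could look like (assuming numpy and a hypothetical "glove-300d.npy" file of 300d Wikipedia/GloVe vectors; the actual luceneutil dataset layout may differ):

    import numpy as np

    # Load a matrix of 300d vectors, one row per document.
    # "glove-300d.npy" is a hypothetical file name for illustration.
    base = np.load("glove-300d.npy")          # shape: (num_docs, 300)

    # Simulate ~2k-dimensional vectors by concatenating several base
    # vectors per document. Shuffling rows for each copy avoids simply
    # repeating the same vector seven times, but every coordinate is
    # still drawn from the same 300d distribution.
    rng = np.random.default_rng(42)
    copies = [base[rng.permutation(len(base))] for _ in range(7)]
    synthetic = np.concatenate(copies, axis=1)  # shape: (num_docs, 2100)

    # Renormalize if the index expects unit vectors (common for
    # cosine/dot-product similarity in Lucene's KNN search).
    synthetic /= np.linalg.norm(synthetic, axis=1, keepdims=True)

Whether vectors built this way behave like samples from a true 2k-dimensional space is exactly the open question above, of course - the coordinates are real, but the cross-block correlation structure is synthetic.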