I think concatenating word-embedding vectors is a reasonable thing to do. It captures information about the sequence of tokens that the current approach (summing them) loses. An article I found in a search, https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca, shows higher performance with a concatenative approach. So it seems to me we could take the 300-dim GloVe vectors and produce somewhat meaningful (say) 1200- or 1500-dim vectors, i.e. a window of 4 or 5 tokens, by running a sliding window over the tokens in a document and concatenating the token vectors.
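To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not what luceneutil does today: the function name `windowed_concat`, the `stride` parameter, and the choice to skip tokens that have no embedding are all assumptions of mine. With 300-dim input vectors, `window=4` yields 1200-dim vectors and `window=5` yields 1500-dim vectors.

```python
from typing import Dict, List, Sequence


def windowed_concat(
    tokens: Sequence[str],
    vectors: Dict[str, List[float]],
    window: int = 4,
    stride: int = 1,
) -> List[List[float]]:
    """Slide a window over the tokens and concatenate their vectors.

    Tokens missing from `vectors` are skipped (an assumption; one could
    also substitute a zero vector). Each output vector has
    window * dim components, where dim is the input dimension.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be >= 1")

    # Keep only the vectors of tokens we have an embedding for, in order.
    known = [vectors[t] for t in tokens if t in vectors]

    result = []
    # Stop once a full window no longer fits.
    for start in range(0, len(known) - window + 1, stride):
        concatenated: List[float] = []
        for vec in known[start:start + window]:
            concatenated.extend(vec)
        result.append(concatenated)
    return result


# Toy 3-dim "embeddings" standing in for the 300-dim GloVe vectors.
toy_vectors = {
    "the": [0.1, 0.2, 0.3],
    "quick": [0.4, 0.5, 0.6],
    "fox": [0.7, 0.8, 0.9],
}
toy_result = windowed_concat(["the", "quick", "brown", "fox"], toy_vectors, window=2)
```

In the toy run, "brown" has no embedding and is skipped, so the two windows are (the, quick) and (quick, fox), each giving a 6-dim vector. Unlike summing, swapping two tokens inside a window changes the result, which is the ordering information mentioned above.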

On Sun, Apr 9, 2023 at 2:44 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
> > We do have a dataset built from Wikipedia in luceneutil. It comes in 100
> > and 300 dimensional varieties and can easily enough generate large numbers
> > of vector documents from the articles data. To go higher we could
> > concatenate vectors from that and I believe the performance numbers would
> > be plausible.
>
> Apologies - I wasn't clear - I thought of building the 1k or 2k
> vectors that would be realistic. Perhaps using glove or perhaps using
> some other software but something that would reflect a true 2k
> dimensional space accurately with "real" data underneath. I am not
> familiar enough with the field to tell whether a simple concatenation
> is a good enough simulation - perhaps it is.
>
> I would really prefer to focus on doing this kind of assessment of
> feasibility/ limitations rather than arguing back and forth. I did my
> experiment a while ago and I can't really tell whether there have been
> improvements in the indexing/ merging part - your email contradicts my
> experience Mike, so I'm a bit intrigued and would like to revisit it.
> But it'd be ideal to work with real vectors rather than a simulation.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org