I think concatenating word-embedding vectors is a reasonable thing to
do. It captures information about the order of the tokens, which the
current approach (summing them) loses. An article I found in a search
https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
reports higher performance with a concatenative approach. So it seems to
me we could take the 300-dim GloVe vectors and produce somewhat
meaningful (say) 1200- or 1500-dim vectors by running a sliding window
over the tokens in a document and concatenating the token vectors.
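
To make the sliding-window idea concrete, here is a rough sketch in
Python. The function name window_vectors, the toy 3-dim embedding table
and the choice to skip out-of-vocabulary tokens are all assumptions for
illustration, not anything luceneutil does today. With real 300-dim
vectors, a window of 4 or 5 tokens would give 1200 or 1500 dims.

```python
from typing import Dict, Iterator, List, Sequence


def window_vectors(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    window: int = 4,
    stride: int = 1,
) -> Iterator[List[float]]:
    """Yield one concatenated vector per run of `window` known tokens."""
    # Assumption: drop out-of-vocabulary tokens so every output vector
    # has the same width (window * embedding dimension).
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    for start in range(0, len(vecs) - window + 1, stride):
        out: List[float] = []
        for v in vecs[start:start + window]:
            out.extend(v)  # concatenation keeps token order, unlike a sum
        yield out


# Toy 3-dim table standing in for the 300-dim GloVe vectors.
toy = {
    "the": [0.1, 0.2, 0.3],
    "quick": [0.4, 0.5, 0.6],
    "brown": [0.7, 0.8, 0.9],
    "fox": [1.0, 1.1, 1.2],
    "jumps": [1.3, 1.4, 1.5],
}
doc = "the quick brown fox jumps".split()
vectors = list(window_vectors(doc, toy, window=4))
```

Five tokens with a window of 4 produce two overlapping vectors of 12
dims each. A stride equal to the window would give non-overlapping
vectors instead, at the cost of fewer vectors per document.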

On Sun, Apr 9, 2023 at 2:44 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
> > We do have a dataset built from Wikipedia in luceneutil. It comes in 100 
> > and 300 dimensional varieties and can easily enough generate large numbers 
> > of vector documents from the articles data. To go higher we could 
> > concatenate vectors from that and I believe the performance numbers would 
> > be plausible.
>
> Apologies - I wasn't clear - I thought of building the 1k or 2k
> vectors that would be realistic. Perhaps using glove or perhaps using
> some other software but something that would reflect a true 2k
> dimensional space accurately with "real" data underneath. I am not
> familiar enough with the field to tell whether a simple concatenation
> is a good enough simulation - perhaps it is.
>
> I would really prefer to focus on doing this kind of assessment of
> feasibility/limitations rather than arguing back and forth. I did my
> experiment a while ago and I can't really tell whether there have been
> improvements in the indexing/merging part - your email contradicts my
> experience, Mike, so I'm a bit intrigued and would like to revisit it.
> But it'd be ideal to work with real vectors rather than a simulation.
>
> Dawid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

