As it happens, there is in fact a formal analytical technique that yields
appropriate numbers of "well normalized" dimensions for word embeddings:

https://github.com/ziyin-dl/word-embedding-dimensionality-selection
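
(For the curious: my understanding of that technique -- a paraphrase, not the
repository's own wording -- is that it picks the embedding dimensionality k
that minimizes the expected "PIP loss" between the trained embedding E_k and
an oracle embedding E*, roughly

    argmin_k  E[ || E_k E_k^T - E* E*^T ||_F ]

i.e. it trades the bias of truncating dimensions against the noise of
estimating extra ones.)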

That's not to say that machine learning requirements should drive SQLite's
design; the ask is merely to expose, as a runtime knob, capacity the code
already defines.
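
For context on where that knob sits today, here is a rough sketch (mine, not
anything from SQLite's own tree): sqlite3_limit() already lets an application
move SQLITE_LIMIT_COLUMN at run time, but never above whatever
SQLITE_MAX_COLUMN was at compile time -- 2000 by default -- so getting past
2000 currently means rebuilding with something like -DSQLITE_MAX_COLUMN=32767.

#include <stdio.h>
#include <sqlite3.h>

int main(void) {
  sqlite3 *db;
  if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

  /* A negative value queries the current limit without changing it. */
  printf("column limit at open: %d\n",
         sqlite3_limit(db, SQLITE_LIMIT_COLUMN, -1));

  /* Requests above the compile-time ceiling are silently capped. */
  sqlite3_limit(db, SQLITE_LIMIT_COLUMN, 32767);
  printf("after asking for 32767: %d\n",
         sqlite3_limit(db, SQLITE_LIMIT_COLUMN, -1));

  sqlite3_close(db);
  return 0;
}

On a stock build both lines print 2000; only a rebuild moves the ceiling,
which is exactly the friction being described.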

Also, the correct link for [3] above is here:

https://github.com/facebookresearch/InferSent




On Sat, Jun 15, 2019 at 6:42 AM Dan Kaminsky <dan.kamin...@medal.com> wrote:

> Sqlite3 has something of a normative declaration in its source code:
>
> *
> ** This is the maximum number of
> **
> **    * Columns in a table
> **    * Columns in an index
> **    * Columns in a view
> **    * Terms in the SET clause of an UPDATE statement
> **    * Terms in the result set of a SELECT statement
> **    * Terms in the GROUP BY or ORDER BY clauses of a SELECT statement.
> **    * Terms in the VALUES clause of an INSERT statement
> **
> ** The hard upper limit here is 32676.  Most database people will
> ** tell you that in a well-normalized database, you usually should
> ** not have more than a dozen or so columns in any table.  And if
> ** that is the case, there is no point in having more than a few
> ** dozen values in any of the other situations described above.
> */
> #ifndef SQLITE_MAX_COLUMN
> # define SQLITE_MAX_COLUMN 2000
> #endif
>
> All software has constraints fundamental to its problem set and particular
> implementation, and I would not mail you simply for deciding an
> architecture on behalf of your users.  You do that every day.
>
> However, conditions have changed since (I expect) this design was
> specified.  One of the more useful and usable packages for Natural Language
> Processing, Magnitude[1], leverages SQLite to efficiently handle the
> real-valued but entirely abstract collections of numbers -- vector spaces --
> that modern machine learning depends on.
>
> Traditionally, word2vec and other language vector approaches have fit
> their models into tens or hundreds of dimensions.  New work is acquiring
> the context of each word -- "Huntington", preceded by "Mr.", vs.
> "Huntington" preceded by "Street" or followed by "Disease".  These phrase or
> sentence embeddings (as they're called) can take quite a bit more space to
> represent necessary context.  They may take 3072 columns (in the case of
> Magnitude's ELMo records[2]) or even 4096 columns (in the case of
> InferSent's sentence embeddings[3]).
>
> That exceeds SQLite's default column limit.
>
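
To make that concrete, here is a small sketch of the kind of schema these
column counts imply -- one REAL column per dimension.  The table and column
names are my illustration, not Magnitude's actual layout.  A 300-dimension
table (word2vec-sized) is fine under the stock 2000-column limit; the
4096-dimension case is refused.

#include <stdio.h>
#include <sqlite3.h>

/* Builds "CREATE TABLE embeddings (word TEXT PRIMARY KEY, d0 REAL, ...)". */
static int create_vector_table(sqlite3 *db, int dims) {
  static char sql[1 << 17];              /* roomy enough for 4096 columns */
  int n = snprintf(sql, sizeof sql,
                   "CREATE TABLE embeddings (word TEXT PRIMARY KEY");
  for (int i = 0; i < dims; i++) {
    n += snprintf(sql + n, sizeof sql - n, ", d%d REAL", i);
  }
  snprintf(sql + n, sizeof sql - n, ")");
  return sqlite3_exec(db, sql, 0, 0, 0);
}

int main(void) {
  sqlite3 *db;
  if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;

  printf("300 dims:  %s\n",
         create_vector_table(db, 300) == SQLITE_OK ? "ok" : "refused");
  sqlite3_exec(db, "DROP TABLE IF EXISTS embeddings", 0, 0, 0);

  printf("4096 dims: %s\n",
         create_vector_table(db, 4096) == SQLITE_OK ? "ok" : "refused");

  sqlite3_close(db);
  return 0;
}
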
> These spaces, though abstract, have become the most powerful way we know
> to not merely represent, but actually discover relationships.  This is
> simply a new data domain that is the *input* to what eventually becomes the
> familiarly normalizable relational domain.  It's different, but it's not
> "wrong".  If it's behavior you can support -- as your 32K hard limit
> implies -- it would certainly be helpful for these user scenarios.
>
> I spent quite a bit of time hacking large column support into a working
> Python pipeline, and I'd prefer never to run that in production.
> Converting this compile time variable into a runtime knob would be
> appreciated.
>
> --Dan
>
> [1] https://github.com/plasticityai/magnitude
> [2] https://github.com/plasticityai/magnitude/blob/master/ELMo.md
> [3] https://github.com/plasticityai/magnitude/blob/master/ELMo.md
>