As it happens, there is in fact a formal analytical technique that yields an appropriate number of "well normalized" dimensions for word embeddings:

https://github.com/ziyin-dl/word-embedding-dimensionality-selection

That's not to say that machine learning requirements should drive SQLite; the ask is merely to expose, as a knob, a higher capacity the code already defines. Also, the correct link for [3] below is: https://github.com/facebookresearch/InferSent
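For concreteness, here is a minimal sketch of where the current default bites. It assumes a stock SQLite build (SQLITE_MAX_COLUMN left at 2000) and uses Python's stdlib sqlite3 module; the getlimit() call needs Python 3.11+, and the 4096-column table is sized like InferSent's sentence embeddings:

    # Minimal sketch, assuming a stock build with SQLITE_MAX_COLUMN=2000.
    import sqlite3

    DIMENSIONS = 4096  # e.g. one column per InferSent embedding dimension

    conn = sqlite3.connect(":memory:")

    # The runtime limit API can report or *lower* the compile-time ceiling,
    # but it can never raise it -- that is the missing knob.
    print("column limit:", conn.getlimit(sqlite3.SQLITE_LIMIT_COLUMN))  # 2000

    cols = ", ".join(f"dim_{i} REAL" for i in range(DIMENSIONS))
    try:
        conn.execute(f"CREATE TABLE embedding (token TEXT PRIMARY KEY, {cols})")
    except sqlite3.OperationalError as exc:
        print("CREATE TABLE failed:", exc)  # "too many columns on embedding"

As far as I know, the only way past that error today is rebuilding the amalgamation with something like -DSQLITE_MAX_COLUMN=8192, up to the hard upper limit quoted below -- which is exactly the compile-time dependency this request is about.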
On Sat, Jun 15, 2019 at 6:42 AM Dan Kaminsky <dan.kamin...@medal.com> wrote:

> SQLite3 has something of a normative declaration in its source code:
>
> /*
> ** This is the maximum number of
> **
> **    * Columns in a table
> **    * Columns in an index
> **    * Columns in a view
> **    * Terms in the SET clause of an UPDATE statement
> **    * Terms in the result set of a SELECT statement
> **    * Terms in the GROUP BY or ORDER BY clauses of a SELECT statement
> **    * Terms in the VALUES clause of an INSERT statement
> **
> ** The hard upper limit here is 32676. Most database people will
> ** tell you that in a well-normalized database, you usually should
> ** not have more than a dozen or so columns in any table. And if
> ** that is the case, there is no point in having more than a few
> ** dozen values in any of the other situations described above.
> */
> #ifndef SQLITE_MAX_COLUMN
> # define SQLITE_MAX_COLUMN 2000
> #endif
>
> All software has constraints fundamental to its problem set and particular
> implementation, and I would not mail you simply for deciding an
> architecture on behalf of your users. You do that every day.
>
> However, conditions have changed since (I expect) this design was
> specified. One of the more useful and usable packages for Natural Language
> Processing, Magnitude[1], leverages SQLite to efficiently handle the
> real-valued but entirely abstract collections of numbers -- vector
> spaces -- that modern machine learning depends on.
>
> Traditionally, word2vec and other language-vector approaches have fit
> their models into tens or hundreds of dimensions. Newer work captures the
> context of each word -- "Huntington" preceded by "Mr." versus "Huntington"
> preceded by "Street" or followed by "Disease". These phrase or sentence
> embeddings (as they are called) can take quite a bit more space to
> represent the necessary context. They may take 3072 columns (in the case
> of Magnitude's ELMo records[2]) or even 4096 columns (in the case of
> InferSent's sentence embeddings[3]).
>
> That exceeds SQLite's limit.
>
> These spaces, though abstract, have become the most powerful way we know
> to not merely represent, but actually discover, relationships. This is
> simply a new data domain that is the *input* to what eventually becomes
> the familiar, normalizable relational domain. It's different, but it's
> not "wrong". If it's behavior you can support -- as your 32K hard limit
> implies -- it would certainly be helpful for these user scenarios.
>
> I spent quite a bit of time hacking large-column support into a working
> Python pipeline, and I'd prefer never to run that in production.
> Converting this compile-time variable into a runtime knob would be
> appreciated.
>
> --Dan
>
> [1] https://github.com/plasticityai/magnitude
> [2] https://github.com/plasticityai/magnitude/blob/master/ELMo.md
> [3] https://github.com/plasticityai/magnitude/blob/master/ELMo.md