As a rule of thumb, to use word embedding effectively, how big ought my corpus
to be, and how large ought each feature's vector to be?

I have been experimenting with word embedding (a la word2vec). [1] My corpus (a 
subset of the EEBO, ECCO, and Sabin collections) contains approximately 2.3 
billion words. It can be logically & easily sub-divided into smaller corpora. 
Even considering the high-performance computing resources at my disposal, 
creating a word2vec binary file is not trivial. The process requires a lot of
RAM, disk space, and CPU cycles. Once I create a word2vec binary file, I can
easily query it using the word2vec tools or a library such as the one
supported by Gensim. [2] I am getting interesting results. For example, based
on models created from different centuries of content, I can demonstrate
changes in politics as well as changes in the definition of love.
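For what it's worth, here is a minimal sketch of the sort of querying I mean,
assuming a pre-built binary file named vectors.bin (the filename is just a
placeholder) and Gensim's KeyedVectors interface:

  # query a pre-built word2vec binary with Gensim
  from gensim.models import KeyedVectors

  # load the binary produced by the word2vec tools (vectors.bin is hypothetical)
  vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

  # list the words the model considers most similar to "love"
  for word, similarity in vectors.most_similar('love', topn=10):
      print(word, round(similarity, 3))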

I want to use word embedding on smaller corpora, but I don't know how small is
too small. Nor do I have a sense of how large each feature's vector must be in
order to be useful. To what degree will word embedding work on something the
size of a novel, and if it can be effective on a document that small, then
what might be a recommended value for the vector size when creating the model?
Similarly, if my corpus is a billion words in size, then how many dimensions
ought each vector to have?
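To make the question concrete, here is a sketch of where that number enters
the picture when training with Gensim. The file name novel.txt and the value
of 100 are placeholders, not recommendations; vector_size (called size in
Gensim releases before 4.0) is the dimensionality I am asking about:

  # train a model on a small corpus, varying vector_size (values are illustrative)
  from gensim.models import Word2Vec
  from gensim.utils import simple_preprocess

  # tokenize a single small document, e.g. one novel (novel.txt is hypothetical)
  with open('novel.txt') as handle:
      sentences = [simple_preprocess(line) for line in handle if line.strip()]

  # vector_size is the dimension in question; 100 is just a starting point
  model = Word2Vec(sentences, vector_size=100, window=5, min_count=5,
                   workers=4, epochs=10)

  # inspect the result the same way as with a large corpus
  print(model.wv.most_similar('love', topn=10))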

Fun with natural language processing and machine learning.

[1] word2vec - https://github.com/tmikolov/word2vec
[2] Gensim - https://radimrehurek.com/gensim/models/word2vec.html

--
Eric Morgan
