Raul,

Well, one way to start word2vec is to assign a Boolean vector to each word, with a single 1 in a different position for each unique word. That's why it's called 'one-hot' embedding. However, after training the word set with a shallow, two-layer neural network, and after significant dimensionality reduction, *then* you get the final dense word vectors you describe.
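
As a minimal sketch of that one-hot starting point in J (the toy vocabulary and the names vocab and onehot are made up for the example):

   vocab  =: ~. ;: 'the cat sat on the mat'   NB. unique boxed words of a tiny corpus
   onehot =: = i. # vocab                     NB. one row per word, a 1 in a different place
   onehot
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
   vocab = <'sat'                             NB. the one-hot vector for a single word
0 0 1 0 0

Each vector is as long as the vocabulary, which is why the trained, much lower-dimensional vectors are what you actually work with afterwards.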

Another way to start off word2vec is to assign random values to the vector elements for each word, and then go through the same neural-network training and dimensionality reduction. Both ways get reasonable results. The good news is that all this work has already been done on general text corpora and open-sourced using the three methods I mentioned earlier. Of course, if you want a domain-specific similarity matrix, you will have to do the work yourself.

The data I described in my "File Cleanup" post is from the "Numberbatch" word-similarity data. Once I get that data into a form that I can work on in J, we'll have some fun. Here are the Numberbatch files (version 17.06):

  Multilingual:  numberbatch-17.06.txt.gz
                 <https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.06.txt.gz>
  English-only:  numberbatch-en-17.06.txt.gz
                 <https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz>
  HDF5:          17.06/mini.h5
                 <http://conceptnet.s3.amazonaws.com/precomputed-data/2016/numberbatch/17.06/mini.h5>

Skip

Skip Cave
Cave Consulting LLC
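
Once the English-only file is unzipped it is plain text, so a rough sketch of getting it into J might look like the following. This is only a sketch under a few assumptions: that the file follows the usual word2vec text layout (a header line giving the counts, then one word per line followed by its space-separated vector components, with LF line endings), and all the names here are made up for the example.

   require 'files'                      NB. fread reads a file as one text string
   txt   =: fread 'numberbatch-en-17.06.txt'
   lines =: }. <;._2 txt                NB. box each LF-terminated line; }. drops the header line
   word  =: {.~ i.&' '                  NB. the text before the first space
   nums  =: ".@(}.~ >:@i.&' ')          NB. the numbers after it
   words =: word&.> lines               NB. boxed word labels
   mat   =: nums&> lines                NB. one numeric row per word
   cos   =: (+/ .*) % *&(+/&.:*:)       NB. cosine similarity of two vectors
   NB. the ten words nearest 'tea' (the first will be 'tea' itself)
   v     =: (words i. <'tea') { mat
   words {~ 10 {. \: (v&cos)"1 mat

For the full English-only file, mat ends up with a few hundred thousand rows of about 300 numbers each, so it is large but still workable in memory; once it is a plain numeric matrix, the distance verbs further down the thread apply to it directly at rank 1.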

On Wed, Feb 21, 2018 at 2:30 AM, Raul Miller <[email protected]> wrote:

> Skip,
>
> Are you sure you have the word2vec description right?
>
> https://en.wikipedia.org/wiki/Word2vec claims the dimensionality of
> word2vec is typically in the range of 100 to 1,000, which would only
> allow treatment of a rather limited vocabulary if each dimension
> corresponded to a distinct word.
>
> The idea I get from reading
> https://www.tensorflow.org/tutorials/word2vec is that word2vec
> instead creates an arbitrary n-dimensional vector for each vocabulary
> word (the example code they showed used a 120-element vector to
> represent a word, though that writeup also considered the example of
> a 2-dimensional vector). Each dimension of each of these vectors is
> initialized with a random value in the range [0..1]. Then they
> extract adjacent word pairs from a body of text and train a neural
> network to treat those pairs as positive results vs. arbitrary word
> pairs (which the net is trained to treat as negative results).
>
> There are more details (some of which have to do with how many
> repetitions of each pair they wind up using), but that gets into all
> the usual problems and issues with tuning neural-net training...
>
> Thanks,
>
> --
> Raul
>
> On Tue, Feb 20, 2018 at 8:44 PM, Skip Cave <[email protected]> wrote:
> > Bo,
> >
> > I read your "Ordinal Fractions
> > <https://drive.google.com/file/d/15-1Z75tkBBgIj7IbNeRy8SpWZc93UuPq/view?usp=sharing>"
> > paper. In it you propose ordinal fractions as a system for
> > numerically defining data categories and indices, along with an
> > associated algebra for manipulating the data. As you say in the
> > paper, "the algebraic properties of ordinal fractions depict closely
> > the logical properties of natural language". That's intriguing to
> > me, as NLP is the avenue of research that I am currently pursuing.
> >
> > Your examples show how you can take a set of human characteristics:
> > boy, girl; naughty, nice; black hair, blonde; and encode those
> > characteristics in a numerical index.
> >
> > Recently, several attempts have again been made to define the
> > meanings of words numerically. The most notable recent attempts were
> > Mikolov's Word2Vec <https://en.wikipedia.org/wiki/Word2vec> and
> > Pennington's GloVe <https://nlp.stanford.edu/pubs/glove.pdf>.
> >
> > Word2Vec starts from what are called 'one-hot' word embeddings. This
> > encodes each word as a lengthy unique vector containing all zeros
> > except for a single one in the position assigned to that specific
> > word, so the length of the vector equals the number of different
> > words available. They then train shallow, two-layer neural networks
> > <https://en.wikipedia.org/wiki/Neural_network> to reconstruct the
> > linguistic contexts of words by comparing adjacent words.
> >
> > GloVe uses a weighted least-squares model that trains on global
> > word-word co-occurrence counts, and so makes efficient use of corpus
> > statistics. The GloVe model produces a word vector space with
> > meaningful substructure, as evidenced by its state-of-the-art
> > performance of 75% accuracy on the word-analogy data set.
> >
> > Both word2vec and GloVe provide open-source datasets of word vectors
> > that have been trained on massive text corpora. Each dataset claims
> > to represent the meaning of each of its words by a lengthy numerical
> > vector.
> >
> > A third, even more recent word-vector database has been created by a
> > company called Luminoso. They have combined the word2vec and GloVe
> > datasets into a new set they call Numberbatch <https://goo.gl/XC7A77>,
> > which they have also open-sourced.
> >
> > With all these word-embedding data sets, you theoretically should be
> > able to pick a word and then find other words with similar meanings
> > by simply finding nearby vectors, in the Euclidean sense of distance
> > in a multi-dimensional space. So these datasets are kind of like a
> > numerical thesaurus. Even more interesting is that parallel vectors
> > in that space have related meanings: man is to woman as king is to
> > queen. The vectors between the words in these two pairs are
> > parallel, and of similar length.
> >
> > Skip
> >
> > Skip Cave
> > Cave Consulting LLC
> >
> > On Tue, Feb 20, 2018 at 4:20 PM, 'Bo Jacoby' via Programming
> > <[email protected]> wrote:
> >
> >> ORDINAL FRACTIONS - the algebra of data
> >> This paper was submitted to the 10th World Computer Congress,
> >> IFIP 1986 conference, but rejected by the referee....
> >>
> >> On Tuesday, 20 February 2018 at 22:42, Skip Cave
> >> <[email protected]> wrote:
> >>
> >> Very nice! Thanks Raul.
> >>
> >> However, there is something wrong with the cosine similarity, which
> >> for vectors like these should always be between 0 and 1:
> >>
> >>    prod =: +/ .*
> >>    1 1 1 (prod % %:@*&prod) 0 3 3
> >> 1.41421
> >>
> >> Skip
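
That 1.41421 comes from the denominator: %:@*&prod applies prod monadically to each argument before multiplying, so it is not the product of the two vector norms. A minimal corrected sketch (cossim is just an illustrative name) divides the dot product by the product of the norms; cosine similarity then stays in [-1, 1] in general, and in [0, 1] for non-negative vectors:

   prod   =: +/ .*
   cossim =: prod % *&(+/&.:*:)   NB. dot product over the product of the norms
   1 1 1 cossim 0 3 3
0.816497
   1 0 0 cossim 0 1 0
0

Here +/&.:*: computes the vector norm (sum of squares taken under squaring, so the square root is applied at the end), the same idiom that appears in Raul's Euclidean distance below.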

> >> On Tue, Feb 20, 2018 at 2:27 PM, Raul Miller <[email protected]> wrote:
> >>
> >> > I don't know about blog entries - I think there are probably some
> >> > that partially cover this topic.
> >> >
> >> > But it shouldn't be hard to implement most of these operations:
> >> >
> >> > Euclidean distance:
> >> >
> >> >    1 0 0 +/&.:*:@:- 0 1 0
> >> > 1.41421
> >> >
> >> > Manhattan distance:
> >> >
> >> >    1 0 0 +/@:|@:- 0 1 0
> >> > 2
> >> >
> >> > Minkowski distances:
> >> >
> >> >    minkowski=: 1 :'m %: [:+/ m ^~ [:| -'
> >> >    1 0 0 (1 minkowski) 0 1 0
> >> > 2
> >> >    1 0 0 (2 minkowski) 0 1 0
> >> > 1.41421
> >> >
> >> > Cosine similarity:
> >> >
> >> >    prod =: +/ .*
> >> >    1 0 0 (prod % %:@*&prod) 0 1 0
> >> > 0
> >> >
> >> > Jaccard similarity:
> >> >
> >> >    union =: ~.@,
> >> >    intersect =: [ ~.@:-. -.
> >> >    1 0 0 (intersect %&# union) 0 1 0
> >> > 1
> >> >
> >> > You'll probably want to use these at rank 1 ("1) if you're
> >> > operating on collections of vectors.
> >> >
> >> > But I'm a little dubious about the usefulness of Jaccard
> >> > similarity, because of the assumptions it brings to bear (you're
> >> > basically encoding sets as vectors, which means your
> >> > multidimensional vector space is just a way of encoding a single
> >> > unordered dimension).
> >> >
> >> > Anyway, I hope this helps,
> >> >
> >> > --
> >> > Raul
> >> >
> >> > On Tue, Feb 20, 2018 at 2:08 PM, Skip Cave <[email protected]> wrote:
> >> > > One of the hottest topics in data science today is the
> >> > > representation of data characteristics using large
> >> > > multi-dimensional arrays. Each datum is represented as a data
> >> > > point, or multi-element vector, in an array that can have
> >> > > hundreds of dimensions. In these arrays, each dimension
> >> > > represents a different attribute of the data.
> >> > >
> >> > > Much useful information can be gleaned by examining the
> >> > > similarity, or distance, between vectors in the array. However,
> >> > > there are many different ways to measure the similarity of two
> >> > > or more vectors in a multidimensional space.
> >> > >
> >> > > Some common similarity/distance measures:
> >> > >
> >> > > 1. Euclidean distance
> >> > > <https://en.wikipedia.org/wiki/Euclidean_distance>: The length
> >> > > of the straight line between two data points.
> >> > >
> >> > > 2. Manhattan distance
> >> > > <https://en.wikipedia.org/wiki/Taxicab_geometry>: Also known as
> >> > > Manhattan length, rectilinear distance, L1 distance or L1 norm,
> >> > > city-block distance, Minkowski's L1 distance, or the taxi-cab
> >> > > metric.
> >> > >
> >> > > 3. Minkowski distance
> >> > > <https://en.wikipedia.org/wiki/Minkowski_distance>: A
> >> > > generalized metric form of Euclidean distance and Manhattan
> >> > > distance.
> >> > >
> >> > > 4. Cosine similarity
> >> > > <https://en.wikipedia.org/wiki/Cosine_similarity>: The cosine
> >> > > of the angle between two vectors. The cosine will be between
> >> > > -1 and 1, where 1 is alike and 0 is not alike.
> >> > >
> >> > > 5. Jaccard similarity
> >> > > <https://en.wikipedia.org/wiki/Jaccard_index>: The cardinality
> >> > > of the intersection of the sample sets divided by the
> >> > > cardinality of their union.
> >> > >
> >> > > Each of these metrics is useful in specific data-analysis
> >> > > situations.
> >> > >
> >> > > In many cases, one also wants to know the similarity between
> >> > > clusters of points, or between a point and a cluster of points.
> >> > > In those cases, the centroid of a set of points is also useful
> >> > > to have, since it can then be used with the various
> >> > > distance/similarity measurements.
> >> > >
> >> > > Is there any essay or blog covering these common metrics using
> >> > > the J language? It would seem that J is perfectly suited for
> >> > > calculating these metrics, but I haven't been able to find much
> >> > > on this topic on the J software site. I thought I would ask on
> >> > > this forum before I go off to see what my rather rudimentary J
> >> > > skills can come up with.
> >> > >
> >> > > Skip
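
On the centroid point above: the centroid of a set of points is just the mean over the rows, which in J is the classic (+/ % #), and the distance verbs in Raul's reply apply to a whole collection of points at rank 1, as he suggests. A small sketch with made-up points:

   centroid =: +/ % #                  NB. mean over the leading axis
   pts =: 3 3 $ 1 0 0 0 1 0 0 0 3      NB. three points in 3-space
   centroid pts
0.333333 0.333333 1
   NB. Euclidean distance from each point to the centroid, at rank 1
   pts +/&.:*:@:-"1 centroid pts
1.24722 1.24722 2.0548

The same rank-1 pattern works with any of the other distance or similarity verbs above.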

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
