Why not submit it to arXiv.org?
R.E. Boss

-----Original Message-----
From: Programming [mailto:[email protected]] On Behalf Of 'Bo Jacoby' via Programming
Sent: Wednesday, 21 February 2018 09:07
To: [email protected]
Subject: Re: [Jprogramming] Vector Similarity

Thank you, Skip!
I tried to publish an Ordinal Fraction article on Wikipedia, but it was removed, because original research is not allowed on Wikipedia. However, somebody copied it to this link: StateMaster - Encyclopedia: Ordinal fraction. I think that the most obvious use of ordinal fractions is to unify, simplify, and replace arrays, trees, and relational databases. It is scarily easy.

On Wednesday, 21 February 2018 at 2:45, Skip Cave <[email protected]> wrote:

Bo,

I read your paper "Ordinal Fractions <https://drive.google.com/file/d/15-1Z75tkBBgIj7IbNeRy8SpWZc93UuPq/view?usp=sharing>". In it you propose ordinal fractions as a system for numerically defining data categories and indices, along with an associated algebra for manipulating the data. As you say in the paper, "the algebraic properties of ordinal fractions depict closely the logical properties of natural language". That is intriguing to me, as NLP is the avenue of research I am currently pursuing.

Your examples show how you can take a set of human characteristics (boy, girl; naughty, nice; black hair, blonde) and encode those characteristics in a numerical index.

Recently, several attempts have again been made to define the meanings of words numerically. The most notable recent attempts are Mikolov's Word2vec <https://en.wikipedia.org/wiki/Word2vec> and Pennington's GloVe <https://nlp.stanford.edu/pubs/glove.pdf>.

Word2vec starts from 'one-hot' word encodings: each word is represented by a lengthy vector that is all zeros except for a single one marking that specific word, so the length of the vector equals the size of the vocabulary. It then trains shallow, two-layer neural networks <https://en.wikipedia.org/wiki/Neural_network> to reconstruct the linguistic contexts of words by comparing adjacent words.

GloVe uses a weighted least-squares model that trains on global word-word co-occurrence counts, making efficient use of corpus statistics. The GloVe model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy data set.

Both Word2vec and GloVe provide open-source datasets of word vectors trained on massive text corpora. Each dataset claims to represent the meaning of each word in its vocabulary by a lengthy numerical vector.

A third, even more recent word vector database has been created by a company called Luminoso. They have combined the Word2vec and GloVe datasets into a new set they call Numberbatch <https://goo.gl/XC7A77>, which they have also open-sourced.

With all these word-embedding datasets, you should in theory be able to pick a word and then find other words with similar meanings by simply finding nearby vectors, in the Euclidean sense of distance in a multi-dimensional space. So these datasets are something like a numerical thesaurus.
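A minimal J sketch of that nearest-word idea (the vocabulary and the 4-element "embeddings" below are made up for illustration; real embedding tables have many thousands of rows and hundreds of columns):

   NB. hypothetical toy vocabulary; the vectors are invented for the example
   words =: ;: 'man woman king queen apple'
   vecs  =: 5 4 $ 0.2 0.1 0.9 0.3  0.2 0.8 0.9 0.3  0.7 0.1 0.9 0.8  0.7 0.8 0.9 0.8  0.1 0.5 0.1 0.2
   cos   =: (+/@:*) % *&(%:@(+/@:*:))   NB. cosine: dot product % product of norms
   sims  =: (vecs {~ words i. <'king') cos"1 vecs   NB. 'king' vs every word
   words \: sims                        NB. most similar first
+----+-----+---+-----+-----+
|king|queen|man|woman|apple|
+----+-----+---+-----+-----+

With a real dataset, vecs would simply be loaded from the Word2vec, GloVe, or Numberbatch files, and the same verbs apply unchanged.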
Even more interesting is that parallel vectors in that space have related meanings: man is to woman as king is to queen. The vectors between these two pairs of related words are roughly parallel and of similar length.

Skip

Skip Cave
Cave Consulting LLC

On Tue, Feb 20, 2018 at 4:20 PM, 'Bo Jacoby' via Programming <[email protected]> wrote:

ORDINAL FRACTIONS - the algebra of data
This paper was submitted to the 10th World Computer Congress, IFIP 1986 conference, but rejected by the referee....

On Tuesday, 20 February 2018 at 22:42, Skip Cave <[email protected]> wrote:

Very nice! Thanks Raul.

However, there is something wrong about the cosine similarity, which should always be between 0 and 1 for these non-negative vectors:

   prod =: +/ .*
   1 1 1 (prod % %:@*&prod) 0 3 3
1.41421

Skip

On Tue, Feb 20, 2018 at 2:27 PM, Raul Miller <[email protected]> wrote:

I don't know about blog entries - I think there are probably some that partially cover this topic.

But it shouldn't be hard to implement most of these operations:

Euclidean distance:

   1 0 0 +/&.:*:@:- 0 1 0
1.41421

Manhattan distance:

   1 0 0 +/@:|@:- 0 1 0
2

Minkowski distances:

   minkowski =: 1 : 'm %: [: +/ m ^~ [: | -'
   1 0 0 (1 minkowski) 0 1 0
2
   1 0 0 (2 minkowski) 0 1 0
1.41421

Cosine similarity:

   prod =: +/ .*
   1 0 0 (prod % %:@*&prod) 0 1 0
0

Jaccard similarity:

   union =: ~.@,
   intersect =: [ ~.@:-. -.
   1 0 0 (intersect %&# union) 0 1 0
1

You'll probably want to use these at rank 1 ("1) if you're operating on collections of vectors.

But I'm a little dubious about the usefulness of Jaccard similarity here, because of the assumptions it brings to bear (you're basically encoding sets as vectors, which means your multidimensional vector space is just a way of encoding a single unordered dimension).

Anyway, I hope this helps,

--
Raul
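A note on the anomaly Skip observed above: in %:@*&prod, the conjunction & applies prod monadically to each argument, and monadic +/ .* on a vector simply sums it. The verb therefore divides the dot product by %: (+/x) * (+/y) rather than by the product of the two norms (indeed, 6 % %: 3 * 6 gives the observed 1.41421). The 1 0 0 and 0 1 0 test masked the bug because its numerator is zero. One corrected spelling (a sketch; other formulations work equally well):

   prod =: +/ .*
   norm =: %:@(+/@:*:)        NB. Euclidean length of a vector
   cos  =: prod % *&norm      NB. dot product % product of the norms
   1 1 1 cos 0 3 3
0.816497
   1 0 0 cos 0 1 0
0

In general the cosine ranges over _1 to 1; it stays in 0 to 1 only when all the vector components are non-negative, as in these examples.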
On Tue, Feb 20, 2018 at 2:08 PM, Skip Cave <[email protected]> wrote:

One of the hottest topics in data science today is the representation of data characteristics using large multi-dimensional arrays. Each datum is represented as a data point, a multi-element vector in an array that can have hundreds of dimensions. In these arrays, each dimension represents a different attribute of the data.

Much useful information can be gleaned by examining the similarity, or distance, between vectors in the array. However, there are many different ways to measure the similarity of two or more vectors in a multidimensional space.

Some common similarity/distance measures:

1. Euclidean distance <https://en.wikipedia.org/wiki/Euclidean_distance>: The length of the line between two data points.

2. Manhattan distance <https://en.wikipedia.org/wiki/Taxicab_geometry>: Also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, Minkowski's L1 distance, taxi-cab metric, or city-block distance.

3. Minkowski distance <https://en.wikipedia.org/wiki/Minkowski_distance>: A generalized metric form of Euclidean distance and Manhattan distance.

4. Cosine similarity <https://en.wikipedia.org/wiki/Cosine_similarity>: The cosine of the angle between two vectors. For non-negative data the cosine will be between 0 and 1, where 1 is alike and 0 is not alike.

5. Jaccard similarity <https://en.wikipedia.org/wiki/Jaccard_index>: The cardinality of the intersection of the sample sets divided by the cardinality of their union.

Each of these metrics is useful in specific data analysis situations.

In many cases, one also wants to know the similarity between clusters of points, or between a point and a cluster of points. In these cases the centroid of a set of points is also a useful quantity to have, which can then be used with the various distance/similarity measurements (see the sketch at the end of this message).

Is there any essay or blog covering these common metrics in the J language? It would seem that J is perfectly suited for calculating them, but I haven't been able to find much on this topic on the J software site. I thought I would ask on this forum before I go off to see what my rather rudimentary J skills can come up with.

Skip
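On the centroid point, a minimal sketch (the point set pts and the query point are invented for the example; edist is the Euclidean-distance verb from Raul's message above):

   centroid =: +/ % #              NB. mean over the points (rows) of a set
   edist    =: +/&.:*:@:-          NB. Euclidean distance, as defined above
   pts =: 3 2 $ 0 0  2 0  1 3      NB. three 2-D points
   centroid pts
1 1
   5 5 edist centroid pts          NB. distance from a point to the cluster's centre
5.65685

Cluster-to-cluster similarity then falls out by applying any of the distance verbs to the two centroids, and all of these work on tables of points at rank 1.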
