Why not submit it to arXiv.org?
R.E. Boss

-----Original Message-----
From: Programming [mailto:[email protected]] On Behalf Of 'Bo Jacoby' via Programming
Sent: Wednesday, 21 February 2018 09:07
To: [email protected]
Subject: Re: [Jprogramming] Vector Similarity

Thank you, Skip!
I tried to publish an Ordinal Fraction article on Wikipedia, but it was removed, because original research is not allowed on Wikipedia. However, somebody copied it to this link: StateMaster - Encyclopedia: Ordinal fraction. I think that the most obvious use of ordinal fractions is to unify, simplify, and replace arrays, trees, and relational databases. It is scarily easy.

On Wednesday, 21 February 2018 at 2:45, Skip Cave <[email protected]> wrote:

Bo,

I read your paper "Ordinal Fractions <https://drive.google.com/file/d/15-1Z75tkBBgIj7IbNeRy8SpWZc93UuPq/view?usp=sharing>". In it you propose ordinal fractions as a system for numerically defining data categories and indices, along with an associated algebra for manipulating the data. As you say in the paper, "the algebraic properties of ordinal fractions depict closely the logical properties of natural language". That is intriguing to me, as NLP is the avenue of research I am currently pursuing.

Your examples show how you can take a set of human characteristics (boy, girl; naughty, nice; black hair, blonde) and encode those characteristics in a numerical index.

Recently, several attempts have again been made to define the meanings of words numerically. The most notable recent attempts are Mikolov's Word2vec <https://en.wikipedia.org/wiki/Word2vec> and Pennington's GloVe <https://nlp.stanford.edu/pubs/glove.pdf>.

Word2vec starts from 'one-hot' word encodings: each word is represented by a lengthy vector that is all zeros except for a single one marking that specific word, so the length of the vector equals the size of the vocabulary. It then trains shallow, two-layer neural networks <https://en.wikipedia.org/wiki/Neural_network> to reconstruct the linguistic contexts of words by comparing adjacent words.

GloVe uses a weighted least-squares model that trains on global word-word co-occurrence counts, making efficient use of corpus statistics. The GloVe model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy data set.

Both Word2vec and GloVe provide open-source datasets of word vectors trained on massive text corpora. Each dataset claims to represent the meaning of each word in its vocabulary by a lengthy numerical vector.

A third, even more recent word vector database has been created by a company called Luminoso. They have combined the Word2vec and GloVe datasets into a new set they call Numberbatch <https://goo.gl/XC7A77>, which they have also open-sourced.

With all these word-embedding datasets, you should in theory be able to pick a word and then find other words with similar meanings by simply finding nearby vectors, in the Euclidean sense of distance in a multi-dimensional space. So these datasets are something like a numerical thesaurus.
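A minimal J sketch of that nearest-word idea (the vocabulary and the 4-element "embeddings" below are made up for illustration; real embedding tables have many thousands of rows and hundreds of columns):

   NB. hypothetical toy vocabulary; the vectors are invented for the example
   words =: ;: 'man woman king queen apple'
   vecs  =: 5 4 $ 0.2 0.1 0.9 0.3  0.2 0.8 0.9 0.3  0.7 0.1 0.9 0.8  0.7 0.8 0.9 0.8  0.1 0.5 0.1 0.2
   cos   =: (+/@:*) % *&(%:@(+/@:*:))   NB. cosine: dot product % product of norms
   sims  =: (vecs {~ words i. <'king') cos"1 vecs   NB. 'king' vs every word
   words \: sims                        NB. most similar first
+----+-----+---+-----+-----+
|king|queen|man|woman|apple|
+----+-----+---+-----+-----+

With a real dataset, vecs would simply be loaded from the Word2vec, GloVe, or Numberbatch files, and the same verbs apply unchanged.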
Even more interesting is that parallel vectors in that space have related meanings: man is to woman as king is to queen. The vectors between these two pairs of related words are roughly parallel and of similar length.

Skip

Skip Cave
Cave Consulting LLC

On Tue, Feb 20, 2018 at 4:20 PM, 'Bo Jacoby' via Programming <[email protected]> wrote:

ORDINAL FRACTIONS - the algebra of data
This paper was submitted to the 10th World Computer Congress, IFIP 1986 conference, but rejected by the referee....

On Tuesday, 20 February 2018 at 22:42, Skip Cave <[email protected]> wrote:

Very nice! Thanks Raul.

However, there is something wrong about the cosine similarity, which should always be between 0 and 1 for these non-negative vectors:

   prod =: +/ .*
   1 1 1 (prod % %:@*&prod) 0 3 3
1.41421

Skip

On Tue, Feb 20, 2018 at 2:27 PM, Raul Miller <[email protected]> wrote:

I don't know about blog entries - I think there are probably some that partially cover this topic.

But it shouldn't be hard to implement most of these operations:

Euclidean distance:

   1 0 0 +/&.:*:@:- 0 1 0
1.41421

Manhattan distance:

   1 0 0 +/@:|@:- 0 1 0
2

Minkowski distances:

   minkowski =: 1 : 'm %: [: +/ m ^~ [: | -'
   1 0 0 (1 minkowski) 0 1 0
2
   1 0 0 (2 minkowski) 0 1 0
1.41421

Cosine similarity:

   prod =: +/ .*
   1 0 0 (prod % %:@*&prod) 0 1 0
0

Jaccard similarity:

   union =: ~.@,
   intersect =: [ ~.@:-. -.
   1 0 0 (intersect %&# union) 0 1 0
1

You'll probably want to use these at rank 1 ("1) if you're operating on collections of vectors.

But I'm a little dubious about the usefulness of Jaccard similarity here, because of the assumptions it brings to bear (you're basically encoding sets as vectors, which means your multidimensional vector space is just a way of encoding a single unordered dimension).

Anyway, I hope this helps,

--
Raul
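A note on the anomaly Skip observed above: in %:@*&prod, the conjunction & applies prod monadically to each argument, and monadic +/ .* on a vector simply sums it. The verb therefore divides the dot product by %: (+/x) * (+/y) rather than by the product of the two norms (indeed, 6 % %: 3 * 6 gives the observed 1.41421). The 1 0 0 and 0 1 0 test masked the bug because its numerator is zero. One corrected spelling (a sketch; other formulations work equally well):

   prod =: +/ .*
   norm =: %:@(+/@:*:)        NB. Euclidean length of a vector
   cos  =: prod % *&norm      NB. dot product % product of the norms
   1 1 1 cos 0 3 3
0.816497
   1 0 0 cos 0 1 0
0

In general the cosine ranges over _1 to 1; it stays in 0 to 1 only when all the vector components are non-negative, as in these examples.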
On Tue, Feb 20, 2018 at 2:08 PM, Skip Cave <[email protected]> wrote:

One of the hottest topics in data science today is the representation of data characteristics using large multi-dimensional arrays. Each datum is represented as a data point, a multi-element vector in an array that can have hundreds of dimensions. In these arrays, each dimension represents a different attribute of the data.

Much useful information can be gleaned by examining the similarity, or distance, between vectors in the array. However, there are many different ways to measure the similarity of two or more vectors in a multidimensional space.

Some common similarity/distance measures:

1. Euclidean distance <https://en.wikipedia.org/wiki/Euclidean_distance>: The length of the line between two data points.

2. Manhattan distance <https://en.wikipedia.org/wiki/Taxicab_geometry>: Also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, Minkowski's L1 distance, taxi-cab metric, or city-block distance.

3. Minkowski distance <https://en.wikipedia.org/wiki/Minkowski_distance>: A generalized metric form of Euclidean distance and Manhattan distance.

4. Cosine similarity <https://en.wikipedia.org/wiki/Cosine_similarity>: The cosine of the angle between two vectors. For non-negative data the cosine will be between 0 and 1, where 1 is alike and 0 is not alike.

5. Jaccard similarity <https://en.wikipedia.org/wiki/Jaccard_index>: The cardinality of the intersection of the sample sets divided by the cardinality of their union.

Each of these metrics is useful in specific data analysis situations.

In many cases, one also wants to know the similarity between clusters of points, or between a point and a cluster of points. In these cases the centroid of a set of points is also a useful quantity to have, which can then be used with the various distance/similarity measurements (see the sketch at the end of this message).

Is there any essay or blog covering these common metrics in the J language? It would seem that J is perfectly suited for calculating them, but I haven't been able to find much on this topic on the J software site. I thought I would ask on this forum before I go off to see what my rather rudimentary J skills can come up with.

Skip
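On the centroid point, a minimal sketch (the point set pts and the query point are invented for the example; edist is the Euclidean-distance verb from Raul's message above):

   centroid =: +/ % #              NB. mean over the points (rows) of a set
   edist    =: +/&.:*:@:-          NB. Euclidean distance, as defined above
   pts =: 3 2 $ 0 0  2 0  1 3      NB. three 2-D points
   centroid pts
1 1
   5 5 edist centroid pts          NB. distance from a point to the cluster's centre
5.65685

Cluster-to-cluster similarity then falls out by applying any of the distance verbs to the two centroids, and all of these work on tables of points at rank 1.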
