Thank you, Skip!
I tried to publish an Ordinal Fraction article on Wikipedia, but it was
removed because original research is not allowed on Wikipedia. However,
somebody copied it to this link: StateMaster - Encyclopedia: Ordinal
fraction. I think that the most obvious use of ordinal fractions is to
unify, simplify, and replace arrays, trees, and relational databases. It is
frighteningly easy.
  
    On Wednesday, 21 February 2018 at 2:45, Skip Cave
<[email protected]> wrote:
 

Bo,

I read your paper "Ordinal Fractions
<https://drive.google.com/file/d/15-1Z75tkBBgIj7IbNeRy8SpWZc93UuPq/view?usp=sharing>".
In it you propose ordinal fractions as a system for numerically defining
data categories and indices, along with an associated algebra for
manipulating the data. As you say in the paper, "the algebraic properties
of ordinal fractions depict closely the logical properties of natural
language". That's intriguing to me, as NLP is the avenue of research that I
am currently pursuing.

Your examples show how you can take a set of human characteristics: boy,
girl; naughty, nice; black hair, blonde; and encode those characteristics
in a numerical index.

Recently, several attempts have again been made to define the meanings of
words numerically. The most notable recent attempts were Mikolov's Word2Vec
<https://en.wikipedia.org/wiki/Word2vec> and Pennington's GloVe
<https://nlp.stanford.edu/pubs/glove.pdf>.

Word2Vec starts from 'one-hot' word encodings. Each word is encoded as a
lengthy vector containing all zeros except for a single one at the position
assigned to that specific word, so the length of the vector equals the
number of different words available. Word2Vec then trains shallow,
two-layer neural networks
<https://en.wikipedia.org/wiki/Neural_network> to reconstruct the
linguistic contexts of words by comparing adjacent words.
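
A one-hot encoding is tiny in J. A sketch, with a made-up four-word
vocabulary:

   vocab =: ;: 'king queen man woman'
   onehot =: (i.@#@[) = i.   NB. 1 at the word's index in x, 0 elsewhere
   vocab onehot <'man'
0 0 1 0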

GloVe uses a specific weighted least squares model that trains on global
word-word co-occurrence counts, and so makes efficient use of corpus
statistics. The GloVe model produces a word vector space with meaningful
substructure, as evidenced by its state-of-the-art performance of 75%
accuracy on the word analogy data set.
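
The GloVe paper's weighting and per-pair cost are easy to state in J. A
sketch with made-up vectors, biases, and count (xmax=100 and alpha=0.75 are
the paper's defaults):

   f =: 1 <. 0.75 ^~ 100 %~ ]   NB. weight for count n: min(1, (n%100)^0.75)
   NB. one pair: word/context vectors w,wc; biases b,bc; co-occurrence n
   n =: 10 [ w =: 0.3 0.1 [ wc =: 0.2 0.4 [ b =: 0.05 [ bc =: 0.02
   (f n) * *: (b + bc + w +/ .* wc) - ^. n
0.808747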

Both Word2Vec and GloVe offer open-source datasets of word vectors that
have been trained on massive text corpora. Each dataset claims to represent
the meaning of each of the words in the dataset by a lengthy numerical
vector.

A third, even more recent word vector database has been created by a
company called Luminoso. They have combined the Word2Vec and GloVe datasets
into a new set they call Numberbatch <https://goo.gl/XC7A77>, which they
have also open-sourced.

With all these word embedding datasets, you theoretically should be able
to pick a word, and then find other words with similar meanings by simply
finding nearby vectors, in the Euclidean sense of distance in a
multi-dimensional space. So these datasets are something like a numerical
thesaurus. Even more interesting is that parallel vectors in that space
have related meanings: man is to woman as king is to queen. The difference
vectors within these two pairs of words are roughly parallel, and of
similar length.
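
That analogy trick reduces to a nearest-neighbor search. A J sketch on a
toy table (the four 2-D vectors are made up; real embeddings have hundreds
of columns):

   prod =: +/ .*
   norm =: %:@(+/@:*:)     NB. Euclidean length of a vector
   cos  =: prod % *&norm   NB. cosine similarity of two vectors
   vocab =: ;: 'king queen man woman'
   vecs =: 4 2 $ 0.9 0.9  0.9 0.1  0.1 0.9  0.1 0.1
   vec =: {&vecs@(vocab&i.)   NB. look up a word's vector
   NB. king - man + woman should land nearest to queen
   target =: (vec <'woman') + (vec <'king') - vec <'man'
   > vocab {~ (i. >./) target cos"1 vecs
queen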

Skip


Skip Cave
Cave Consulting LLC

On Tue, Feb 20, 2018 at 4:20 PM, 'Bo Jacoby' via Programming <
[email protected]> wrote:

> ORDINAL FRACTIONS - the algebra of data
>
> This paper was submitted to the 10th World Computer Congress, IFIP 1986
> conference, but rejected by the referee....
>
>    On Tuesday, 20 February 2018 at 22:42, Skip Cave
> <[email protected]> wrote:
>
>
>  Very nice! Thanks Raul.
>
> However, there is something wrong about the cosine similarity,
> whose magnitude should never exceed 1:
>
>    prod=:+/ .*
>    1 1 1 (prod % %:@*&prod) 0 3 3
> 1.41421
>
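> (It looks like *&prod applies prod monadically, and the monad of +/ .* on
> a plain vector simply sums it, so the denominator comes out too small. A
> sketch of a fix, using prod~ y, that is y prod y, as the squared norm:
>
>    cos =: prod % %:@*&(prod~)
>    1 1 1 cos 0 3 3
> 0.816497
>
> which is 2 % %: 6, as expected.)
>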
> ​Skip
>
> On Tue, Feb 20, 2018 at 2:27 PM, Raul Miller <[email protected]>
> wrote:
>
> > I don't know about blog entries - I think there are probably some that
> > partially cover this topic.
> >
> > But it shouldn't be hard to implement most of these operations:
> >
> > Euclidean distance:
> >
> >    1 0 0 +/&.:*:@:- 0 1 0
> > 1.41421
> >
> > Manhattan distance:
> >
> >    1 0 0 +/@:|@:- 0 1 0
> > 2
> >
> > Minkowski distances:
> >
> >    minkowski=: 1 :'m %: [:+/ m ^~ [:| -'
> >    1 0 0 (1 minkowski) 0 1 0
> > 2
> >    1 0 0 (2 minkowski) 0 1 0
> > 1.41421
> >
> > Cosine similarity:
> >
> >    prod=:+/ .*
> >    1 0 0 (prod % %:@*&prod) 0 1 0
> > 0
> >
> > Jaccard similarity:
> >
> >    union=: ~.@,
> >    intersect=: [ ~.@:-. -.
> >    1 0 0 (intersect %&# union) 0 1 0
> > 1
> >
> > You'll probably want to use these at rank 1 ("1) if you're operating
> > on collections of vectors.
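> >
> > For example, the Euclidean verb above at rank 1, against a table of
> > vectors (one point per row):
> >
> >    pts =: 3 3 $ 1 0 0  0 1 0  0 0 1
> >    1 0 0 +/&.:*:@:-"1 pts
> > 0 1.41421 1.41421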
> >
> > But, I'm a little dubious about the usefulness of Jaccard similarity,
> > because of the assumptions it brings to bear (you're basically
> > encoding sets as vectors, which means your multidimensional vector
> > space is just a way of encoding a single unordered dimension).
> >
> > Anyways, I hope this helps,
> >
> > --
> > Raul
> >
> >
> >
> > On Tue, Feb 20, 2018 at 2:08 PM, Skip Cave <[email protected]>
> > wrote:
> > > One of the hottest topics in data science today is the representation
> > > of data characteristics using large multi-dimensional arrays. Each
> > > datum is represented as a data point, or multi-element vector, in an
> > > array that can have hundreds of dimensions. In these arrays, each
> > > dimension represents a different attribute of the data.
> > >
> > > Much useful information can be gleaned by examining the similarity,
> > > or distance, between vectors in the array. However, there are many
> > > different ways to measure the similarity of two or more vectors in a
> > > multidimensional space.
> > >
> > > Some common similarity/distance measures:
> > >
> > > 1. Euclidean distance
> > > <https://en.wikipedia.org/wiki/Euclidean_distance>: The length of the
> > > straight line between two data points.
> > >
> > > 2. Manhattan distance
> > > <https://en.wikipedia.org/wiki/Taxicab_geometry>: Also known as
> > > Manhattan length, rectilinear distance, L1 distance or L1 norm, city
> > > block distance, Minkowski's L1 distance, or taxi-cab metric.
> > >
> > > 3. Minkowski distance
> > > <https://en.wikipedia.org/wiki/Minkowski_distance>: A generalized
> > > metric form of Euclidean distance and Manhattan distance.
> > >
> > > 4. Cosine similarity
> > > <https://en.wikipedia.org/wiki/Cosine_similarity>: The cosine of the
> > > angle between two vectors. The cosine lies between -1 and 1 (between
> > > 0 and 1 for vectors with non-negative components), where 1 means
> > > alike and 0 means unrelated.
> > >
> > > 5. Jaccard similarity <https://en.wikipedia.org/wiki/Jaccard_index>:
> > > The cardinality of the intersection of the sample sets divided by the
> > > cardinality of their union.
> > >
> > > Each of these metrics is useful in specific data analysis situations.
> > >
> > > In many cases, one also wants to know the similarity between clusters
> > > of points, or between a point and a cluster of points. In these
> > > cases, the centroid of a set of points is also useful to have, and
> > > can then be used with the various distance/similarity measurements.
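> > >
> > > (Computing the centroid itself looks easy enough; presumably it is
> > > just the classic mean fork applied to a table of points, one point
> > > per row:
> > >
> > >    centroid =: +/ % #
> > >    centroid 3 3 $ 1 0 0  0 1 0  0 0 1
> > > 0.333333 0.333333 0.333333
> > >
> > > giving the mean of each coordinate.)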
> > >
> > > Is there any essay or blog covering these common metrics using the J
> > > language? It would seem that J is perfectly suited for calculating
> > > these metrics, but I haven't been able to find much on this topic on
> > > the J software site. I thought I would ask on this forum before I go
> > > off to see what my rather rudimentary J skills can come up with.
> > >
> > > Skip
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm