Re: [Jprogramming] Vector Similarity

Raul Miller Tue, 20 Feb 2018 12:28:17 -0800

I don't know of any blog entries dedicated to this, though there are
probably some that partially cover the topic.


But it shouldn't be hard to implement most of these operations:

Euclidean distance:

   1 0 0 +/&.:*:@:- 0 1 0
1.41421

Manhattan distance:

   1 0 0 +/@:|@:- 0 1 0
2

Minkowski distances:

   NB. m-th root of the sum of the m-th powers of |x-y|
   minkowski=: 1 :'m %: [:+/ m ^~ [:| -'
   1 0 0 (1 minkowski) 0 1 0
2
   1 0 0 (2 minkowski) 0 1 0
1.41421

Cosine similarity:

   prod=:+/ .*
   NB. prod~ y is y prod y, the squared length of y
   1 0 0 (prod % %:@*&(prod~)) 0 1 0
0
   1 2 3 (prod % %:@*&(prod~)) 1 2 3
1

Jaccard similarity:

   union=: ~.@,               NB. unique items of x,y
   intersect=: [ ~.@:-. -.    NB. unique items of x that also occur in y
   1 0 0 (intersect %&# union) 0 1 0
1

You'll probably want to use these at rank 1 ("1) if you're operating
on collections of vectors.
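
As a minimal sketch of that rank 1 usage (the names dist, pts and
centroid are made up for illustration and are not from any library), a
vector can be compared against every row of a table, and the rows can be
compared against their centroid:

```j
NB. Illustrative names only: dist, pts and centroid are assumptions for this sketch.
dist=: +/&.:*:@:-            NB. Euclidean distance between two vectors
centroid=: +/ % #            NB. mean of the rows of a table

pts=: 3 3 $ 1 0 0  0 1 0  0 0 1

1 0 0 dist"1 pts             NB. one vector against each row: 0 1.41421 1.41421
pts dist"1/ pts              NB. 3 by 3 table of pairwise distances between rows
centroid pts                 NB. 0.333333 0.333333 0.333333
pts dist"1 (centroid pts)    NB. each row against the centroid: 0.816497 0.816497 0.816497
```

Without the "1, the subtraction would pair each atom of the left vector
with a whole row of the table, which runs without error but does not
compute row-wise distances.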

But I'm a little dubious about the usefulness of Jaccard similarity
here, because of the assumptions it brings to bear (you're basically
encoding sets as vectors, which means your multidimensional vector
space is just a way of encoding a single unordered dimension).
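
To make that concern concrete, here is a small sketch. The names jaccard
and jaccardMask are hypothetical, and the mask version is only one
assumed alternative reading of "sets as vectors", not something taken
from the original question:

```j
NB. Definitions repeated from the message so this runs on its own.
union=: ~.@,
intersect=: [ ~.@:-. -.
jaccard=: intersect %&# union      NB. hypothetical name for the phrase used above

'abc' jaccard 'bcd'                NB. 0.5: two shared items out of four distinct
1 0 0 jaccard 0 1 0                NB. 1: both vectors contain just the items 0 and 1

NB. Assumed alternative: treat each 0/1 vector as a membership mask.
jaccardMask=: +/@:*. % +/@:+.      NB. hypothetical name; count of AND over count of OR
1 0 0 jaccardMask 0 1 0            NB. 0: the masks share no positions
1 1 0 jaccardMask 0 1 1            NB. 0.333333: one shared position out of three
```

The first form ignores position entirely, which is why two orthogonal
vectors come out as identical sets. The mask form respects position but
only makes sense for boolean data.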

Anyways, I hope this helps,

-- 
Raul



On Tue, Feb 20, 2018 at 2:08 PM, Skip Cave <[email protected]> wrote:
> One of the hottest topics in data science today is the representation of
> data characteristics using large multi-dimensional arrays. Each datum is
> represented as a data point or multi-element vector in an array that can
> have hundreds of dimensions. In these arrays, each dimension represents a
> different attribute of the data.
>
> Much useful information can be gleaned by examining the similarity, or
> distance between vectors in the array. However, there are many different
> ways to measure the similarity of two or more vectors in a multidimensional
> space.
>
> Some common similarity/distance measures:
>
> 1. Euclidean distance <https://en.wikipedia.org/wiki/Euclidean_distance>:
> The length of the line between two data points
>
> 2. Manhattan distance <https://en.wikipedia.org/wiki/Taxicab_geometry>: Also
> known as Manhattan length, rectilinear distance, L1 distance or L1 norm,
> city block distance, Minkowski’s L1 distance, or taxi-cab metric.
>
> 3. Minkowski distance: <https://en.wikipedia.org/wiki/Minkowski_distance> a
> generalized metric form of Euclidean distance and Manhattan distance.
>
> 4. Cosine similarity: <https://en.wikipedia.org/wiki/Cosine_similarity> The
> cosine of the angle between two vectors. For vectors with non-negative
> components the cosine will be between 0 and 1, where 1 means alike and 0
> means not alike.
>
> 5. Jaccard similarity: <https://en.wikipedia.org/wiki/Jaccard_index> The
> cardinality of the intersection of the sample sets divided by the
> cardinality of their union.
>
> Each of these metrics is useful in specific data analysis situations.
>
> In many cases, one also wants to know the similarity between clusters of
> points, or a point and a cluster of points. In these cases, the centroid of
> a set of points is also a useful metric to have, which can then be used
> with the various distance/similarity measurements.
>
> Is there any essay or blog covering these common metrics using the J
> language? It would seem that J is perfectly suited for calculating these
> metrics, but I haven't been able to find anything much on this topic on the
> J software site. I thought I would ask on this forum, before I go off to
> see what my rather rudimentary J skills can come up with.
>
> Skip
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
