On Sun, Jul 13, 2014 at 7:19 AM, AnilKumar B <[email protected]> wrote:

> Is it numerical vectorization only for performance optimization? or is
> there any other reason.
>
> Does it make sense to apply clustering directly on actual records?
>

You can define distance measures on the original data, but in nearly every
case you can also define a numerical vectorization that lets those same
distance measures be computed on the vectorized form.  Distance measures
with complex forms that cannot be computed this way will, in many cases,
defeat clustering algorithms, since assumptions about the topological space
implied by the distance function are often baked into these algorithms.
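
As a rough illustration of the first point, here is a minimal sketch in
which a distance defined on raw records gives the same value when computed
on a vectorized form.  The choice of Jaccard distance, the toy records, and
the helper names (jaccard_sets, vectorize, jaccard_vectors) are all
assumptions made for the example, not anything specific to a library.

```python
def jaccard_sets(a, b):
    """Jaccard distance computed directly on the raw records (as sets)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def vectorize(record, vocabulary):
    """Binary indicator vector: 1 if the term occurs in the record."""
    present = set(record)
    return [1 if term in present else 0 for term in vocabulary]


def jaccard_vectors(u, v):
    """The same Jaccard distance, computed on binary vectors."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    if either == 0:
        return 0.0
    return 1.0 - both / either


# Hypothetical records, used only for illustration.
r1 = ["red", "small", "round"]
r2 = ["red", "large", "round"]
vocab = sorted(set(r1) | set(r2))

d_raw = jaccard_sets(r1, r2)
d_vec = jaccard_vectors(vectorize(r1, vocab), vectorize(r2, vocab))
print(d_raw, d_vec)
```

Both calls count the same shared and total terms, so the two distances
agree; the vectorized form is simply easier for a clustering
implementation to work with.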

A good example of this is the triangle inequality.  Elkan's optimization
for k-means relies on it and can improve clustering speed by as much as 10x
in some cases, but if your distance doesn't satisfy the inequality, then
the optimization becomes incorrect.
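
To make the failure mode concrete, here is a small sketch of the kind of
pruning test that the triangle inequality justifies: if d(c1, c2) >= 2 *
d(x, c1), then c2 cannot be closer to x than c1, so d(x, c2) need not be
computed.  This is a simplified stand-in for one of the bounds used in
Elkan's method, not a full implementation.  The use of squared Euclidean
distance as the offending measure, the points, and the function names are
assumptions chosen for the example.

```python
import math


def euclidean(p, q):
    """Standard Euclidean distance; satisfies the triangle inequality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    """Squared Euclidean distance; violates the triangle inequality."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def can_skip(dist, x, assigned, other):
    """Pruning test: skip `other` if d(assigned, other) >= 2 * d(x, assigned).

    This is only safe when `dist` obeys the triangle inequality.
    """
    return dist(assigned, other) >= 2 * dist(x, assigned)


# Hypothetical one-dimensional point and two cluster centers.
x = (0.0,)
assigned = (2.0,)
other = (-1.5,)

for dist in (euclidean, squared_euclidean):
    skipped = can_skip(dist, x, assigned, other)
    other_is_closer = dist(x, other) < dist(x, assigned)
    print(dist.__name__, "skip:", skipped, "other is closer:", other_is_closer)
```

With Euclidean distance the test declines to skip the second center, which
really is closer.  With squared Euclidean distance the test says to skip
it even though it is closer, so the point would stay assigned to the wrong
center.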

On the other hand, it is easy to guarantee that any distance that is
computed by first vectorizing and then using a standard distance works
correctly.
