Thanks for the clarification, Ted.

Thanks & Regards,
B Anil Kumar.

On Mon, Jul 14, 2014 at 3:47 AM, Ted Dunning <[email protected]> wrote:

> On Sun, Jul 13, 2014 at 7:19 AM, AnilKumar B <[email protected]>
> wrote:
>
> > Is numerical vectorization only for performance optimization, or is
> > there any other reason?
> >
> > Does it make sense to apply clustering directly on actual records?
> >
>
> You can define distance measures on the original data, but you can
> pretty much always define numerical vectorizations which allow those
> same distance measures to be calculated on the vectorized form. Distance
> measures which have complex forms that are not computable in this way
> will, in many cases, defeat clustering algorithms, since assumptions about
> the topological space implied by the distance function are often baked into
> these algorithms.
>
> A good example of this is the triangle inequality. Using Elkan's
> optimization can improve clustering speed by as much as 10x in some cases,
> but if your distance doesn't satisfy the triangle inequality, then the
> optimization becomes incorrect.
>
> On the other hand, it is easy to guarantee that any distance that is
> computed by first vectorizing and then using a standard distance works
> correctly.
>
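
To make the point above concrete, here is a minimal sketch in plain Python (not Mahout code). The records, field names, and helper functions are all hypothetical, and one-hot encoding is just one assumed way to vectorize categorical data. It vectorizes a few records, checks by brute force that Euclidean distance on the vectors satisfies the triangle inequality, and shows that squared Euclidean distance does not, which is the kind of distance that would make triangle-inequality-based pruning such as Elkan's unsafe.

```python
import itertools
import math

# Hypothetical categorical records; field names and values are made up.
records = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "L"},
    {"color": "green", "size": "M"},
]


def one_hot_vectorize(rows):
    """Map each record to a 0/1 vector with one slot per (field, value) pair."""
    slots = sorted({(k, v) for row in rows for k, v in row.items()})
    index = {slot: i for i, slot in enumerate(slots)}
    vectors = []
    for row in rows:
        vec = [0.0] * len(slots)
        for item in row.items():
            vec[index[item]] = 1.0
        vectors.append(vec)
    return vectors


def euclidean(a, b):
    """Standard Euclidean distance, which is a true metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    """Squared Euclidean distance; it does not obey the triangle inequality."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def satisfies_triangle_inequality(points, dist, tol=1e-12):
    """Brute-force check of d(a, c) <= d(a, b) + d(b, c) over all triples."""
    return all(
        dist(a, c) <= dist(a, b) + dist(b, c) + tol
        for a, b, c in itertools.permutations(points, 3)
    )


vectors = one_hot_vectorize(records)
euclid_ok = satisfies_triangle_inequality(vectors, euclidean)

# Three collinear points expose the violation: d(0, 2) = 4 > 1 + 1.
line = [[0.0], [1.0], [2.0]]
squared_ok = satisfies_triangle_inequality(line, squared_euclidean)

print("Euclidean on one-hot vectors obeys triangle inequality:", euclid_ok)
print("Squared Euclidean obeys triangle inequality:", squared_ok)
```

The brute-force check is cubic in the number of points, so it is only meant as a sanity test on a small sample, not as something to run on a full data set.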
