Hi Shashi,

Please note that we had a CosineDistanceMeasure patch submission in Jira and that I committed it to trunk yesterday. I suspect that may give you better results than EuclideanDistanceMeasure. Please let us know if that is the case.

Jeff

Shashikant Kore wrote:
I get your point. Thank you.

I am using Euclidean distance.

--shashi

On Thu, May 14, 2009 at 1:51 AM, Jeff Eastman
<[email protected]> wrote:
I think the "optimum" value for these parameters is pretty subjective. You
may find some estimation procedures that will sometimes give you values you
like, but canopy will put every point into a cluster, so the number of
clusters is very sensitive to these values. I don't think normalizing your
vectors will help, since you need to normalize all vectors in your corpus by
the same amount. You might then find t1 and t2 values always on 0..1 but the
number of clusters will still be sensitive to your choices on this range and
you will be dealing with decimal values.
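
To make the sensitivity to t1 and t2 concrete, here is a minimal single-pass canopy sketch in Python. It is an illustration only, not Mahout's implementation: the function names, the toy points, and the threshold pairs are all made up for this example, and it assumes the usual rule that a point within t1 of a canopy center joins that canopy, while a point within t2 of any center does not start a new one.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clusters(points, t1, t2, distance=euclidean):
    """Hypothetical helper: return canopies as (center, members) pairs.

    t1 is the loose threshold, t2 the tight one, so t1 must exceed t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                members.append(p)      # loosely covered by this canopy
            if d < t2:
                strongly_bound = True  # too close to seed a new canopy
        if not strongly_bound:
            canopies.append((p, [p]))  # p becomes a new canopy center
    return canopies


# Toy data (an assumption for illustration): six points on a line.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0),
          (5.0, 0.0), (5.5, 0.0), (10.0, 0.0)]

for t1, t2 in [(12.0, 11.0), (3.0, 1.5), (0.8, 0.4)]:
    n = len(canopy_clusters(points, t1, t2))
    print(f"t1={t1}, t2={t2}: {n} canopies")
```

On this toy data the same six points yield 1, 3, and 6 canopies for the three threshold pairs, and every point ends up in at least one canopy in each case.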

It really depends upon how "similar" the documents in your corpus are and
how fine a distinction you want to draw between documents before declaring
them "different". What kind of distance measure are you using? A cosine
distance measure will always give you distances on 0..1.
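
As a rough illustration of the 0..1 range, the sketch below computes cosine distance as 1 minus cosine similarity. This is an assumed textbook definition written in Python for clarity, not the code of the CosineDistanceMeasure patch. Note that the 0..1 bound relies on the vectors having no negative components (for example term counts); with mixed signs the value can reach 2.

```python
import math


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Same direction, different lengths: distance is (numerically) zero.
print(cosine_distance((1.0, 2.0, 3.0), (10.0, 20.0, 30.0)))

# No shared non-zero components: distance is one.
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

Because the measure depends only on the angle between vectors, rescaling a document vector does not change its distance to any other vector.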

Jeff



