Hi Ted,

Looks like I misread your original post (which had an error). I must confess that I did not fully follow your clarification. Nevertheless, it seems to have helped resolve the problem.
I normalized the input vectors using the L2 norm. (Does that statement sound right?) That is, each term weight was divided by the square root of the sum of squares of the weights. There is no change in the centroid calculation: the centroid remains the average weight (sum of weights / numPoints).

I ran Canopy followed by K-means clustering to get results, and the results look good now. The weights in the centroid vary between 1e-3 and 1e-7, so that is as expected.

Choosing thresholds for canopy generation looks to be tricky. T1, T2 values of 1.3 and 0.9 produce 145 canopies. Changing these values to 1.4 and 1.0 results in a single canopy.

From this issue, it seems the input vectors should be L1/L2 normalized. Is it a good idea to always normalize the input document vectors? If yes, can we make appropriate changes to JIRA 126 (create document vectors from text)?

--shashi

On Sat, May 30, 2009 at 1:00 AM, Ted Dunning <[email protected]> wrote:

> On Thu, May 28, 2009 at 10:56 PM, Shashikant Kore <[email protected]> wrote:
>
>> I tried L1 and L2 norms. The centroid definitely looks better, but the
>> values are still close to zero.
>
> How close is that? 1e-3? (that I would expect) or 1e-300? (that would be
> wrong)
>
>> Please let me know if my understanding of L1, L2 norms is correct as
>> shown with the code below.
>
> You understood what I said, but I said the wrong thing. See my (oops)
> posting a few messages back.
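
For reference, here is a minimal sketch of the L2 normalization and centroid averaging described in this message. It is written in Python for brevity and is not the actual Mahout code: the function names, the dict-of-term-weights representation, and the two sample documents are all illustrative assumptions.

```python
import math


def l2_normalize(weights):
    """Divide each term weight by the square root of the sum of squared weights."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        # An all-zero vector has no direction; return an unchanged copy.
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


def centroid(vectors):
    """Average weight per term: sum of weights / numPoints."""
    num_points = len(vectors)
    totals = {}
    for vec in vectors:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + w
    return {term: total / num_points for term, total in totals.items()}


# Hypothetical sparse document vectors (term -> weight), for illustration only.
docs = [
    {"cluster": 3.0, "canopy": 4.0},
    {"cluster": 1.0, "vector": 1.0},
]

normalized = [l2_normalize(d) for d in docs]
center = centroid(normalized)
```

Each normalized vector has unit length, but the centroid generally does not, since it is a plain average. With many documents spread over a large vocabulary, individual centroid weights should come out small, which would be consistent with the 1e-3 to 1e-7 range reported above.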
