You know what's even more odd? When I used Mahout's KMeans, everything was
assigned to one single cluster with mean distance 64.


On Thu, Mar 28, 2013 at 11:07 AM, Ted Dunning <[email protected]> wrote:

> Hmm... looking at these outputs, it looks like the big cluster is really
> tight ... much tighter than cluster 3 or 4.  That is very odd.
>
> On Thu, Mar 28, 2013 at 10:01 AM, Dan Filimon
> <[email protected]> wrote:
>
> > [Yes, it should be on the dev list. I got confused.]
> >
> > The thing is, it's happening when using just 1 mapper. The hypercube
> > tests indicate that the 3 versions of StreamingKMeans produce about
> > the same results.
> > I haven't tested them on the _unprojected_ vectors though.
> >
> > Average distance in cluster 0 [18773]: 68.237385
> > Average distance in cluster 1 [2]: 5.973227
> > Average distance in cluster 2 [1]: 0.000000
> > Average distance in cluster 3 [4]: 279.200390
> > Average distance in cluster 4 [5]: 394.101672
> > Average distance in cluster 5 [4]: 227.845612
> > Average distance in cluster 6 [1]: 0.000000
> > Average distance in cluster 7 [2]: 28.779806
> > Average distance in cluster 8 [1]: 0.000000
> > Average distance in cluster 9 [2]: 215.254876
> > Average distance in cluster 10 [3]: 128.501163
> > Average distance in cluster 11 [8]: 534.401649
> > Average distance in cluster 12 [1]: 0.000000
> > Average distance in cluster 13 [5]: 405.115140
> > Average distance in cluster 14 [1]: 0.000000
> > Average distance in cluster 15 [9]: 215.797289
> > Average distance in cluster 16 [1]: 0.000000
> > Average distance in cluster 17 [2]: 123.065677
> > Average distance in cluster 18 [1]: 0.000000
> > Average distance in cluster 19 [2]: 98.733778
> > Num clusters: 20; maxDistance: 762.326896
> >
> > On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <[email protected]>
> > wrote:
> > > I will have to think on this a bit.
> > >
> > > It should be possible to dump the sketches coming from each mapper and
> > > look at them for compatibility.
> > >
> > > Are the mappers seeing only docs from a single news group?  That might
> > > produce some interesting and odd results.
> > >
> > > What happens with the sequential version when you specify as many
> > > threads as you have mappers in the MR version?
> > >
> > > Also, shouldn't this be on the dev list?
> > >
> > > On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon
> > > <[email protected]> wrote:
> > >
> > >> So no, apparently the problem's still there. With the most recent code,
> > >> I get:
> > >>
> > >> Average distance in cluster 0 [1]: 0.000000
> > >> Average distance in cluster 1 [18775]: 63.839819
> > >> Average distance in cluster 2 [11]: 448.706077
> > >> Average distance in cluster 3 [1]: 0.000000
> > >> Average distance in cluster 4 [8]: 213.629578
> > >> Average distance in cluster 5 [1]: 0.000000
> > >> Average distance in cluster 6 [10]: 369.592682
> > >> Average distance in cluster 7 [1]: 0.000000
> > >> Average distance in cluster 8 [2]: 31.061103
> > >> Average distance in cluster 9 [1]: 0.000000
> > >> Average distance in cluster 10 [2]: 309.934857
> > >> Average distance in cluster 11 [1]: 0.000000
> > >> Average distance in cluster 12 [1]: 0.000000
> > >> Average distance in cluster 13 [1]: 0.000000
> > >> Average distance in cluster 14 [1]: 0.000000
> > >> Average distance in cluster 15 [4]: 229.180504
> > >> Average distance in cluster 16 [1]: 0.000000
> > >> Average distance in cluster 17 [3]: 336.835246
> > >> Average distance in cluster 18 [2]: 76.485594
> > >> Average distance in cluster 19 [1]: 0.000000
> > >> Num clusters: 20; maxDistance: 724.060033
> > >>
> > >> I'll have to recheck. :/
> > >>
> > >> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <[email protected]>
> > >> wrote:
> > >> > Hot damn!
> > >> >
> > >> > Well spotted.
> > >> >
> > >> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
> > >> > <[email protected]> wrote:
> > >> >
> > >> >> Ted, remember we talked about this last week?
> > >> >>
> > >> >> The problem was (I think it's fixed now) that when I was asking for 20
> > >> >> clusters, every mapper would give me 20 clusters (rather than k log n
> > >> >> ~ 200) and the points clumped together, resulting in one cluster with
> > >> >> the vast majority of the points, ~17K out of the ~19K.
> > >> >>
> > >> >> Now that I've fixed that, added more tests that seem to confirm all
> > >> >> StreamingKMeans implementations get about the same results (whether
> > >> >> they're local or MapReduce), and with the multiple restarts of
> > >> >> BallKMeans, I'm expecting it to be a lot better.
> > >> >>
> > >> >> Actual data tests coming soon (please check that new cluster thread). :)
> > >> >>
> > >>
> >
>
