Regarding the latest clustering with CosineDistance: how can the _mean_
distance be larger than (or even close to) 1 if cos is in [-1, 1]? ...
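
For reference, here is a minimal sketch of the arithmetic, assuming the
distance is defined as 1 - cos (the usual convention; whether CosineDistance
computes exactly this is an assumption, and the helper name below is made up
for illustration). Under that definition the range is [0, 2] rather than
[0, 1], and a value above 1 means the cosine is negative, which can only
happen when the vectors have some negative components:

```python
import math


def cosine_distance(a, b):
    """Cosine distance as 1 - cos(a, b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same direction: cos = 1, so the distance is (numerically close to) 0.
same = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Orthogonal vectors, e.g. documents with no terms in common: cos = 0,
# so the distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

# Opposite direction: cos = -1, so the distance is (close to) 2.
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])

print(same, orthogonal, opposite)
```

With purely non-negative term weights the cosine cannot be negative, so under
this definition a mean of 1.017 would suggest that at least one side of the
comparison (points or centroids) has negative components.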


On Thu, Mar 28, 2013 at 10:29 PM, Dan Filimon
<[email protected]> wrote:

> And I'll add that re-vectorizing the documents with my vectorizer yields
> essentially the same results (this is CosineDistance though):
>
> Average distance in cluster 0 [6]: 0.844053
> Average distance in cluster 1 [1047]: 0.988517
> Average distance in cluster 2 [26]: 0.889580
> Average distance in cluster 3 [19]: 0.922804
> Average distance in cluster 4 [2]: 0.414935
> Average distance in cluster 5 [9]: 0.777650
> Average distance in cluster 6 [4]: 0.791443
> Average distance in cluster 7 [17432]: 1.017289
> Average distance in cluster 8 [20]: 0.917523
> Average distance in cluster 9 [4]: 0.744159
> Average distance in cluster 10 [2]: 0.340740
> Average distance in cluster 11 [3]: 0.614734
> Average distance in cluster 12 [2]: 0.624274
> Average distance in cluster 13 [62]: 0.922437
> Average distance in cluster 14 [2]: 0.324862
> Average distance in cluster 15 [1]: 0.000000
> Average distance in cluster 16 [94]: 0.917509
> Average distance in cluster 17 [103]: 0.944392
> Average distance in cluster 18 [7]: 0.795449
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 1.029701
>
>
> On Thu, Mar 28, 2013 at 6:45 PM, Dan Filimon
> <[email protected]> wrote:
>
>> You know what's even more odd? When I used Mahout's KMeans, everything
>> was assigned to one single cluster with mean distance 64.
>>
>>
>> On Thu, Mar 28, 2013 at 11:07 AM, Ted Dunning <[email protected]> wrote:
>>
>>> Hmm... looking at these outputs, it looks like the big cluster is really
>>> tight ... much tighter than cluster 3 or 4.  That is very odd.
>>>
>>> On Thu, Mar 28, 2013 at 10:01 AM, Dan Filimon
>>> <[email protected]> wrote:
>>>
>>> > [Yes, it should be on the dev list. I got confused.]
>>> >
>>> > The thing is, it's happening when using just 1 mapper. The hypercube
>>> > tests indicate that the 3 versions of StreamingKMeans produce about
>>> > the same results.
>>> > I haven't tested them on the _unprojected_ vectors though.
>>> >
>>> > Average distance in cluster 0 [18773]: 68.237385
>>> > Average distance in cluster 1 [2]: 5.973227
>>> > Average distance in cluster 2 [1]: 0.000000
>>> > Average distance in cluster 3 [4]: 279.200390
>>> > Average distance in cluster 4 [5]: 394.101672
>>> > Average distance in cluster 5 [4]: 227.845612
>>> > Average distance in cluster 6 [1]: 0.000000
>>> > Average distance in cluster 7 [2]: 28.779806
>>> > Average distance in cluster 8 [1]: 0.000000
>>> > Average distance in cluster 9 [2]: 215.254876
>>> > Average distance in cluster 10 [3]: 128.501163
>>> > Average distance in cluster 11 [8]: 534.401649
>>> > Average distance in cluster 12 [1]: 0.000000
>>> > Average distance in cluster 13 [5]: 405.115140
>>> > Average distance in cluster 14 [1]: 0.000000
>>> > Average distance in cluster 15 [9]: 215.797289
>>> > Average distance in cluster 16 [1]: 0.000000
>>> > Average distance in cluster 17 [2]: 123.065677
>>> > Average distance in cluster 18 [1]: 0.000000
>>> > Average distance in cluster 19 [2]: 98.733778
>>> > Num clusters: 20; maxDistance: 762.326896
>>> >
>>> > On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <[email protected]>
>>> > wrote:
>>> > > I will have to think on this a bit.
>>> > >
>>> > > It should be possible to dump the sketches coming from each mapper
>>> > > and look at them for compatibility.
>>> > >
>>> > > Are the mappers seeing only docs from a single news group?  That
>>> > > might produce some interesting and odd results.
>>> > >
>>> > > What happens with the sequential version when you specify as many
>>> > > threads as you have mappers in the MR version?
>>> > >
>>> > > Also, shouldn't this be on the dev list?
>>> > >
>>> > > On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon <[email protected]> wrote:
>>> > >
>>> > >> So no, apparently the problem's still there. With the most recent
>>> > >> code, I get:
>>> > >>
>>> > >> Average distance in cluster 0 [1]: 0.000000
>>> > >> Average distance in cluster 1 [18775]: 63.839819
>>> > >> Average distance in cluster 2 [11]: 448.706077
>>> > >> Average distance in cluster 3 [1]: 0.000000
>>> > >> Average distance in cluster 4 [8]: 213.629578
>>> > >> Average distance in cluster 5 [1]: 0.000000
>>> > >> Average distance in cluster 6 [10]: 369.592682
>>> > >> Average distance in cluster 7 [1]: 0.000000
>>> > >> Average distance in cluster 8 [2]: 31.061103
>>> > >> Average distance in cluster 9 [1]: 0.000000
>>> > >> Average distance in cluster 10 [2]: 309.934857
>>> > >> Average distance in cluster 11 [1]: 0.000000
>>> > >> Average distance in cluster 12 [1]: 0.000000
>>> > >> Average distance in cluster 13 [1]: 0.000000
>>> > >> Average distance in cluster 14 [1]: 0.000000
>>> > >> Average distance in cluster 15 [4]: 229.180504
>>> > >> Average distance in cluster 16 [1]: 0.000000
>>> > >> Average distance in cluster 17 [3]: 336.835246
>>> > >> Average distance in cluster 18 [2]: 76.485594
>>> > >> Average distance in cluster 19 [1]: 0.000000
>>> > >> Num clusters: 20; maxDistance: 724.060033
>>> > >>
>>> > >> I'll have to recheck. :/
>>> > >>
>>> > >> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <[email protected]>
>>> > >> wrote:
>>> > >> > Hot damn!
>>> > >> >
>>> > >> > Well spotted.
>>> > >> >
>>> > >> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
>>> > >> > <[email protected]> wrote:
>>> > >> >
>>> > >> >> Ted, remember we talked about this last week?
>>> > >> >>
>>> > >> >> The problem was (I think it's fixed now) that when I was asking for
>>> > >> >> 20 clusters, every mapper would give me 20 clusters (rather than
>>> > >> >> k log n ~ 200) and the points clumped together, resulting in one
>>> > >> >> cluster with the vast majority of the points: ~17K out of the ~19K.
>>> > >> >>
>>> > >> >> Now that I fixed that and added more tests that seem to confirm
>>> > >> >> all StreamingKMeans implementations get about the same results
>>> > >> >> (whether they're local or MapReduce), and with the multiple
>>> > >> >> restarts of BallKMeans, I'm expecting it to be a lot better.
>>> > >> >>
>>> > >> >> Actual data tests coming soon (please check that new cluster
>>> > >> >> thread). :)
>>> > >> >>
>>> > >>
>>> >
>>>
>>
>>
>
