I'd like to implement the test described in this paper [1] and also explained in this presentation [2]. I went over the paper and I think I understand it well enough.
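As a rough illustration of the quantity involved (a sketch only, not the test from [1]): for a query point, compare the distance to its nearest point with the distance to its farthest one. The function names, the uniform random data, and the choice of Euclidean distance are all assumptions made for this example; none of it comes from the paper or from Mahout.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_ratio(query, points, distance=euclidean):
    """Return d_min / d_max over the distances from `query` to `points`.

    A ratio close to 1 means the nearest and farthest points are almost
    equally far away, so "nearest" carries little information.
    """
    distances = [distance(query, p) for p in points]
    d_max = max(distances)
    if d_max == 0.0:
        return 1.0  # every point coincides with the query
    return min(distances) / d_max


def random_points(n, dim, rng):
    """n points with independent uniform(0, 1) coordinates (illustrative data)."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    for dim in (2, 20, 200, 2000):
        points = random_points(200, dim, rng)
        query = [rng.random() for _ in range(dim)]
        print(dim, round(min_max_ratio(query, points), 3))
```

On independent uniform features like these, the ratio should drift towards 1 as the dimension grows; that is the regime in which distances stop being informative. Correlated features, which is what we expect in our data, may well behave differently.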
The main gist is that when dealing with high-dimensional data that has lots of uncorrelated features (which should totally not be the case for us!), distances become meaningless because the relative difference between the minimum distance and the maximum distance falls below some small constant. It's not really about this particular data set, but since I find it challenging to figure out whether distances are relevant or not, I feel that any help is welcome.

What do you think, Ted?

[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
[2] http://www.cs.bham.ac.uk/~axk/Dagstuhl.pdf

On Thu, Mar 28, 2013 at 10:29 PM, Dan Filimon <[email protected]> wrote:

> And I'll add that re-vectorizing the documents with my vectorizer yields
> essentially the same results (this is CosineDistance though):
>
> Average distance in cluster 0 [6]: 0.844053
> Average distance in cluster 1 [1047]: 0.988517
> Average distance in cluster 2 [26]: 0.889580
> Average distance in cluster 3 [19]: 0.922804
> Average distance in cluster 4 [2]: 0.414935
> Average distance in cluster 5 [9]: 0.777650
> Average distance in cluster 6 [4]: 0.791443
> Average distance in cluster 7 [17432]: 1.017289
> Average distance in cluster 8 [20]: 0.917523
> Average distance in cluster 9 [4]: 0.744159
> Average distance in cluster 10 [2]: 0.340740
> Average distance in cluster 11 [3]: 0.614734
> Average distance in cluster 12 [2]: 0.624274
> Average distance in cluster 13 [62]: 0.922437
> Average distance in cluster 14 [2]: 0.324862
> Average distance in cluster 15 [1]: 0.000000
> Average distance in cluster 16 [94]: 0.917509
> Average distance in cluster 17 [103]: 0.944392
> Average distance in cluster 18 [7]: 0.795449
> Average distance in cluster 19 [1]: 0.000000
> Num clusters: 20; maxDistance: 1.029701
>
>
> On Thu, Mar 28, 2013 at 6:45 PM, Dan Filimon
> <[email protected]> wrote:
>
>> You know what's even more odd? When I used Mahout's KMeans, everything
>> was assigned to one single cluster with mean distance 64.
>>
>>
>> On Thu, Mar 28, 2013 at 11:07 AM, Ted Dunning <[email protected]> wrote:
>>
>>> Hmm... looking at these outputs, it looks like the big cluster is really
>>> tight ... much tighter than cluster 3 or 4. That is very odd.
>>>
>>> On Thu, Mar 28, 2013 at 10:01 AM, Dan Filimon
>>> <[email protected]> wrote:
>>>
>>> > [Yes, it should be on the dev list. I got confused.]
>>> >
>>> > The thing is, it's happening when using just 1 mapper. The hypercube
>>> > tests indicate that the 3 versions of StreamingKMeans produce about
>>> > the same results.
>>> > I haven't tested them on the _unprojected_ vectors though.
>>> >
>>> > Average distance in cluster 0 [18773]: 68.237385
>>> > Average distance in cluster 1 [2]: 5.973227
>>> > Average distance in cluster 2 [1]: 0.000000
>>> > Average distance in cluster 3 [4]: 279.200390
>>> > Average distance in cluster 4 [5]: 394.101672
>>> > Average distance in cluster 5 [4]: 227.845612
>>> > Average distance in cluster 6 [1]: 0.000000
>>> > Average distance in cluster 7 [2]: 28.779806
>>> > Average distance in cluster 8 [1]: 0.000000
>>> > Average distance in cluster 9 [2]: 215.254876
>>> > Average distance in cluster 10 [3]: 128.501163
>>> > Average distance in cluster 11 [8]: 534.401649
>>> > Average distance in cluster 12 [1]: 0.000000
>>> > Average distance in cluster 13 [5]: 405.115140
>>> > Average distance in cluster 14 [1]: 0.000000
>>> > Average distance in cluster 15 [9]: 215.797289
>>> > Average distance in cluster 16 [1]: 0.000000
>>> > Average distance in cluster 17 [2]: 123.065677
>>> > Average distance in cluster 18 [1]: 0.000000
>>> > Average distance in cluster 19 [2]: 98.733778
>>> > Num clusters: 20; maxDistance: 762.326896
>>> >
>>> > On Thu, Mar 28, 2013 at 10:32 AM, Ted Dunning <[email protected]> wrote:
>>> > > I will have to think on this a bit.
>>> > >
>>> > > It should be possible to dump the sketches coming from each mapper and look
>>> > > at them for compatibility.
>>> > >
>>> > > Are the mappers seeing only docs from a single news group? That might
>>> > > produce some interesting and odd results.
>>> > >
>>> > > What happens with the sequential version when you specify as many threads
>>> > > as you have mappers in the MR version?
>>> > >
>>> > > Also, shouldn't this be on the dev list?
>>> > >
>>> > > On Thu, Mar 28, 2013 at 9:10 AM, Dan Filimon <[email protected]> wrote:
>>> > >
>>> > >> So no, apparently the problem's still there. With the most recent code, I
>>> > >> get:
>>> > >>
>>> > >> Average distance in cluster 0 [1]: 0.000000
>>> > >> Average distance in cluster 1 [18775]: 63.839819
>>> > >> Average distance in cluster 2 [11]: 448.706077
>>> > >> Average distance in cluster 3 [1]: 0.000000
>>> > >> Average distance in cluster 4 [8]: 213.629578
>>> > >> Average distance in cluster 5 [1]: 0.000000
>>> > >> Average distance in cluster 6 [10]: 369.592682
>>> > >> Average distance in cluster 7 [1]: 0.000000
>>> > >> Average distance in cluster 8 [2]: 31.061103
>>> > >> Average distance in cluster 9 [1]: 0.000000
>>> > >> Average distance in cluster 10 [2]: 309.934857
>>> > >> Average distance in cluster 11 [1]: 0.000000
>>> > >> Average distance in cluster 12 [1]: 0.000000
>>> > >> Average distance in cluster 13 [1]: 0.000000
>>> > >> Average distance in cluster 14 [1]: 0.000000
>>> > >> Average distance in cluster 15 [4]: 229.180504
>>> > >> Average distance in cluster 16 [1]: 0.000000
>>> > >> Average distance in cluster 17 [3]: 336.835246
>>> > >> Average distance in cluster 18 [2]: 76.485594
>>> > >> Average distance in cluster 19 [1]: 0.000000
>>> > >> Num clusters: 20; maxDistance: 724.060033
>>> > >>
>>> > >> I'll have to recheck. :/
>>> > >>
>>> > >> On Thu, Mar 28, 2013 at 2:25 AM, Ted Dunning <[email protected]>
>>> > >> wrote:
>>> > >> > Hot damn!
>>> > >> >
>>> > >> > Well spotted.
>>> > >> >
>>> > >> > On Thu, Mar 28, 2013 at 12:08 AM, Dan Filimon
>>> > >> > <[email protected]> wrote:
>>> > >> >
>>> > >> >> Ted, remember we talked about this last week?
>>> > >> >>
>>> > >> >> The problem was (I think it's fixed now) that when I was asking for 20
>>> > >> >> clusters, every mapper would give me 20 clusters (rather than k log n
>>> > >> >> ~ 200) and the points clumped together resulting in one cluster with
>>> > >> >> the vast majority of the points ~17K out of the ~19K.
>>> > >> >>
>>> > >> >> Now that I fixed that, added more tests that seem to be confirming all
>>> > >> >> StreamingKMeans implementations get about the same results (whether
>>> > >> >> they're local or MapReduce) and the multiple restarts of BallKMeans,
>>> > >> >> I'm expecting it to be a lot better.
>>> > >> >>
>>> > >> >> Actual data tests coming soon (please check that new cluster thread). :)
>>> > >> >>
>>> > >>
>>> >
>>>
>>
>>
>
