[ https://issues.apache.org/jira/browse/MAHOUT-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129338#comment-13129338 ]
Jeff Eastman commented on MAHOUT-766:
-------------------------------------
I can duplicate this issue; however, I am not convinced it is uncovering a
defect, for the following reasons:
- When I run clusterdump on all the clusters-x directories, I see that fuzzyk
is actually converging on the reported cluster definitions. The initial
clusters are significantly different, but after three iterations they have all
converged on k nearly identical clusters.
- Since fuzzyk assigns each point to every cluster with a weight inversely
related to its distance from that cluster, it is to be expected that clusters
would tend to overlap, at least. You can see this by running
DisplayFuzzyKmeans: with m=2 the clusters overlap significantly, while with
m=1.1 they are much more disjoint and look, as advertised, more like kmeans.
This seems reasonable to me.
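To make the effect of m concrete, here is a minimal sketch of the textbook
fuzzy c-means membership formula for a single point. It is an illustration
only: the function name and the sample distances are made up, and I have not
checked that Mahout's implementation computes the weights in exactly this way.

```python
def memberships(distances, m):
    """Textbook fuzzy c-means membership of one point in each cluster.

    distances: positive distances from the point to each cluster center.
    m: fuzziness exponent, must be greater than 1.
    (Hypothetical helper for illustration, not Mahout code.)
    """
    if m <= 1:
        raise ValueError("m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    # u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)); the weights sum to 1.
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in distances)
        for d_j in distances
    ]


# A point at distance 1 from one center and distance 2 from two others.
soft = memberships([1.0, 2.0, 2.0], m=2.0)
crisp = memberships([1.0, 2.0, 2.0], m=1.1)
print([round(u, 3) for u in soft])
print([round(u, 3) for u in crisp])
```

With m=2 the nearest center gets about two thirds of the weight and the rest
is shared by the others; with m=1.1 almost all of the weight goes to the
nearest center, which is the kmeans-like behaviour described above.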
I don't have a lot of intuition about how fuzzyk should behave with a text
clustering problem like reuters. Pallavi and Grant have their fingerprints on
the MAHOUT-74 issue which created this implementation, but a lot of others,
including me, have been in the code. Is this a defect or just a consequence of
this algorithm running with these arguments on this data?
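One way to frame the question: in the textbook algorithm, each centroid is the
mean of all points weighted by membership raised to the power m, so clusters
whose memberships are close to uniform end up with nearly the same centroid.
The sketch below shows only that arithmetic. The points and membership values
are made up, and nothing here establishes that this is what happens on the
reuters data or inside Mahout.

```python
def update_centroid(points, weights, m):
    """Textbook fuzzy c-means centroid: mean of points weighted by u^m.

    points: list of equal-length coordinate lists.
    weights: membership of each point in this one cluster.
    (Hypothetical helper for illustration, not Mahout code.)
    """
    powered = [u ** m for u in weights]
    total = sum(powered)
    dims = len(points[0])
    return [
        sum(w * p[d] for w, p in zip(powered, points)) / total
        for d in range(dims)
    ]


points = [[0.0, 0.0], [4.0, 0.0], [0.0, 2.0], [4.0, 2.0]]
# Two clusters whose (made-up) memberships differ only slightly from uniform.
a = update_centroid(points, [0.26, 0.25, 0.25, 0.24], m=2.0)
b = update_centroid(points, [0.24, 0.25, 0.25, 0.26], m=2.0)
# Exactly uniform memberships give the plain mean of the points.
uniform = update_centroid(points, [0.25] * 4, m=2.0)
print(a, b, uniform)
```

Both a and b land close to the plain mean of the four points, so their top
coordinates would look almost the same even though they are distinct clusters.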
> fuzzy kmeans - all cluster with the same top terms
> ---------------------------------------------------
>
> Key: MAHOUT-766
> URL: https://issues.apache.org/jira/browse/MAHOUT-766
> Project: Mahout
> Issue Type: Bug
> Components: Clustering, Examples
> Affects Versions: 0.6
> Environment: tested in OSX and linux
> Reporter: Paulo Magalhaes
>
> I believe there is something wrong with fkmeans in trunk.
> I am using code from trunk (last checkout 6/30/11). Reproducing it is very
> simple:
> 1) change examples/bin/build-reuters.sh to use fkmeans and set -m 2
> 2) run build-reuters.sh
> 3) Dump the clusters. I'm doing:
>      ../../bin/mahout clusterdump -dt sequencefile \
>        -s ./mahout-work/reuters-kmeans/clusters-6 -b 100 \
>        -o ./reuters-clusterdump.txt \
>        -d ./mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0
> Here is what the clusters look like:
> SV-15898{n=34 c=[0:0.020, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.7254762602900604
> mln => 1.2510936664951733
> dlrs => 1.1340145215097008
> 3 => 1.0643797240793276
> pct => 1.0422760712239152
> reuter => 1.0202689935247569
> its => 0.9997771992646881
> from => 0.9903731234557381
> year => 0.8855389859684145
> vs => 0.8291746545786391
> :SV-14766{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6406710289350412
> mln => 1.2174993414858022
> dlrs => 1.0937941570322955
> 3 => 1.0334420773050856
> pct => 0.991539915235039
> reuter => 0.990042452019326
> its => 0.9508638527143669
> from => 0.9403885495991262
> vs => 0.865437130369746
> year => 0.8463503194752994
> :SV-14854{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641260962665307
> mln => 1.217806578134094
> dlrs => 1.0941157210136143
> 3 => 1.0336934328877394
> pct => 0.991895013999163
> reuter => 0.9902889592990656
> its => 0.9512076670014483
> from => 0.9407384847445094
> vs => 0.8653426311034671
> year => 0.8466407590692175
> :SV-14890{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6410352907185948
> mln => 1.21769021136256
> dlrs => 1.0939933408434481
> 3 => 1.0335977297579235
> pct => 0.991759193577722
> reuter => 0.9901951250301172
> its => 0.9510761761632947
> from => 0.9406047832581563
> vs => 0.8653814488835572
> year => 0.8465301083353372
> :SV-14972{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.640981249652196
> mln => 1.2176595452829564
> dlrs => 1.093962519439548
> 3 => 1.0335737897463568
> pct => 0.9917266257955816
> reuter => 0.9901715950801396
> its => 0.9510446208123859
> from => 0.9405723357372776
> vs => 0.8653843699725567
> year => 0.846502466267153
> :SV-15023{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6399319888551425
> mln => 1.217099157115808
> dlrs => 1.0933830369192543
> 3 => 1.033121271434882
> pct => 0.991094828319561
> reuter => 0.9897275313905611
> its => 0.9504327303592046
> from => 0.9399480272494183
> vs => 0.8655203514280634
> year => 0.8459804922897428
> :SV-15330{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6411480082558068
> mln => 1.217746071140758
> dlrs => 1.0940532425506244
> 3 => 1.0336447143638317
> pct => 0.9918269975797083
> reuter => 0.990241145450359
> its => 0.9511417993006985
> from => 0.9406712099799636
> vs => 0.8653569180999117
> year => 0.8465844425179013
> :SV-15403{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6493270418577013
> mln => 1.221708475489808
> dlrs => 1.0983489300320377
> 3 => 1.0370024996153944
> pct => 0.9967446058994232
> reuter => 0.993528974793619
> its => 0.9558988111209523
> from => 0.9454911460774864
> vs => 0.8633642497287671
> year => 0.8505083085439775
> :SV-15514{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6414524586689534
> mln => 1.2179029815366167
> dlrs => 1.094218299808865
> 3 => 1.033773769117182
> pct => 0.9920102286561391
> reuter => 0.9903676795676004
> its => 0.9513191861395162
> from => 0.9408515920762511
> vs => 0.865304353452142
> year => 0.8467337135094862
> :SV-15549{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.640632892454694
> mln => 1.2174764812983898
> dlrs => 1.0937717467869699
> 3 => 1.033424727632325
> pct => 0.99151691360307
> reuter => 0.9900253758026865
> its => 0.9508415534060888
> from => 0.9403654699584985
> vs => 0.865436402399392
> year => 0.8463303217162843
> :SV-15616{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6402745961421197
> mln => 1.217287104215781
> dlrs => 1.0935749393200054
> 3 => 1.0332709291683844
> pct => 0.9913012005612369
> reuter => 0.9898744911012118
> its => 0.9506326562835085
> from => 0.9401525895225771
> vs => 0.8654873596392523
> year => 0.8461528918952358
> :SV-15674{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6402335213893247
> mln => 1.2172651791725515
> dlrs => 1.0935522610806727
> 3 => 1.0332532137000938
> pct => 0.991276468108388
> reuter => 0.9898571070574692
> its => 0.9506087026962596
> from => 0.9401281555632803
> vs => 0.8654927058873914
> year => 0.8461324681573653
> :SV-15720{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641454220566282
> mln => 1.2179063418879368
> dlrs => 1.0942205822099829
> 3 => 1.0337754035575257
> pct => 0.9920113271819195
> reuter => 0.9903693325123661
> its => 0.9513202705619623
> from => 0.9408530174807668
> vs => 0.8653096216062077
> year => 0.8467355860669477
> :SV-15732{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6418679366988789
> mln => 1.218118262616823
> dlrs => 1.0944441677361394
> 3 => 1.0339502052648608
> pct => 0.9922602967957669
> reuter => 0.9905406967751569
> its => 0.9515612774046113
> from => 0.941098001639954
> vs => 0.865235154416334
> year => 0.8469379811534101
> :SV-15825{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6403540331112847
> mln => 1.2173302824011656
> dlrs => 1.0936192179118565
> 3 => 1.0333054698476525
> pct => 0.9913490440255205
> reuter => 0.9899084014354236
> its => 0.9506790000021428
> from => 0.9401999656754023
> vs => 0.8654787849286104
> year => 0.8461927112339609
> :SV-15888{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641852069569193
> mln => 1.218106579705691
> dlrs => 1.0944336674208315
> 3 => 1.0339422184421034
> pct => 0.9922506923700831
> reuter => 0.9905327937543529
> its => 0.951551949990525
> from => 0.9410880514065464
> vs => 0.8652299423273659
> year => 0.8469287549740471
> :SV-15944{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6406094746503062
> mln => 1.2174640910103491
> dlrs => 1.0937588768380255
> 3 => 1.0334146735611798
> pct => 0.9915028147402405
> reuter => 0.9900155118531778
> its => 0.9508279001565995
> from => 0.9403515526055797
> vs => 0.865439705916966
> year => 0.846318717539638
> :SV-15952{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.641608350634413
> mln => 1.2179827157677379
> dlrs => 1.094302484756082
> 3 => 1.033839606583586
> pct => 0.9921040410110572
> reuter => 0.990432219413613
> its => 0.9514099986904929
> from => 0.9409438763575203
> vs => 0.8652760331837802
> year => 0.8468099163160301
> :SV-15954{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6429205353451672
> mln => 1.2186434984636658
> dlrs => 1.0950054459143779
> 3 => 1.0343894404834142
> pct => 0.992893505149969
> reuter => 0.9909710261706427
> its => 0.9521740690117075
> from => 0.9417194634871013
> vs => 0.8650137662755684
> year => 0.8474476266423354
> :SV-16007{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.6401767760282457
> mln => 1.2172339691485916
> dlrs => 1.093520432998812
> 3 => 1.0332284013507513
> pct => 0.9912422858233993
> reuter => 0.9898327402827573
> its => 0.9505755879363272
> from => 0.9400942591120444
> vs => 0.8654979916098049
> year => 0.8461038772989482
> :SV-16037{n=36 c=[0:0.019, 0.003:0.001, 0.006913:0.001, 0.01:0.004,
> 0.02:0.002, 0.03:0.001, 0.046:0.0
> Top Terms:
> said => 1.640610618380475
> mln => 1.2174645746382695
> dlrs => 1.0937594396319776
> 3 => 1.0334151203058977
> pct => 0.9915035014016228
> reuter => 0.9900159476830741
> its => 0.9508285640147016
> from => 0.9403522136131415
> vs => 0.8654392679742507
> year => 0.846319234572972
