[ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-11:
------------------------------

    Attachment: MAHOUT-11-RandomSeedGenerator.patch

Found the problem, which I believe is isolated to the case where k-means 
clustering uses random seed clusters as its starting point.

In RandomSeedGenerator, no cluster ids are assigned, so every generated cluster 
is written to the sequence file with an id of zero. When all cluster ids are 
zero, KmeansClusterer.outputPointWithClusterInfo winds up assigning all points 
to the same cluster. This issue was previously hidden because cluster ids were 
assigned in the Cluster(Vector) constructor. 
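
To illustrate the failure mode, here is a minimal Java sketch. It is not the 
Mahout code or the attached patch: SeedCluster, buildSeeds and countPointsPerId 
are hypothetical stand-ins, and points are assumed to be one-dimensional to 
keep it short. It only shows why default ids collapse every label into one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not the Mahout classes.
public class SeedIdSketch {

  /** Minimal cluster: a 1-D center plus an id that defaults to 0. */
  static final class SeedCluster {
    final double center;
    int id; // Java initializes int fields to 0 unless they are set explicitly

    SeedCluster(double center) {
      this.center = center;
    }
  }

  /** Builds one seed per center; ids are assigned only when assignIds is true. */
  static List<SeedCluster> buildSeeds(double[] centers, boolean assignIds) {
    List<SeedCluster> seeds = new ArrayList<>();
    int nextId = 0;
    for (double c : centers) {
      SeedCluster seed = new SeedCluster(c);
      if (assignIds) {
        seed.id = nextId++; // the kind of fix described above: give each seed its own id
      }
      seeds.add(seed);
    }
    return seeds;
  }

  /** Labels each point with the id of its nearest seed and counts points per id. */
  static Map<Integer, Integer> countPointsPerId(double[] points, List<SeedCluster> seeds) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (double p : points) {
      SeedCluster nearest = seeds.get(0);
      for (SeedCluster s : seeds) {
        if (Math.abs(p - s.center) < Math.abs(p - nearest.center)) {
          nearest = s;
        }
      }
      counts.merge(nearest.id, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    double[] centers = {0.0, 10.0};
    double[] points = {0.5, 1.0, 9.0, 11.0};
    // Without ids: both seeds report id 0, so every point is counted under 0.
    System.out.println(countPointsPerId(points, buildSeeds(centers, false)));
    // With ids: the points split between id 0 and id 1.
    System.out.println(countPointsPerId(points, buildSeeds(centers, true)));
  }
}
```

Note that the nearest-seed computation is correct in both runs; only the label 
written out is wrong, which is why the output looks like a single cluster.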

I've attached a small patch for RandomSeedGenerator. This should probably be 
accompanied by a unit test, but I haven't had the chance to put one together.



> Static fields used throughout clustering code (Canopy, K-Means).
> ----------------------------------------------------------------
>
>                 Key: MAHOUT-11
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-11
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Dawid Weiss
>             Fix For: 0.3
>
>         Attachments: MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
