[jira] [Commented] (MAHOUT-940) Clusterdumper - Get rid of map based implementation

Saikat Kanjilal (Commented) (JIRA) Wed, 04 Apr 2012 21:25:09 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247003#comment-13247003
 ]


Saikat Kanjilal commented on MAHOUT-940:
----------------------------------------

I am still reading the code to get a deeper understanding of what's happening. 
Some more questions:

1) The createClusterWriter method inside ClusterDumper creates one of three 
types of writers, depending on the outputFormat. One of the arguments passed to 
these writers is the map in question, shown below:

private Map<Integer, List<WeightedVectorWritable>> clusterIdToPoints;

It's not clear to me whether we need a deeper refactoring that rewrites or 
replaces these different types of writers with the ClusterOutputPostProcessor. 
Any thoughts on this? Should we offer a choice between the writers and the 
ClusterOutputPostProcessor?
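
To make question 1 concrete, here is a minimal sketch of the two approaches. 
It assumes nothing about the real Mahout classes: ClusteredPoint, 
groupInMemory and streamToWriters are hypothetical stand-ins, not code from 
ClusterDumper or ClusterOutputPostProcessor. The map-based variant holds every 
point until the end, while the streaming variant hands each point to its 
cluster's writer as soon as it is read.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterDumpSketch {

  /** Hypothetical stand-in for a clustered point: a cluster id plus a printable vector. */
  static final class ClusteredPoint {
    final int clusterId;
    final String vector;

    ClusteredPoint(int clusterId, String vector) {
      this.clusterId = clusterId;
      this.vector = vector;
    }
  }

  /** Map-based approach: every point is held in memory before anything is written. */
  static Map<Integer, List<String>> groupInMemory(Iterable<ClusteredPoint> points) {
    Map<Integer, List<String>> clusterIdToPoints = new HashMap<>();
    for (ClusteredPoint p : points) {
      clusterIdToPoints.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.vector);
    }
    return clusterIdToPoints;
  }

  /**
   * Streaming approach: each point is appended to its cluster's writer as soon as it is read.
   * StringWriter is used only to keep the sketch self-contained; a file-backed
   * Writer per cluster would be needed to actually keep the points out of memory.
   */
  static Map<Integer, Writer> streamToWriters(Iterable<ClusteredPoint> points) throws IOException {
    Map<Integer, Writer> writers = new HashMap<>();
    for (ClusteredPoint p : points) {
      Writer w = writers.computeIfAbsent(p.clusterId, k -> new StringWriter());
      w.write(p.vector);
      w.write('\n');
    }
    return writers;
  }

  public static void main(String[] args) throws IOException {
    List<ClusteredPoint> points = Arrays.asList(
        new ClusteredPoint(0, "[1.0, 2.0]"),
        new ClusteredPoint(1, "[9.0, 8.0]"),
        new ClusteredPoint(0, "[1.5, 2.5]"));
    System.out.println(groupInMemory(points));
    for (Map.Entry<Integer, Writer> e : streamToWriters(points).entrySet()) {
      System.out.print("cluster " + e.getKey() + ":\n" + e.getValue());
    }
  }
}
```

In the streaming variant only the open writers are kept per cluster, so memory 
use would grow with the number of clusters rather than the number of points.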

2) For the following line of code:
long numWritten = clusterWriter.write(
    new SequenceFileDirValueIterable<ClusterWritable>(
        new Path(seqFileDir, "part-*"), PathType.GLOB, conf));

Does the above just use an iterator to dump the points to different directories 
corresponding to the different clusters? The code is really hard to read, and 
SequenceFileDirValueIterable is not well commented.
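
As a way of framing question 2, the sketch below shows the general pattern 
such a call would follow if the iterable is lazy: the writer pulls one value at 
a time and returns a count, as numWritten does above. LazyValues and this 
write method are hypothetical stand-ins; this is an assumption about the shape 
of the code, not a description of what SequenceFileDirValueIterable or the 
cluster writers actually do.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class StreamingWriteSketch {

  /** Hypothetical lazy source: produces one value per next() call instead of loading them all. */
  static final class LazyValues implements Iterable<String> {
    private final int count;
    int produced = 0; // how many values have been materialized so far

    LazyValues(int count) {
      this.count = count;
    }

    @Override
    public Iterator<String> iterator() {
      return new Iterator<String>() {
        private int next = 0;

        @Override
        public boolean hasNext() {
          return next < count;
        }

        @Override
        public String next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          produced++;
          return "cluster-" + next++;
        }
      };
    }
  }

  /** Writes each value as it arrives and returns how many were written. */
  static long write(Iterable<String> values, Appendable out) throws IOException {
    long numWritten = 0;
    for (String v : values) {
      out.append(v).append('\n');
      numWritten++;
    }
    return numWritten;
  }

  public static void main(String[] args) throws IOException {
    StringBuilder out = new StringBuilder();
    long numWritten = write(new LazyValues(3), out);
    System.out.print(out);
    System.out.println("Wrote " + numWritten + " values");
  }
}
```

With this shape, no value exists until the writer asks for it, so the writer 
never needs the full set of values in memory at once.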

Thanks for your help in getting a better understanding of this.
                
> Clusterdumper - Get rid of map based implementation
> ---------------------------------------------------
>
>                 Key: MAHOUT-940
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-940
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: Paritosh Ranjan
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>
> The current implementation of ClusterDumper puts clusters and their related 
> vectors in a map. This generally results in OOM.
> Since ClusterOutputProcessor is available now, the ClusterDumper will first 
> process the clusteredPoints and then write the clusters to a local file. 
> The inability to properly read the clustering output due to ClusterDumper 
> facing OOM is seen too often on the mailing list. This improvement will fix 
> that problem.

