[ 
https://issues.apache.org/jira/browse/MAHOUT-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217088#comment-13217088
 ] 

Paritosh Ranjan commented on MAHOUT-929:
----------------------------------------

1) Do not worry about outlier removal for the first cut. Use 
emitMostlikely=true and clusterClassificationThreshold=0.0, so that no vector 
is pruned as an outlier.
2) I don't think you need to run a Hadoop job to test the mapper. Just 
test the logic inside the mapper. You will need EasyMock or another mocking 
framework to do that. The dev mailing list and the existing tests can suggest 
other ways to write tests. The job defines no reducer.
3) I don't think you need to pull in the code from 
ClusterClassificationDriver. The point is to test the cluster classification 
logic inside the mapper, not the driver.
4) It does not matter how many clusters you use. What matters is the clarity of 
the test cases. It really helps if the functionality under test is 
understandable from the test cases. 
The sequential and MapReduce implementations should produce the same result, so 
you can also reuse the assertions and data from 
ClusterClassificationDriverTest, which covers sequential cluster 
classification.
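
To illustrate points 1) and 2), here is a rough sketch of testing the 
classification logic without a Hadoop job. It is not Mahout code: 
ClassificationSketch, RecordingCollector, classify and score are hypothetical 
names, the score is a toy distance-based stand-in for the cluster model's pdf, 
and the hand-written collector only plays the role that an EasyMock mock of 
the mapper context would play in a real test.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the classification logic inside the mapper. */
final class ClassificationSketch {

  /** Records emitted pairs; takes the place of a mocked mapper context. */
  static final class RecordingCollector {
    final List<int[]> emitted = new ArrayList<>(); // each entry: {vectorIndex, clusterId}

    void write(int vectorIndex, int clusterId) {
      emitted.add(new int[] {vectorIndex, clusterId});
    }
  }

  /** Toy membership score that shrinks with squared distance (assumption, not Mahout's pdf). */
  static double score(double[] point, double[] center) {
    double d2 = 0.0;
    for (int i = 0; i < point.length; i++) {
      double diff = point[i] - center[i];
      d2 += diff * diff;
    }
    return 1.0 / (1.0 + d2);
  }

  /**
   * With emitMostLikely=true, emits only the best-scoring cluster, and only if its
   * score reaches the threshold. Otherwise emits every cluster whose score reaches it.
   */
  static void classify(int vectorIndex, double[] point, double[][] centers,
                       double threshold, boolean emitMostLikely, RecordingCollector out) {
    int best = -1;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int c = 0; c < centers.length; c++) {
      double s = score(point, centers[c]);
      if (s > bestScore) {
        bestScore = s;
        best = c;
      }
      if (!emitMostLikely && s >= threshold) {
        out.write(vectorIndex, c);
      }
    }
    if (emitMostLikely && bestScore >= threshold) {
      out.write(vectorIndex, best);
    }
  }

  public static void main(String[] args) {
    double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
    RecordingCollector out = new RecordingCollector();
    // Threshold 0.0 means nothing is pruned; each vector goes to its closest cluster.
    classify(0, new double[] {1.0, 1.0}, centers, 0.0, true, out);
    classify(1, new double[] {9.0, 8.0}, centers, 0.0, true, out);
    for (int[] pair : out.emitted) {
      System.out.println("vector " + pair[0] + " -> cluster " + pair[1]);
    }
  }
}
```

With two well-separated clusters the expected assignments are easy to read off 
the test data, which is the clarity point 4) asks for. The same assertions 
could then be run against both the sequential and the MapReduce code paths.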
                
> Refactor Clustering (Vector Classification) into a Separate Postprocess with 
> Outlier Pruning
> --------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-929
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-929
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification, Clustering
>    Affects Versions: 0.6
>            Reporter: Jeff Eastman
>            Assignee: Paritosh Ranjan
>             Fix For: 0.7
>
>         Attachments: Mahout-929, Mahout-929, Mahout-929, Mahout-929
>
>
> The current clustering drivers have a -cp option to produce clusteredPoints 
> directory containing the input vectors classified by the final clusters 
> produced by the algorithm. These options are redundantly implemented in those 
> drivers.
> - Factor out & implement an independent post processor to perform the 
> classification step independently of the various clustering implementations.
> - Implement a pluggable outlier removal capability for this classifier. 
> - Consider building off of the ClusterClassifier & ClusterIterator ideas.
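
One possible shape for the pluggable outlier removal mentioned above, as a 
sketch only. OutlierPolicy, ThresholdOutlierPolicy and OutlierPruningSketch 
are hypothetical names and not part of Mahout; the sketch assumes each vector 
has already been scored against every cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plug-in point: decides whether a vector is an outlier. */
interface OutlierPolicy {
  boolean isOutlier(double[] membershipScores);
}

/** Treats a vector as an outlier when its best score falls below a threshold. */
final class ThresholdOutlierPolicy implements OutlierPolicy {
  private final double threshold;

  ThresholdOutlierPolicy(double threshold) {
    this.threshold = threshold;
  }

  @Override
  public boolean isOutlier(double[] membershipScores) {
    double max = Double.NEGATIVE_INFINITY;
    for (double s : membershipScores) {
      max = Math.max(max, s);
    }
    return max < threshold;
  }
}

/** Post-processing step that keeps only the vectors the policy accepts. */
final class OutlierPruningSketch {
  static List<double[]> prune(List<double[]> scoresPerVector, OutlierPolicy policy) {
    List<double[]> kept = new ArrayList<>();
    for (double[] scores : scoresPerVector) {
      if (!policy.isOutlier(scores)) {
        kept.add(scores);
      }
    }
    return kept;
  }
}
```

Because the policy is an interface, the classification step does not need to 
know which pruning rule is in use, and a threshold of 0.0 keeps every vector 
whose scores are non-negative.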


        
