[ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859693#action_12859693 ]

Robin Anil commented on MAHOUT-384:
-----------------------------------

Hi Tony. Nice work on the patch. Before we commit this, though, there are a 
couple of things you need to cover. I still have to read the algorithm in 
detail to know what's happening, but I have some queries and suggestions below, 
which form a kind of checklist for making this a committable patch.

1) I am not a fan of Text-based input, though it is what most of the algorithms 
in Mahout were first implemented with. Splitting and joining text files on 
commas is not very clean. Can you convert this to deal with a SequenceFile of 
VectorWritable, or some other Writable format? What's your input schema? (See 
the conversion sketch after this list.)
2) There is a code style we enforce in Mahout. You can run mvn 
checkstyle:checkstyle to see the violations. We also have an Eclipse formatter 
which formats code to almost match the checkstyle rules (rare manual 
interventions are still required). Take a look at 
https://cwiki.apache.org/MAHOUT/howtocontribute.html; you will find the Eclipse 
formatter file at the bottom.
3) For parsing args, use the Apache Commons CLI2 library. Take a look at 
o/a/m/clustering/kmeans/KMeansDriver to see usage, or the sketch after this 
list.
4) What is Utils being used for?
5) @Override
   public void setup(Context context) throws IOException, InterruptedException {
     String filePath = context.getConfiguration().get("a");
     sumAttribute = Utils.readFile(filePath + "/part-r-00000");
   }
Please use the distributed cache to read the file in a map/reduce context. See 
the DictionaryVectorizer map/reduce classes for usage, and the sketch after 
this list.
6) job.setNumReduceTasks(1); — is this necessary? Doesn't it hurt the 
scalability of this algorithm? Is the single reducer going to get a lot of data 
from the mappers? If yes, you should think about removing this constraint and 
either letting the Hadoop parameters govern it or parameterizing it.
7) Can this job be optimised using a Combiner? If yes, it's really worth 
spending the time to write one (sketch after this list).
8) Tests! :)
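
Regarding 1), here is a minimal sketch of the conversion I mean, assuming the 
input is one comma-separated record per line with categorical values already 
encoded as numeric codes; the class name and the record-id key scheme are 
illustrative, not something from your patch:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    /** Illustrative one-off converter: CSV -> SequenceFile of Text/VectorWritable. */
    public final class CsvToSequenceFile {

      private CsvToSequenceFile() { }

      public static void convert(Path input, Path output, Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, output, Text.class, VectorWritable.class);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(input)));
        try {
          String line;
          int recordId = 0;
          while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            Vector row = new DenseVector(fields.length);
            for (int i = 0; i < fields.length; i++) {
              // assumes each categorical value is already a numeric code
              row.set(i, Double.parseDouble(fields[i]));
            }
            writer.append(new Text(String.valueOf(recordId++)), new VectorWritable(row));
          }
        } finally {
          writer.close();
          reader.close();
        }
      }
    }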
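
Regarding 3), the option handling in KMeansDriver boils down to the pattern 
below; the option names are illustrative for your driver, not prescriptive:

    import org.apache.commons.cli2.CommandLine;
    import org.apache.commons.cli2.Group;
    import org.apache.commons.cli2.Option;
    import org.apache.commons.cli2.OptionException;
    import org.apache.commons.cli2.builder.ArgumentBuilder;
    import org.apache.commons.cli2.builder.DefaultOptionBuilder;
    import org.apache.commons.cli2.builder.GroupBuilder;
    import org.apache.commons.cli2.commandline.Parser;
    import org.apache.mahout.common.CommandLineUtil;

    public final class AvfDriver {

      private AvfDriver() { }

      public static void main(String[] args) throws Exception {
        DefaultOptionBuilder obuilder = new DefaultOptionBuilder();
        ArgumentBuilder abuilder = new ArgumentBuilder();
        GroupBuilder gbuilder = new GroupBuilder();

        Option inputOpt = obuilder.withLongName("input").withShortName("i").withRequired(true)
            .withArgument(abuilder.withName("input").withMinimum(1).withMaximum(1).create())
            .withDescription("Path to the input data").create();
        Option outputOpt = obuilder.withLongName("output").withShortName("o").withRequired(true)
            .withArgument(abuilder.withName("output").withMinimum(1).withMaximum(1).create())
            .withDescription("Path for the output data").create();
        Option helpOpt = obuilder.withLongName("help").withShortName("h")
            .withDescription("Print out help").create();

        Group group = gbuilder.withName("Options")
            .withOption(inputOpt).withOption(outputOpt).withOption(helpOpt).create();

        Parser parser = new Parser();
        parser.setGroup(group);
        try {
          CommandLine cmdLine = parser.parse(args);
          if (cmdLine.hasOption(helpOpt)) {
            CommandLineUtil.printHelp(group);
            return;
          }
          String input = cmdLine.getValue(inputOpt).toString();
          String output = cmdLine.getValue(outputOpt).toString();
          // ... configure and run the jobs with input/output ...
        } catch (OptionException e) {
          CommandLineUtil.printHelp(group);
        }
      }
    }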
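
Regarding 5), the distributed cache version would look roughly like this; the 
mapper's type parameters are placeholders since I don't know your schema, and 
Utils.readFile is kept from your patch. In the driver you would register the 
file before submitting, e.g. 
DistributedCache.addCacheFile(new URI(tempPath + "/part-r-00000"), conf); 
and then in the mapper:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AvfMapper extends Mapper<LongWritable, Text, Text, Text> {

      private String sumAttribute;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // files registered in the driver are copied to each task's local disk
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        if (cached == null || cached.length == 0) {
          throw new IOException("attribute frequency file not found in distributed cache");
        }
        sumAttribute = Utils.readFile(cached[0].toString()); // readFile as in your patch
      }
    }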
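
Regarding 7), if the first job is counting attribute-value frequencies (I am 
guessing at the schema here), the combiner is just a map-side pre-sum, along 
the lines of:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /** Pre-sums per-attribute-value counts before they cross the network. */
    public class FrequencySumCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

      private final LongWritable sum = new LongWritable();

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable value : values) {
          total += value.get();
        }
        sum.set(total);
        context.write(key, sum);
      }
    }

Wire it in with job.setCombinerClass(FrequencySumCombiner.class); since the sum 
is associative and commutative, the same class can usually serve as the reducer 
of that job too.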

> Implement of AVF algorithm
> --------------------------
>
>                 Key: MAHOUT-384
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-384
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: tony cui
>         Attachments: mahout-384.patch
>
>
> This program implements an outlier detection algorithm called AVF, a fast 
> parallel outlier detection method for categorical datasets using MapReduce, 
> introduced in this paper: 
>     http://thepublicgrid.org/papers/koufakou_wcci_08.pdf
> The following is an example of how to run this program under Hadoop:
> $hadoop jar programName.jar avfDriver inputData interTempData outputData
> The output data contains the ordered avfValue in the first column, followed 
> by the original input data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
