That's fine; clustering should be included in the rest of the job consistency work too. On LDA at least, if you look at the driver, it's taking the -1 default from the options builder and setting topic smoothing to 50/numTopics. I can't really pass that default into the options builder, since it has not yet read the other options. Good catch on -k though; for Dirichlet it is required. I'll change the option to .withRequired(false) and add .withRequired(true) in the Dirichlet jobs, which do require it.
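
A minimal sketch of that sentinel handling, for illustration only. The class and method names are hypothetical and this is not the actual LDA driver code; it just assumes the parsed option value arrives as a double and that -1 means "not given on the command line":

```java
// Hypothetical helper, not part of Mahout: resolves the topic smoothing value
// once numTopics is known, which the options builder cannot do up front.
final class TopicSmoothingDefault {

  /** Sentinel default from the options builder meaning "not specified". */
  static final double UNSET = -1.0;

  private TopicSmoothingDefault() {
  }

  /**
   * Returns the user-supplied smoothing value, or 50/numTopics when the
   * option was left at its sentinel default.
   */
  static double resolve(double parsedValue, int numTopics) {
    if (numTopics <= 0) {
      throw new IllegalArgumentException("numTopics must be positive: " + numTopics);
    }
    return parsedValue == UNSET ? 50.0 / numTopics : parsedValue;
  }
}
```

The point is only that the fallback has to be computed in the driver, after all options have been parsed.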

In general, since different algorithms have different required options, perhaps it would be best to have the DefaultOptionCreator not mark any option as required and to do the required/optional determination in the various drivers.
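
Roughly what I have in mind, as a sketch rather than a patch. Everything here is hypothetical (the helper name, and the idea that parsed options end up in a flag-to-value map); it does not use the real DefaultOptionCreator or CLI builder API:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: each driver declares its own required flags and
// checks them after parsing, instead of the shared option creator deciding.
final class RequiredOptions {

  private RequiredOptions() {
  }

  /** Returns the required flags that are absent from the parsed options, sorted. */
  static Set<String> missing(Set<String> required, Map<String, String> parsed) {
    Set<String> missing = new TreeSet<>(required);
    missing.removeAll(parsed.keySet());
    return missing;
  }
}
```

A Dirichlet driver would then pass a required set containing -k, while a driver that treats -k as optional would leave it out and get an empty result back for the same command line.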

On 5/22/10 6:31 PM, Robin Anil (JIRA) wrote:
      [ https://issues.apache.org/jira/browse/MAHOUT-294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-294:
------------------------------

     Component/s: Clustering

Adding clustering back. Saw some bugs:

KMeans: the -k parameter was set as required=true, so it was overwriting the centroids even when -k was not specified, instead of reading them.
LDA: topic smoothing was changed to a default of -1 (it should be 50/numTopics).

Uniform API behavior for Jobs
-----------------------------

                 Key: MAHOUT-294
                 URL: https://issues.apache.org/jira/browse/MAHOUT-294
             Project: Mahout
          Issue Type: Improvement
          Components: Classification, Clustering, Collaborative Filtering, 
Frequent Itemset/Association Rule Mining, Genetic Algorithms, Math, Utils
    Affects Versions: 0.4
            Reporter: Robin Anil
             Fix For: 0.4


* Move AbstractJob to common and convert all the Driver classes to extend that.
    One suggestion is:
    AlgorithmParams params = ParamsBuilder.build().withParam("-i", input).withParam("-o", output)....
    MyAlgorithm.runJob(params) throws ParameterMissingException;
* Give uniform command-line parameters for various algorithms.
    e.g. currently the distance measure is -d, -dm, or -m in different places in clustering
* Add a temp directory as a parameter 
http://www.lucidimagination.com/search/document/28a979aa62c02a1/who_owns_mahout_bucket_on_s3#ddb5855e8bdace45
This issue will keep track of all discussion/patches related to the design and 
cleanup of Mahout API
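
The builder suggestion in the first bullet could look something like the sketch below. The names AlgorithmParams, ParamsBuilder, withParam and ParameterMissingException come from the suggestion itself; everything else (the create() and require() methods, storing values as strings) is an assumption made for illustration and is not existing Mahout API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Thrown when a job asks for a parameter that was never supplied.
final class ParameterMissingException extends Exception {
  ParameterMissingException(String flag) {
    super("Missing required parameter: " + flag);
  }
}

// Immutable view of the parameters handed to a job (hypothetical shape).
final class AlgorithmParams {
  private final Map<String, String> values;

  AlgorithmParams(Map<String, String> values) {
    this.values = new LinkedHashMap<>(values);
  }

  /** Returns the value for a flag, or fails if the flag was not set. */
  String require(String flag) throws ParameterMissingException {
    String value = values.get(flag);
    if (value == null) {
      throw new ParameterMissingException(flag);
    }
    return value;
  }
}

// Fluent builder; create() is an assumed terminal call.
final class ParamsBuilder {
  private final Map<String, String> values = new LinkedHashMap<>();

  static ParamsBuilder build() {
    return new ParamsBuilder();
  }

  ParamsBuilder withParam(String flag, String value) {
    values.put(flag, value);
    return this;
  }

  AlgorithmParams create() {
    return new AlgorithmParams(values);
  }
}
```

A runJob(params) method would then call require() for each flag it needs and let ParameterMissingException propagate to the caller.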
