[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788961#action_12788961 ]
Ted Dunning commented on MAHOUT-216:
------------------------------------

Couldn't you just resort the data using random keys? That leaves you with as many or as few files as you like, and allows you to do the split any way you like at learning time.

> Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-216
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>
> The poor results of the partial decision forest implementation may be explained by the particular distribution of the partitioned data. For example, if a partition does not contain any instance of a given class, the decision trees built using this partition won't be able to classify that class.
>
> According to [CHAN, 95]:
> {quote}
> Random Selection of the partitioned data sets with a uniform distribution of classes is perhaps the most sensible solution. Here we may attempt to maintain the same frequency distribution over the "class attribute" so that each partition represents a good but a smaller model of the entire training set.
> {quote}
> [CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable Data Mining"
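To make the random-key suggestion above concrete, here is a minimal sketch of what resorting the training data by random keys might look like as a plain Hadoop job. The class names and wiring are illustrative assumptions, not existing Mahout code.

{code:java}
// Hypothetical sketch (not Mahout code): redistribute a text dataset into R
// randomly ordered partitions by keying every record with a random integer
// and letting the MapReduce shuffle spread the records across reducers.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RandomResort {

  /** Emits each input line under a random key; the default hash partitioner
   *  then assigns the lines uniformly at random to the configured reducers. */
  public static class RandomKeyMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rng = new Random();
    private final IntWritable randomKey = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      randomKey.set(rng.nextInt(Integer.MAX_VALUE));
      context.write(randomKey, line);
    }
  }

  /** Writes the shuffled lines back out; each reducer's output file is one
   *  partition of the resorted data. */
  public static class ShuffledOutputReducer
      extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> lines, Context context)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(NullWritable.get(), line);
      }
    }
  }
}
{code}

Setting the number of reduce tasks to the desired number of partitions yields one output file per partition. Because records are assigned to partitions at random, each partition is, in expectation, a uniform sample of the full training set, so its class frequencies should roughly match the global ones; a stratified split keyed on the class attribute would be needed to guarantee that property exactly.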