[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788961#action_12788961 ]
Ted Dunning commented on MAHOUT-216:
------------------------------------

Couldn't you just resort the data using random keys? That leaves you with as many or as few files as you like, and allows you to do the split any way you like at learning time.

> Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-216
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>
> The poor results of the partial decision forest implementation may be explained by the particular distribution of the partitioned data. For example, if a partition does not contain any instance of a given class, the decision trees built using this partition won't be able to classify that class.
>
> According to [CHAN, 95]:
> {quote}
> Random Selection of the partitioned data sets with a uniform distribution of classes is perhaps the most sensible solution. Here we may attempt to maintain the same frequency distribution over the "class attribute" so that each partition represents a good but a smaller model of the entire training set.
> {quote}
> [CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable Data Mining"
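To make the random-key suggestion above concrete, here is a minimal sketch of what resorting the training data by random keys might look like as a plain Hadoop job. The class names and wiring are illustrative assumptions, not existing Mahout code.

{code:java}
// Hypothetical sketch (not Mahout code): redistribute a text dataset into R
// randomly ordered partitions by keying every record with a random integer
// and letting the MapReduce shuffle spread the records across reducers.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RandomResort {

  /** Emits each input line under a random key; the default hash partitioner
   *  then assigns the lines uniformly at random to the configured reducers. */
  public static class RandomKeyMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rng = new Random();
    private final IntWritable randomKey = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      randomKey.set(rng.nextInt(Integer.MAX_VALUE));
      context.write(randomKey, line);
    }
  }

  /** Writes the shuffled lines back out; each reducer's output file is one
   *  partition of the resorted data. */
  public static class ShuffledOutputReducer
      extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> lines, Context context)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(NullWritable.get(), line);
      }
    }
  }
}
{code}

Setting the number of reduce tasks to the desired number of partitions yields one output file per partition. Because records are assigned to partitions at random, each partition is, in expectation, a uniform sample of the full training set, so its class frequencies should roughly match the global ones; a stratified split keyed on the class attribute would be needed to guarantee that property exactly.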