[ https://issues.apache.org/jira/browse/FLINK-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544095#comment-14544095 ]

Sachin Goel commented on FLINK-1727:
------------------------------------

The approach in [1] seems the most generic to implement. The main runtime saving will come from the number of candidate splits we evaluate for each attribute, which I think really depends on the data. From previous experience, though, a histogram size of 1000 works okay. Perhaps we can provide some form of cross-validation later on to choose the size?

> Add decision tree to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1727
>                 URL: https://issues.apache.org/jira/browse/FLINK-1727
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Mikio Braun
>              Labels: ML
>
> Decision trees are widely used for classification and regression tasks. Thus,
> it would be worthwhile to add support for them to Flink's machine learning
> library.
> A streaming parallel decision tree learning algorithm has been proposed by
> Ben-Haim and Tom-Tov [1]. This can maybe be adapted to a batch use case as well.
> [2] contains an overview of different techniques for scaling up inductive
> learning algorithms. A presentation of Spark's MLlib decision tree
> implementation can be found in [3].
> Resources:
> [1] [http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf]
> [2]
> [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.8226&rep=rep1&type=pdf]
> [3]
> [http://spark-summit.org/wp-content/uploads/2014/07/Scalable-Distributed-Decision-Trees-in-Spark-Made-Das-Sparks-Talwalkar.pdf]
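
To make the histogram-size trade-off from the comment concrete, here is a minimal Python sketch of a fixed-size streaming histogram in the spirit of [1]: each attribute keeps at most max_bins (centroid, count) bins, and whenever that budget is exceeded the two closest bins are merged into their weighted mean. The names StreamingHistogram, update, merge and candidate_splits are hypothetical and not part of Flink, and taking midpoints between adjacent centroids as split candidates is a simplifying assumption rather than the procedure from the paper.

```python
from bisect import bisect_left


class StreamingHistogram:
    """Fixed-size histogram: a sorted list of [centroid, count] bins."""

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []  # kept sorted by centroid

    def update(self, value):
        """Add one observation, then shrink back to max_bins if needed."""
        centroids = [c for c, _ in self.bins]
        i = bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1          # existing centroid: bump its count
        else:
            self.bins.insert(i, [value, 1])
        self._shrink()

    def merge(self, other):
        """Fold in another histogram, e.g. one built on another partition."""
        self.bins = sorted([c, m] for c, m in self.bins + other.bins)
        self._shrink()

    def candidate_splits(self):
        """Assumed simplification: midpoints between adjacent centroids."""
        return [(a[0] + b[0]) / 2 for a, b in zip(self.bins, self.bins[1:])]

    def _shrink(self):
        while len(self.bins) > self.max_bins:
            # Find the adjacent pair with the smallest centroid gap.
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (c1, m1), (c2, m2) = self.bins[i], self.bins[i + 1]
            total = m1 + m2
            # Replace the pair with a single count-weighted centroid.
            self.bins[i:i + 2] = [[(c1 * m1 + c2 * m2) / total, total]]


h = StreamingHistogram(max_bins=3)
for v in [1, 2, 10, 11, 20]:
    h.update(v)
print(h.bins)                # [[1.5, 2], [10.5, 2], [20, 1]]
print(h.candidate_splits())  # [6.0, 15.25]
```

With this layout the number of thresholds evaluated per attribute is at most max_bins - 1, independent of the number of records, which is why the histogram size (1000 in the comment) is the main knob trading training time against split resolution.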