[ https://issues.apache.org/jira/browse/IGNITE-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432000#comment-16432000 ]
ASF GitHub Bot commented on IGNITE-8059:
----------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/ignite/pull/3760

> Integrate decision tree with partition based dataset
> ----------------------------------------------------
>
>                 Key: IGNITE-8059
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8059
>             Project: Ignite
>          Issue Type: Improvement
>          Components: ml
>            Reporter: Anton Dmitriev
>            Assignee: Anton Dmitriev
>            Priority: Major
>             Fix For: 2.5
>
>
> A partition based dataset (a new underlying infrastructure component) was
> added as part of IGNITE-7437, and now we need to adapt the decision tree
> algorithm to work on top of this infrastructure.
> ----
> The way the decision tree algorithm is implemented on top of row-partitioned
> data is described below.
> The basic idea behind any decision tree, both regression and classification,
> is to find the *data split* that minimizes an *impurity measure* such as
> [Gini coefficient|https://en.wikipedia.org/wiki/Gini_coefficient],
> [entropy|https://en.wikipedia.org/wiki/Entropy_(information_theory)] or [mean
> squared error|https://en.wikipedia.org/wiki/Mean_squared_error]. To calculate
> the best split we need to build a _function_ that describes the dependency
> between the split point (independent variable) and the impurity measure
> (dependent variable), and then find the minimum of this _function_.
> In a distributed system, where the data is partitioned by row, we can
> calculate such a _function_ on every node, compress it somehow, and then pass
> it to the master node. On the master node we need to sum the _functions_
> received from all nodes and then find the minimum of the resulting _function_.
> This is the way the decision tree algorithm is implemented in the Apache
> Ignite ML module.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
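
The split search described in the issue can be sketched roughly as follows. This is a minimal illustration only, not the Apache Ignite ML code or API: the names `Stats`, `local_step_function`, `left_at` and `best_split` are hypothetical, the impurity measure is assumed to be mean squared error on a single feature, and the "compress it somehow" step is left out. One further assumption: each node ships additive label statistics (count, sum, sum of squares) per split point rather than finished impurity values, so that the master can sum the per-node functions before computing the impurity.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Additive label statistics for the rows on one side of a split."""
    n: int = 0        # number of rows
    s: float = 0.0    # sum of labels
    s2: float = 0.0   # sum of squared labels

    def __add__(self, other):
        return Stats(self.n + other.n, self.s + other.s, self.s2 + other.s2)

    def __sub__(self, other):
        return Stats(self.n - other.n, self.s - other.s, self.s2 - other.s2)

    def sse(self):
        """Sum of squared errors around the mean (n times the variance)."""
        return self.s2 - self.s * self.s / self.n if self.n else 0.0


def local_step_function(rows):
    """Runs on one partition; rows is a list of (feature, label) pairs.

    Returns (thresholds, left, total), where left[i] holds the statistics of
    all local rows with feature <= thresholds[i].
    """
    thresholds, left, acc = [], [], Stats()
    for x, y in sorted(rows):
        acc = acc + Stats(1, y, y * y)
        if thresholds and thresholds[-1] == x:
            left[-1] = acc          # repeated feature value: extend the last step
        else:
            thresholds.append(x)
            left.append(acc)
    return thresholds, left, acc


def left_at(function, t):
    """Evaluate one partition's step function at threshold t."""
    thresholds, left, _ = function
    i = bisect_right(thresholds, t) - 1
    return left[i] if i >= 0 else Stats()


def best_split(functions):
    """Runs on the master: sum the per-partition functions, return the minimum."""
    total = sum((f[2] for f in functions), Stats())
    candidates = sorted({t for f in functions for t in f[0]})
    best = None
    for t in candidates:
        l = sum((left_at(f, t) for f in functions), Stats())
        r = total - l
        if r.n == 0:
            continue                # every row falls on the left: not a split
        impurity = l.sse() + r.sse()
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best                     # (threshold, impurity), or None if no split


# Two row partitions of the same dataset, as (feature, label) pairs.
partitions = [
    [(1.0, 1.0), (2.0, 1.0), (8.0, 5.0)],
    [(3.0, 1.0), (7.0, 5.0), (9.0, 5.0)],
]
split = best_split([local_step_function(p) for p in partitions])
```

For the two sample partitions, `split` is the threshold 3.0 with an impurity of zero, the same result as running the search over all six rows on a single node.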