[ https://issues.apache.org/jira/browse/IGNITE-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Dmitriev updated IGNITE-8059:
-----------------------------------

    Description:

A partition based dataset (a new underlying infrastructure component) was added as part of IGNITE-7437, and now we need to adapt the decision tree algorithm to work on top of this infrastructure.
----
The way the decision tree algorithm is implemented on top of row-partitioned data is described below.

The basic idea behind any decision tree, both regression and classification, is to find the *data split* that minimizes an *impurity measure* such as the [Gini coefficient|https://en.wikipedia.org/wiki/Gini_coefficient], [entropy|https://en.wikipedia.org/wiki/Entropy_(information_theory)] or [mean squared error|https://en.wikipedia.org/wiki/Mean_squared_error]. To calculate the best split we need to build a _function_ that describes the dependency between the split point (independent variable) and the impurity measure (dependent variable), and then find a minimum of this _function_.

In a distributed system, where the data is partitioned by row, we can calculate such a _function_ on every node, compress it somehow, and then pass it to the master node. On the master node we sum the _functions_ received from all nodes and then find a minimum of the resulting _function_. This is the way the decision tree algorithm is implemented in the Apache Ignite ML module.
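For illustration, below is a minimal, self-contained Java sketch of this scheme, not the actual Ignite ML code: the class names, the fixed grid of candidate split points, the toy data, and the use of per-split sufficient statistics as the "compressed" per-partition representation are all assumptions made for the example. Each "partition" computes, for every candidate split, the statistics needed to evaluate the MSE impurity; the "master" sums these statistics across partitions and picks the split with the minimal total impurity.

{code:java}
import java.util.List;

/**
 * Minimal sketch (NOT the actual Apache Ignite ML API) of the scheme above,
 * for a regression tree with MSE impurity:
 *  1. every partition evaluates its part of the "impurity by split point"
 *     function on a shared grid of candidate splits;
 *  2. the compressed per-partition results are passed to the master;
 *  3. the master sums them and finds the split with minimal impurity.
 */
public class DistributedBestSplitSketch {

    /** Sufficient statistics of a label set: enough to compute its variance. */
    static final class Stats {
        long cnt;
        double sum;
        double sumSq;

        void add(double y) { cnt++; sum += y; sumSq += y * y; }

        void merge(Stats o) { cnt += o.cnt; sum += o.sum; sumSq += o.sumSq; }

        /** Sum of squared deviations from the mean (count times MSE impurity). */
        double sse() { return cnt == 0 ? 0 : sumSq - sum * sum / cnt; }
    }

    /** Partition side: left/right label statistics for every candidate split. */
    static Stats[][] localHistogram(double[] feature, double[] label, double[] splits) {
        Stats[][] hist = new Stats[splits.length][2];
        for (int s = 0; s < splits.length; s++) {
            hist[s][0] = new Stats();
            hist[s][1] = new Stats();
        }
        for (int i = 0; i < feature.length; i++)
            for (int s = 0; s < splits.length; s++)
                hist[s][feature[i] <= splits[s] ? 0 : 1].add(label[i]);
        return hist;
    }

    /** Master side: sum the per-partition histograms and pick the best split. */
    static double bestSplit(List<Stats[][]> partitions, double[] splits) {
        Stats[][] total = partitions.get(0); // reuse the first histogram as accumulator
        for (int p = 1; p < partitions.size(); p++)
            for (int s = 0; s < splits.length; s++) {
                total[s][0].merge(partitions.get(p)[s][0]);
                total[s][1].merge(partitions.get(p)[s][1]);
            }
        int best = 0;
        double bestImpurity = Double.POSITIVE_INFINITY;
        for (int s = 0; s < splits.length; s++) {
            double impurity = total[s][0].sse() + total[s][1].sse();
            if (impurity < bestImpurity) {
                bestImpurity = impurity;
                best = s;
            }
        }
        return splits[best];
    }

    public static void main(String[] args) {
        double[] splits = {1.5, 2.5, 3.5};

        // Two row "partitions" of a toy one-feature regression dataset.
        Stats[][] part1 = localHistogram(new double[] {1, 2}, new double[] {10, 10}, splits);
        Stats[][] part2 = localHistogram(new double[] {3, 4}, new double[] {20, 20}, splits);

        // Prints 2.5: the labels are constant on each side of that split.
        System.out.println("Best split: " + bestSplit(List.of(part1, part2), splits));
    }
}
{code}

Note that representing the per-partition _function_ by sufficient statistics rather than raw impurity values makes the summation on the master exact: variance-based impurity computed against locally averaged labels would not add up correctly across partitions.
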
> Integrate decision tree with partition based dataset
> -----------------------------------------------------
>
>                 Key: IGNITE-8059
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8059
>             Project: Ignite
>          Issue Type: Improvement
>          Components: ml
>            Reporter: Anton Dmitriev
>            Assignee: Anton Dmitriev
>            Priority: Major
>             Fix For: 2.5
>
>
> A partition based dataset (a new underlying infrastructure component) was added
> as part of IGNITE-7437, and now we need to adapt the decision tree algorithm to
> work on top of this infrastructure.
> ----
> The way the decision tree algorithm is implemented on top of row-partitioned
> data is described below.
> The basic idea behind any decision tree, both regression and classification,
> is to find the *data split* that minimizes an *impurity measure* such as the
> [Gini coefficient|https://en.wikipedia.org/wiki/Gini_coefficient],
> [entropy|https://en.wikipedia.org/wiki/Entropy_(information_theory)] or [mean
> squared error|https://en.wikipedia.org/wiki/Mean_squared_error]. To calculate
> the best split we need to build a _function_ that describes the dependency
> between the split point (independent variable) and the impurity measure
> (dependent variable), and then find a minimum of this _function_.
> In a distributed system, where the data is partitioned by row, we can
> calculate such a _function_ on every node, compress it somehow, and then pass
> it to the master node. On the master node we sum the _functions_ received from
> all nodes and then find a minimum of the resulting _function_. This is the way
> the decision tree algorithm is implemented in the Apache Ignite ML module.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)