ibelyakov opened a new pull request, #256: URL: https://github.com/apache/ignite-extensions/pull/256
The issue occurs when a “pure” node (impurity<sup>*</sup> = 0) is present in the tree. Impurity is calculated only for child nodes and not for the current node, and there is no check for whether the node is already “pure” and contains just one label. As a result, the “bestSplit” calculation runs for an already “pure” node and decides that all items should move to the left child node and none to the right (leaf) node, which yields 2 “pure” child nodes. Since impurity is not calculated for the current (parent) node, the `parentNode.getImpurity() - split.get().getImpurity() > minImpurityDelta` check is always true, and the already “pure” node keeps being split until the max tree depth is reached.

The following changes were made to resolve the issue:

1. Gain<sup>**</sup> calculation and a gain check for the split were added.
2. A node impurity check was added: once the impurity reaches 0, the node is “pure” and no split needs to be calculated for it.
3. The Gini impurity calculation was changed to `(1 - sum(p^2))` so that it yields correct values in the range from 0 to 0.5, as required for the Gini index.

<sup>*</sup> Impurity is a value from 0 to 0.5 (for two labels) that shows whether a node is “pure” (impurity = 0, just 1 label) or “impure”; impurity = 0.5 is the worst case, where the label ratio is 1:1.

<sup>**</sup> Gain is the difference between the parent node’s impurity and the weighted impurity of its child nodes. The split that provides the maximum gain is considered the best.

See https://www.learndatasci.com/glossary/gini-impurity/
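The calculations described above can be sketched as follows. This is a minimal illustration, not the code from this pull request: the class `GiniSketch` and its methods `gini`, `gain`, and `shouldSplit` are hypothetical names, and label counts are assumed to be passed as plain `long[]` arrays. Only `minImpurityDelta` is taken from the description.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Apache Ignite: Gini impurity and split gain. */
final class GiniSketch {
    /** Gini impurity 1 - sum(p_i^2) over label counts; 0 for an empty or single-label node. */
    static double gini(long[] labelCounts) {
        long total = Arrays.stream(labelCounts).sum();
        if (total == 0)
            return 0.0;

        double sumSquares = 0.0;
        for (long cnt : labelCounts) {
            double p = (double) cnt / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Gain: parent impurity minus the size-weighted impurity of the two children. */
    static double gain(long[] parent, long[] left, long[] right) {
        long n = Arrays.stream(parent).sum();
        if (n == 0)
            return 0.0;

        long nLeft = Arrays.stream(left).sum();
        long nRight = Arrays.stream(right).sum();
        double weighted = (nLeft * gini(left) + nRight * gini(right)) / n;
        return gini(parent) - weighted;
    }

    /** A "pure" node is never split; otherwise the gain must exceed the threshold. */
    static boolean shouldSplit(long[] parent, long[] left, long[] right, double minImpurityDelta) {
        // Exact comparison is enough for this sketch; real code may prefer a tolerance.
        if (gini(parent) == 0.0)
            return false;
        return gain(parent, left, right) > minImpurityDelta;
    }

    public static void main(String[] args) {
        long[] pure = {10, 0};
        long[] mixed = {5, 5};

        System.out.println(gini(pure));   // prints 0.0
        System.out.println(gini(mixed));  // prints 0.5

        // All items go left, none go right: the degenerate split from the description.
        System.out.println(shouldSplit(pure, new long[] {10, 0}, new long[] {0, 0}, 0.0));  // prints false
        // A perfect separation of a 1:1 node has gain 0.5.
        System.out.println(shouldSplit(mixed, new long[] {5, 0}, new long[] {0, 5}, 0.0)); // prints true
    }
}
```

In this sketch the degenerate split of a “pure” node is rejected twice over: the impurity check short-circuits it, and its gain is 0 in any case, so it cannot exceed a non-negative `minImpurityDelta`.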
