[
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexandr Shapkin updated IGNITE-20139:
--------------------------------------
Summary: RandomForestClassifierTrainer accuracy issue (was:
RandomForestClassifierTrainer is checking the same conditions)
> RandomForestClassifierTrainer accuracy issue
> --------------------------------------------
>
> Key: IGNITE-20139
> URL: https://issues.apache.org/jira/browse/IGNITE-20139
> Project: Ignite
> Issue Type: Bug
> Components: ml
> Affects Versions: 2.15
> Reporter: Alexandr Shapkin
> Priority: Major
> Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
>
>
> We tried to use GridGain's machine learning capabilities and discovered a
> bug in GG's implementation of Random Forest. When comparing GG's output with
> a Python prototype (the scikit-learn library), we noticed that GG's
> predictions have much lower accuracy despite using the same data set and
> model parameters.
> Further investigation showed that GridGain generates decision trees that
> effectively "loop": a branch keeps checking the same condition over and over
> until it reaches the maximum tree depth.
> I've attached a standalone reproducer which uses a small excerpt of our data
> set. It loads the data from a CSV file and trains the model with just one
> tree. The reproducer then finds one of the looping branches and prints it.
> You will see that every single node in the branch uses the same feature and
> value and has the same calculated impurity.
> On my machine the code reproduces this issue 100% of the time.
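> The attached reproducer itself is Java/Ignite; purely as an illustration of
> the check it performs, here is a sketch in Python against a hypothetical
> minimal node structure (the `Node` class and field names below are
> assumptions for this sketch, not Ignite's real tree API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Illustrative stand-in for a decision-tree split node.
    feature: int
    threshold: float
    impurity: float
    left: Optional["Node"] = None

def find_looping_run(root: Node) -> int:
    """Walk down one branch and return the longest tail of consecutive
    nodes repeating the exact same split (feature, threshold, impurity),
    which is the symptom described above."""
    run, node = 1, root
    while node.left is not None:
        nxt = node.left
        if (nxt.feature, nxt.threshold, nxt.impurity) == \
           (node.feature, node.threshold, node.impurity):
            run += 1  # same condition repeated on the child
        else:
            run = 1   # a genuinely new split resets the run
        node = nxt
    return run

# A healthy branch: every level splits on a different condition.
healthy = Node(0, 1.5, 0.4, Node(2, 0.3, 0.2, Node(1, 7.0, 0.1)))
# A "looping" branch: the identical condition repeated level after level.
looping = Node(3, 2.5, 0.33, Node(3, 2.5, 0.33, Node(3, 2.5, 0.33)))

print(find_looping_run(healthy))  # 1
print(find_looping_run(looping))  # 3
```

> In the buggy trees the run length grows until the configured maximum tree
> depth, instead of staying at 1 as in a healthy tree.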
> I've also attached an example of the tree generated by Python's scikit-learn
> on the same data set with the same parameters. In Python the tree usually
> doesn't get deeper than 20 nodes.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)