Alexandr Shapkin created IGNITE-20139:
-----------------------------------------
Summary: RandomForestClassifierTrainer is checking the same conditions
Key: IGNITE-20139
URL: https://issues.apache.org/jira/browse/IGNITE-20139
Project: Ignite
Issue Type: Bug
Components: ml
Affects Versions: 2.15
Reporter: Alexandr Shapkin
Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
We tried to use GridGain's machine learning capabilities and discovered a bug
in GG's implementation of Random Forest. When comparing GG's output with a
Python prototype (scikit-learn library), we noticed that GG's predictions have
much lower accuracy despite using the same data set and model parameters.
Further investigation showed that GridGain generates decision trees that
effectively "loop": a branch starts checking the same condition over and over
until it reaches the maximum tree depth.
I've attached a standalone reproducer which uses a small excerpt of our data
set.
It loads data from the CSV file, then trains the model with just 1 tree. The
reproducer then finds one of the looping branches and prints it. You will see
that every single node in the branch uses the same feature and value and has
the same calculated impurity.
On my machine the code reproduces this issue 100% of the time.
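The branch check described above can be illustrated with a minimal sketch. It
is not the attached reproducer and does not use Ignite's API: the Node class,
its fields, and the longest_repeated_chain helper are hypothetical names
introduced only for illustration. It walks a binary tree and returns the
longest parent-to-child run of nodes that share the same feature and
threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical split node (an assumption, not Ignite's tree class)."""
    feature: int
    threshold: float
    impurity: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def longest_repeated_chain(root: Optional[Node]) -> List[Node]:
    """Return the longest parent-to-child run of nodes sharing one condition."""
    best: List[Node] = []
    # Each stack entry holds a node and the run of identical conditions ending at it.
    # An explicit stack avoids hitting the recursion limit on very deep trees.
    stack = [(root, [root])] if root is not None else []
    while stack:
        node, run = stack.pop()
        if len(run) > len(best):
            best = run
        for child in (node.left, node.right):
            if child is None:
                continue
            same = (child.feature, child.threshold) == (node.feature, node.threshold)
            # Extend the run if the child repeats the parent's condition, else restart.
            stack.append((child, run + [child] if same else [child]))
    return best


# Small hand-built tree in which the condition (feature 3, threshold 0.5)
# repeats three times along one branch.
tail = Node(feature=1, threshold=2.0, impurity=0.1)
looping = Node(3, 0.5, 0.42, left=Node(3, 0.5, 0.42, left=Node(3, 0.5, 0.42, left=tail)))
root = Node(0, 1.5, 0.48, left=looping, right=Node(2, 7.0, 0.3))

chain = longest_repeated_chain(root)
for depth, node in enumerate(chain):
    print(depth, node.feature, node.threshold, node.impurity)
```

In a healthy tree such a run would normally have length 1, because a split
that repeats its parent's condition cannot separate the samples any further.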
I've also attached an example of the tree generated by Python's scikit-learn on
the same data set with the same parameters. In Python the tree usually doesn't
get deeper than 20 nodes.