Hello,
Reading around on the theory behind tree-based regression, I concluded
that there are various reasons to stop exploring the tree when a given
node has been reached. Among these are the following two:
1. When starting to process a node, if its size (row count) is less than
X, then consider it a leaf.
2. When a split for a node is considered, if either side of the split has
a size less than Y, then ignore that split when selecting the best one.
As an example, let's consider a node with 45 rows that, for a given
split, creates two children containing 5 and 40 rows respectively.
If I set X to 50, then the node is a leaf and no split is attempted.
If I set X to 10 and Y to 15, then the splits are computed, but because
one of the children has fewer than 15 rows, that split is ignored.
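
To make the two rules concrete, here is a minimal sketch in plain Python.
It is not Spark code: the function names and the (left, right) size pairs
are made up for illustration, and it only models the row counts, not the
impurity computation that would pick the best split.

```python
from typing import List, Optional, Tuple


def is_leaf_by_size(node_size: int, x: int) -> bool:
    """Rule 1: a node with fewer than x rows is a leaf; no split is tried."""
    return node_size < x


def is_valid_split(left_size: int, right_size: int, y: int) -> bool:
    """Rule 2: a split is kept only if both children have at least y rows."""
    return left_size >= y and right_size >= y


def splits_to_consider(
    node_size: int,
    candidates: List[Tuple[int, int]],
    x: int,
    y: int,
) -> Optional[List[Tuple[int, int]]]:
    """Return None if the node is a leaf by rule 1, otherwise the
    candidate (left, right) child sizes that survive rule 2."""
    if is_leaf_by_size(node_size, x):
        return None
    return [(l, r) for (l, r) in candidates if is_valid_split(l, r, y)]


# A 45-row node with two hypothetical candidate splits.
candidates = [(5, 40), (20, 25)]

# X = 50: the node is too small, so it is a leaf and nothing is evaluated.
leaf_case = splits_to_consider(45, candidates, x=50, y=15)

# X = 10, Y = 15: splits are computed, but the one with a 5-row child is dropped.
filtered_case = splits_to_consider(45, candidates, x=10, y=15)
```

With these inputs, leaf_case is None and filtered_case keeps only the
(20, 25) split.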
I'm using DecisionTreeRegressor and RandomForestRegressor on our data
and because the former is implemented using the latter, they both share
the same parameters.
Going through those parameters, I found minInstancesPerNode, which to me
is the Y value, but I could not find any parameter for the X value.
Have I missed something?
If not, would there be a way to implement this?
Regards