It seems that splitting will always stop when a node's row count is less than max(X, Y). So are the two criteria actually different?
On Tue, Jun 27, 2017 at 11:07 PM, OBones <obo...@free.fr> wrote:
> Hello,
>
> Reading around on the theory behind tree based regression, I concluded
> that there are various reasons to stop exploring the tree when a given node
> has been reached. Among these, I have those two:
>
> 1. When starting to process a node, if its size (row count) is less than X
> then consider it a leaf
> 2. When a split for a node is considered, if any side of the split has its
> size less than Y, then ignore it when selecting the best split
>
> As an example, let's consider a node with 45 rows, that for a given split
> creates two children, containing 5 and 35 rows respectively.
>
> If I set X to 50, then the node is a leaf and no split is attempted
> if I set X to 10 and Y to 15, then the splits are computed but because one
> of them has less than 15 rows, that split is ignored.
>
> I'm using DecisionTreeRegressor and RandomForestRegressor on our data and
> because the former is implemented using the latter, they both share the
> same parameters.
> Going through those parameters, I found minInstancesPerNode which to me is
> the Y value, but I could not find any parameter for the X value.
> Have I missed something?
> If not, would there be a way to implement this?
>
> Regards
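To make the two rules concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: the function `can_split` and the parameter names `min_node_size` (X) and `min_child_size` (Y) are hypothetical, chosen only for illustration. It assumes a node with 45 rows whose candidate split yields children of 5 and 40 rows.

```python
def can_split(left_rows, right_rows, min_node_size, min_child_size):
    """Return True if a candidate split survives both stopping rules.

    min_node_size  -- X (hypothetical name): a node smaller than this is a leaf.
    min_child_size -- Y (hypothetical name): a split is ignored if either
                      child would be smaller than this.
    """
    node_rows = left_rows + right_rows

    # Rule 1 (X): the node itself is too small, so no split is attempted.
    if node_rows < min_node_size:
        return False

    # Rule 2 (Y): the split is ignored if either child is too small.
    if min(left_rows, right_rows) < min_child_size:
        return False

    return True


# A 45-row node with a candidate split into 5 and 40 rows.
blocked_by_x = can_split(5, 40, min_node_size=50, min_child_size=1)
blocked_by_y = can_split(5, 40, min_node_size=10, min_child_size=15)
allowed = can_split(5, 40, min_node_size=10, min_child_size=5)

# Rule 2 alone would accept this split (5 >= 5), but rule 1 rejects the node.
x_not_redundant = can_split(5, 40, min_node_size=50, min_child_size=5)
```

Under rule 2 alone, a node with fewer than 2 * Y rows can never be split, because at least one child would fall below Y. In this sketch, rule 1 therefore only changes the outcome when X is greater than 2 * Y, as in the last call above.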