It seems that splitting will always stop when a node's row count is less
than max(X, Y).
So are the two criteria really different?
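
To make the comparison concrete, here is a minimal sketch of the two checks
as they are described in the quoted message below. It is plain Python, not
Spark code: the function names and the candidate splits are hypothetical and
chosen only for illustration, and X and Y are the thresholds from the
original question.

```python
from typing import Iterable, Tuple


def is_leaf_by_size(node_rows: int, x: int) -> bool:
    """Criterion 1 (X): a node with fewer than x rows is never split."""
    return node_rows < x


def split_is_admissible(left_rows: int, right_rows: int, y: int) -> bool:
    """Criterion 2 (Y): a split is kept only if both children have >= y rows."""
    return left_rows >= y and right_rows >= y


def can_split(node_rows: int,
              candidate_splits: Iterable[Tuple[int, int]],
              x: int,
              y: int) -> bool:
    """True if the node passes the X check and at least one split passes Y."""
    if is_leaf_by_size(node_rows, x):
        return False
    return any(split_is_admissible(left, right, y)
               for left, right in candidate_splits)


# Hypothetical candidate splits (left rows, right rows) for a 45-row node.
candidates = [(5, 40), (20, 25)]

leaf_by_x = can_split(45, candidates, x=50, y=1)       # stopped by X
kept_by_y = can_split(45, candidates, x=10, y=15)      # 20/25 split survives Y
only_small = can_split(45, [(5, 40)], x=10, y=15)      # only split fails Y
stopped_by_y = can_split(20, [(10, 10), (5, 15)], x=1, y=15)  # 20 rows < 2 * 15
```

Under these assumptions the two checks are not interchangeable: X looks only
at the parent's size, while Y filters individual candidate splits. Y alone
also rules out any node with fewer than 2 * Y rows, since both children
cannot reach Y rows in that case.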



On Tue, Jun 27, 2017 at 11:07 PM, OBones <obo...@free.fr> wrote:

> Hello,
>
> Reading around on the theory behind tree based regression, I concluded
> that there are various reasons to stop exploring the tree when a given node
> has been reached. Among these, I have those two:
>
> 1. When starting to process a node, if its size (row count) is less than X
> then consider it a leaf
> 2. When a split for a node is considered, if any side of the split has its
> size less than Y, then ignore it when selecting the best split
>
> As an example, let's consider a node with 45 rows that, for a given split,
> creates two children containing 5 and 40 rows respectively.
>
> If I set X to 50, then the node is a leaf and no split is attempted.
> If I set X to 10 and Y to 15, then the splits are computed, but because one
> of the children has fewer than 15 rows, that split is ignored.
>
> I'm using DecisionTreeRegressor and RandomForestRegressor on our data and
> because the former is implemented using the latter, they both share the
> same parameters.
> Going through those parameters, I found minInstancesPerNode which to me is
> the Y value, but I could not find any parameter for the X value.
> Have I missed something?
> If not, would there be a way to implement this?
>
> Regards
>
>
>
