On 01/16/2012 09:44 AM, Andreas wrote:
> Hi Everybody.
> I'm still trying to hack at the trees. This time I stumbled across the
> computation of the Gini index.
> Could someone please explain this to me?
> Hastie, Tibshirani and Friedman told me this is computed as
> \sum_{k} p_{mk}*(1- p_{mk})
> where k enumerates the classes and m denotes a node (I guess that
> means in the end, one sums over m)
>
> It is not clear to me how what is done in the code is equivalent to this.
> If I understood correctly, this is what the code does:
>
> (\sum_m (n_m**2 - \sum_k n_{mk}**2) / n_m) / \sum_m n_m
>
> where n_{mk} denotes the count of class k in node m,
> and n_m is the total count of points in node m.
>
> If I compute both values for the split left=(3,1), right=(1,2),
> I end up with 59/72 for the first formula and 17/42 for the second formula.
>
> Can someone tell me what I got wrong?
>
I think I found my mistake. The Gini indexes of the nodes
are not just summed up but averaged, each weighted by its share
of the points, n_m / \sum_m n_m.
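
For anyone who wants to check the numbers, here is a minimal sketch in
plain Python (not the scikit-learn code; the function names are made up
for illustration). It relies on the identity
n_m * \sum_k p_{mk}*(1 - p_{mk}) = (n_m**2 - \sum_k n_{mk}**2) / n_m,
so the count-based expression should match the count-weighted average
of the per-node Gini indexes, assuming I read the code correctly.

```python
from fractions import Fraction


def gini(counts):
    """Gini index of a single node: sum_k p_k * (1 - p_k)."""
    n = sum(counts)
    return sum(Fraction(c, n) * (1 - Fraction(c, n)) for c in counts)


def weighted_gini(nodes):
    """Average of the node Gini indexes, weighted by n_m / sum_m n_m."""
    total = sum(sum(counts) for counts in nodes)
    return sum(Fraction(sum(counts), total) * gini(counts) for counts in nodes)


def count_form(nodes):
    """Count-based expression: (sum_m (n_m**2 - sum_k n_mk**2) / n_m) / sum_m n_m."""
    total = sum(sum(counts) for counts in nodes)
    per_node = (
        Fraction(sum(counts) ** 2 - sum(c * c for c in counts), sum(counts))
        for counts in nodes
    )
    return sum(per_node) / total


# Class counts per node for the split left=(3,1), right=(1,2).
split = [(3, 1), (1, 2)]

print(sum(gini(counts) for counts in split))  # unweighted sum: 59/72
print(weighted_gini(split))                   # weighted average: 17/42
print(count_form(split))                      # count-based form: 17/42
```

Exact fractions are used so the results can be compared directly with
the hand computation instead of with rounded floats.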