Hello dear R-help members,

I would appreciate any help in understanding how the rpart function computes
the "improve" (which is given in fit$split) when using the
split='information' parameter.

Thanks to Professor Atkinson help, I was able to find how this is done in
the case that split='gini'.  By following the explanation here:
http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf
But the calculation of the information (deviance) impurity is still a
mystery for me.
Might you help with explaining it?


Bellow is some R code simply showing how the gini is computed (and how the
information is not as clear)


# creating data
set.seed(1324)
y <- sample(c(0,1), 20, T)
x <- y
x[1:5] <- 0

# manually making the first split
obs_L <- y[x<.5]
obs_R <- y[x>.5]
n_L <- sum(x<.5)
n_R <- sum(x>.5)
n <- length(x)


calc.impurity <- function(func = gini)
{
impurity_root <- func(prop.table(table(y)))
impurity_L <- func(prop.table(table(obs_L)))
 impurity_R <-func(prop.table(table(obs_R)))
imp <- impurity_root - ((n_L/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757
 imp*n
}

# for "gini"
require(rpart)
fit <- rpart(y~x, method = "class", parms=list(split='gini'))
fit$split[,3] # 5.384615
gini <- function(p) {sum(p*(1-p))}
calc.impurity(gini) # 5.384615 # success!


# for "information" I fail...

fit <- rpart(y~x, method = "class", parms=list(split='information'))
fit$split[,3] # why is improve here 6.84029 ?

entropy <- function(p) {
if(any(p==1)) return(0) # works for the case when y has only 0 and 1
categories...
 -sum(p*log(p))
}
calc.impurity(entropy) # 9.247559 != 6.84029




Thanks,
Tal


----------------Contact
Details:-------------------------------------------------------
Contact me: [email protected] |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------

        [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to