On Oct 4, 2013, at 2:16 PM, Mary Kindall <mary.kind...@gmail.com> wrote:
> This reproducible example is from the help page of 'gbm' in R.
>
> I ran the following code in R, and it works fine as long as the response
> is numeric. The problem starts when I convert the response from numeric
> to binary (0/1): it gives me an error.
>
> My question is: will converting the response from numeric to binary have
> this much effect?
>
> Help page code:
>
> N <- 1000
> X1 <- runif(N)
> X2 <- 2*runif(N)
> X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
> X4 <- factor(sample(letters[1:6],N,replace=TRUE))
> X5 <- factor(sample(letters[1:3],N,replace=TRUE))
> X6 <- 3*runif(N)
> mu <- c(-1,0,1,2)[as.numeric(X3)]
>
> SNR <- 10 # signal-to-noise ratio
> Y <- X1**1.5 + 2 * (X2**.5) + mu
> sigma <- sqrt(var(Y)/SNR)
> Y <- Y + rnorm(N,0,sigma)
>
> # introduce some missing values
> X1[sample(1:N,size=500)] <- NA
> X4[sample(1:N,size=300)] <- NA
>
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
>
> # fit initial model
> gbm1 <-
> gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
>     data=data,                   # dataset
>     var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
>                                  # +1: monotone increase,
>                                  #  0: no monotone restrictions
>     distribution="gaussian",     # see the help for other choices
>     n.trees=1000,                # number of trees
>     shrinkage=0.05,              # shrinkage or learning rate,
>                                  # 0.001 to 0.1 usually work
>     interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
>     bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
>     train.fraction = 0.5,        # fraction of data for training,
>                                  # first train.fraction*N used for training
>     n.minobsinnode = 10,         # minimum total weight needed in each node
>     cv.folds = 3,                # do 3-fold cross-validation
>     keep.data=TRUE,              # keep a copy of the dataset with the object
>     verbose=FALSE)               # don't print out progress
>
> gbm1
> summary(gbm1)
>
>
> Now I slightly change the response variable to make it binary.
>
> Y[Y < mean(Y)] = 0   #My edit
> Y[Y >= mean(Y)] = 1  #My edit
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
> fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit
>
> gbm2 <-
> gbm(fmla,                        # formula
>     data=data,                   # dataset
>     distribution="bernoulli",    # My edit
>     n.trees=1000,                # number of trees
>     shrinkage=0.05,              # shrinkage or learning rate,
>                                  # 0.001 to 0.1 usually work
>     interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
>     bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
>     train.fraction = 0.5,        # fraction of data for training,
>                                  # first train.fraction*N used for training
>     n.minobsinnode = 10,         # minimum total weight needed in each node
>     cv.folds = 3,                # do 3-fold cross-validation
>     keep.data=TRUE,              # keep a copy of the dataset with the object
>     verbose=FALSE)               # don't print out progress
>
> gbm2
>
>
> > gbm2
> gbm(formula = fmla, distribution = "bernoulli", data = data,
>     n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
>     shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
>     cv.folds = 3, keep.data = TRUE, verbose = FALSE)
> A gradient boosted model with bernoulli loss function.
> 1000 iterations were performed.
> The best cross-validation iteration was .
> The best test-set iteration was .
> Error in 1:n.trees : argument of length 0
>
>
> My question is: will binarizing the response have so much effect that it
> does not find anything useful in the predictors?
>
> Thanks

Sure, it's possible. See this page for a good overview of why you should not dichotomize continuous data:

  http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous

Regards,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
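
Separately from the dichotomization question, one thing that may be worth checking: the modified formula wraps the response in factor(Y), while the gbm help describes distribution="bernoulli" as being for 0-1 outcomes. A factor stores its levels as the integer codes 1 and 2, so if the response reaches the fitting code as those codes, the loss may not be computable, which would be consistent with the empty "best iteration" lines in the output. Whether that is what happens in this particular gbm version is an assumption, not something confirmed in this thread. The sketch below uses only base R to show the difference between the two codings; the toy vector y_cont and all variable names are made up for illustration.

```r
# Toy continuous response (hypothetical data, not the Y from the thread).
y_cont <- c(0.2, 1.7, 3.1, 0.9, 2.4, 2.8)

# Compute the cutoff once, before overwriting anything, so both
# comparisons would use the same threshold.
cutoff <- mean(y_cont)

# Numeric 0/1 coding of the binarized response.
y_bin <- as.numeric(y_cont >= cutoff)

# The same values wrapped in a factor.
y_fac <- factor(y_bin)

print(y_bin)                             # 0 0 1 0 1 1
print(as.numeric(y_fac))                 # 1 1 2 1 2 2  (internal level codes)
print(as.numeric(as.character(y_fac)))   # 0 0 1 0 1 1  (original values recovered)
```

If this is the cause, the thing to try would be leaving Y as 0/1 numbers in the data frame and using Y~X1+X2+X3+X4+X5+X6 as the formula, without factor().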