[R] Using sample to create Training and Test sets
Forgive the newbie question. I want to select random rows from my data.frame to create a test set (which I can do), but then I want to create a training set using what's left over.

Example code:

    acc <- read.table("accOUT.txt", header=T, sep = ",", row.names=1)
    # select 400 random rows in data
    training <- acc[sample(1:nrow(acc), 400, replace=TRUE),]
    # try to get what's left of acc not in training
    testset <- acc[-training, ]

This fails with the following error:

    Error: invalid subscript type
    In addition: Warning message:
    - not meaningful for factors in: Ops.factor(left)

I then try:

    testset <- acc[!training, ]

which gives me the warning message

    ! not meaningful for factors in: Ops.factor(left)

and if I look at testset, it is 400 rows of NAs, which clearly isn't right.

Can anyone tell me what I'm doing wrong?

Thanks in advance,
Chris

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Using sample to create Training and Test sets
Note that the single split sample technique is not competitive with other approaches unless the sample size exceeds around 20,000.

Frank

--
Frank E Harrell Jr
Professor and Chair, School of Medicine
Department of Biostatistics, Vanderbilt University
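Frank's note does not name the other approaches. As an assumption on my part, k-fold cross-validation is one widely used resampling alternative to a single split. A minimal sketch follows, using the built-in mtcars data and an arbitrary linear model as stand-ins (neither comes from the thread):

```r
# Sketch only: 5-fold cross-validation instead of one train/test split.
# mtcars and the model mpg ~ wt are illustrative stand-ins, not from the thread.
set.seed(1)                          # only to make the sketch reproducible
dat <- mtcars
k <- 5

# Assign every row to one of k folds at random, with near-equal fold sizes.
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

cv_mse <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[fold != i, ]          # fit on all folds except fold i
  test  <- dat[fold == i, ]          # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt, data = train)
  pred  <- predict(fit, newdata = test)
  cv_mse[i] <- mean((test$mpg - pred)^2)
}

mean(cv_mse)   # cross-validated estimate of mean squared prediction error
```

Every row is used for both fitting and testing (in different folds), so the error estimate does not hinge on one particular random split.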
Re: [R] Using sample to create Training and Test sets
Here's one possibility:

    idx <- sample(nrow(acc))
    training <- acc[idx[1:400], ]
    testset <- acc[-idx[1:400], ]

Andy
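To spell out why this works where the original attempt did not: in the original code, training is a data frame, so acc[-training, ] tries to negate data (including factor columns) rather than row numbers. Keeping the sampled row numbers in their own vector lets you reuse them with a minus sign. Sampling without replacement also avoids picking the same row twice. A self-contained sketch of the same idea, with the built-in iris data standing in for acc and 100 standing in for 400 (both are my substitutions):

```r
# Sketch only: iris (150 rows) stands in for the poster's 'acc' data frame,
# and 100 stands in for the 400 training rows.
set.seed(42)                         # only to make the sketch reproducible
acc <- iris
n_train <- 100

# Sample row *numbers* without replacement, so no row is picked twice.
train_idx <- sample(nrow(acc), n_train)

training <- acc[train_idx, ]         # the sampled rows
testset  <- acc[-train_idx, ]        # negative indices drop those rows

# The two sets are disjoint and together cover every row.
stopifnot(nrow(training) + nrow(testset) == nrow(acc))
stopifnot(length(intersect(rownames(training), rownames(testset))) == 0)
```

Note that negative indexing relies on train_idx holding integer positions; it would not work with row names or a logical vector.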
Re: [R] Using sample to create Training and Test sets
Chris Arthur wrote:

    Forgive the newbie question, I want to select random rows from my data.frame to create a test set (which I can do) but then I want to create a training set using what's left over.

The caret package has a function, createDataPartition, that does the split taking into account the distribution of the outcome. This might be good in classification cases where one or more classes have low percentages in the data set.

There is more detail in this pdf:
http://cran.r-project.org/web/packages/caret/vignettes/caretMisc.pdf
and examples in this pdf:
http://cran.r-project.org/web/packages/caret/vignettes/caretTrain.pdf

Max
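For illustration, here is a rough base-R sketch of a split that respects the class distribution. It is not caret's implementation, only a hand-rolled stratified sample under my own assumptions: iris stands in for the data, Species is the outcome, and the training fraction of 0.7 is arbitrary. The caret call at the end runs only if the package happens to be installed.

```r
# Sketch only: stratified split in base R, not caret's own algorithm.
# iris, Species and p = 0.7 are illustrative stand-ins, not from the thread.
set.seed(7)                          # only to make the sketch reproducible
dat <- iris                          # outcome: Species, 50 rows per class
p <- 0.7                             # hypothetical training fraction

# Group row numbers by class, then sample a fraction p within each class.
rows_by_class <- split(seq_len(nrow(dat)), dat$Species)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(p * length(rows)))]
}), use.names = FALSE)

training <- dat[train_idx, ]
testset  <- dat[-train_idx, ]

table(training$Species)              # 35 rows per class
table(testset$Species)               # 15 rows per class

# The equivalent request through caret, if it is installed:
if (requireNamespace("caret", quietly = TRUE)) {
  caret_idx <- caret::createDataPartition(dat$Species, p = p, list = FALSE)
  table(dat$Species[caret_idx])
}
```

Sampling within each class keeps rare classes represented in both sets, which a plain random sample of rows does not guarantee.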