Re: [R] Question About Repeat Random Sampling from a Data Frame

Adam Carr Mon, 21 Dec 2009 09:54:20 -0800

Good Afternoon Dr. Winsemius:

You ask some very good questions and make excellent points; my responses are 
below. I've tried to extract your questions and provide answers just to reduce 
the clutter.

1. You might want to provide statistical justification for the otherwise 
puzzling sampling strategy. 

I assume you mean my overall process of random sampling from a large data set. 
The data set is comprised of observations collected over four years. Although 
the basis for sampling would make a good four-frame Dilbert cartoon if it could 
be condensed enough, my answer begins with the unfortunate truth that there is 
a great divide between the technical and marketing groups at the business where 
I am employed. Many powerful marketing executives, some with technical 
backgrounds, feel that there is something fundamentally wrong with the 
manufacturing process because the data generated over the long term is not 
approximately normally distributed. My task was to examine this set of data, 
trying to keep the representation of Y, N and F approximately equal in the 
sample when compared to the large set, to determine if any subset exhibits the 
holy grail-like normal distribution characteristics. I don't feel that this is 
statistical justification, but it is the reason why I am doing this.

2. It would help if you explained what you are attempting here in ordinary 
English. There are 10 elements in mysamples, each of which is a 100 x 5 
dataframe, and mat is just one 100 x 5 matrix, which you seem to be referencing 
incorrectly, given the fact that it has two, rather than one, dimension. 
Furthermore, those dataframes may not be of a uniform class, since you said you 
had a character variable. Do you really want these all in a character type 
matrix, which would be what is likely to happen given R's requirement that 
matrix elements be of only one class? What you say above suggests not.

It seems from your response that I incorrectly assumed that a list is not the 
same as a data frame. I started down this path after reading the questions and 
answers to a similar problem where the r-help responder suggested a two step 
process and said that the list must be converted to another form in order to be 
available for analysis. 

And you are absolutely correct that I do not want each sample in a character 
type matrix. 

In plain English, I hope, I am simply trying to iterate the process of removing 
random samples from the large data set, and then saving these samples in a 
format that is available for simple analysis. For example, if I remove five 
hundred mysample sets, each of which is composed of a 100 x 5 sample of the 
large data set, I am interested in determining the skewness, kurtosis, mean and 
standard deviation of each of the four numeric variables in each of the five 
hundred mysample sets.
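
A minimal sketch of that workflow, assuming a stand-in data frame, since the real data set is not shown in this thread. The column names (tensile, yield, elongation, hardness, indicator) are hypothetical, the samples are drawn unweighted for brevity, and skew() and kurt() are simple moment-based helpers written here for illustration rather than functions from any particular package:

```r
set.seed(1)  # reproducible illustration only

# Hypothetical stand-in for the real 1637-row data set
dataset <- data.frame(
  tensile    = rnorm(1637, 60, 5),
  yield      = rnorm(1637, 40, 4),
  elongation = rnorm(1637, 25, 3),
  hardness   = rnorm(1637, 70, 6),
  indicator  = sample(c("Y", "N", "F"), 1637, replace = TRUE),
  stringsAsFactors = FALSE
)

# Simple moment-based helpers (illustrative definitions)
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2

# A list of 500 samples, each a 100 x 5 data frame
mysamples <- lapply(1:500, function(i)
  dataset[sample(nrow(dataset), 100, replace = TRUE), ])

# Summarise the numeric columns of one sample: 4 statistics x 4 variables
describe <- function(d) {
  num <- d[sapply(d, is.numeric)]
  sapply(num, function(x)
    c(mean = mean(x), sd = sd(x), skewness = skew(x), kurtosis = kurt(x)))
}

# One 4 x 4 summary matrix per sample
results <- lapply(mysamples, describe)
results[[1]]
```

Each element of results is a small matrix with one row per statistic and one column per numeric variable, so the samples never need to leave the list.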

3. Sorting out such problems is best done with smaller test objects. I was 
surprised to see...type character.

I agree. I began to do this with a small test data set but it was late last 
evening and I realized that I should ask for help before proceeding on what I 
thought might be incorrect assumptions. I was clearly mistaken in thinking that a list 
needed to be converted to a data frame in order to be available for analysis. 

Thank you for taking the time to respond. The discussion and suggestions are 
very helpful. 

Adam

________________________________
From: David Winsemius <dwinsem...@comcast.net>

Cc: r-help@r-project.org
Sent: Mon, December 21, 2009 11:23:43 AM
Subject: Re: [R] Question About Repeat Random Sampling from a Data Frame

On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:

> Good Morning:
> 
> I've read many, many posts on the r-help system and I feel compelled to 
> quickly admit that I am relatively new to R. I do have several reference 
> books around me, but I cannot count myself among the fortunate who seem to 
> have strong programming intuition.
> 
> I have a data set consisting of 1637 observations of five variables: tensile 
> strength, yield strength, elongation, hardness and a character indicator with 
> three levels: (Y)es, (N)o, and (F)ail.
> 
> My objective is to randomly sample various subsets from this data set and 
> then evaluate these subsets using simple parameters, among them tests for 
> normality, shape and skewness. The data set is ordered by the character 
> variable prior to sampling, and the samples are weighted to mirror 
> representation in an overall, physical process.
> 
> I am sampling the data set using this code:
> 
> sample <- dataset[sample(1:1637, 500,
>   prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
>   replace = TRUE), ]
> 
> What I would like to do is iterate this process to create many (say 500 or 
> more) sampled sets of n=500 and then evaluate each set for the parameters of 
> interest. I would actually be evaluating each variable within each subset for 
> my characteristic of interest. I am familiar with sampling and saving single 
> columns of data to do this sort of thing, but I am not sure how to accomplish 
> this with a multiple-variable data set.
> 
> For example, I am currently iterating this using a clunky process:
> 
> mysamples<-list()
> for (i in 1:10){
> mysamples[[i]] <- dataset[sample(1:1637, 100,
>   prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197), rep(1227.8/1637, 927)),
>   replace = TRUE), ]
> }
> 

Using lists to store intermediate results is not considered clunky in R. (You 
might want to provide statistical justification for the otherwise puzzling 
sampling strategy.)
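
One possible way to build such a list without an explicit loop is sketched below, under stated assumptions: the data frame is a made-up stand-in, the mapping of the three block sizes (513, 197, 927) to the Y, N and F levels is arbitrary because the post does not say which is which, and draw() is a hypothetical helper name. The weights themselves are copied from the original post.

```r
set.seed(2)  # reproducible illustration only

# Hypothetical stand-in data, ordered by indicator as in the original post.
# Which level has which count is an assumption made for this sketch.
dataset <- data.frame(
  hardness  = rnorm(1637, 70, 6),
  indicator = rep(c("Y", "N", "F"), times = c(513, 197, 927)),
  stringsAsFactors = FALSE
)

# Per-row sampling weights from the original post, built once
w <- rep(c(163.7, 245.5, 1227.8) / 1637, times = c(513, 197, 927))

# Hypothetical helper: one weighted sample of 100 rows
draw <- function() {
  dataset[sample(nrow(dataset), 100, replace = TRUE, prob = w), ]
}

# simplify = FALSE keeps each draw as a data frame inside a list
mysamples <- replicate(10, draw(), simplify = FALSE)

# Columns are reachable directly; no conversion step is needed
sapply(mysamples, function(d) mean(d$hardness))
```

The final line returns one mean per sample, which is the general pattern for applying any single-column statistic across the list.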

> But this leaves me with the additional task of defining each mysample[i] 
> iteration and converting it to a form on which I can apply a standard 
> statistical test like mean() or skewness() to the variable columns within 
> each subset. I have attempted to iteratively convert these lists using this 
> code:
> 
> mat<-matrix(nrow=100,ncol=5)
> for (i in 1:length(mysamples))
> {mat[i]<-do.call('rbind',mysamples[i])}

It would help if you explained what you are attempting here in ordinary 
English. There are 10 elements in mysamples, each of which is a 100 x 5 
dataframe, and mat is just one 100 x 5 matrix, which you seem to be referencing 
incorrectly, given the fact that it has two, rather than one, dimension. 
Furthermore, those dataframes may not be of a uniform class, since you said you 
had a character variable. Do you really want these all in a character type 
matrix, which would be what is likely to happen given R's requirement that 
matrix elements be of only one class? What you say above suggests not.

> 
> but running the code generates the error message: number of items to replace 
> is not a multiple of replacement length.

Because of the way you are referencing the matrix, probably. If you wanted a 10 
x 100 x 5 array, then create an array. In R, as far as I can tell anyway, 
matrices are necessarily of 2 dimensions. Tables and arrays can be of higher 
dimension.
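
To illustrate the array suggestion, here is a sketch that assumes numeric-only samples (the character indicator column is left out, since an array holds a single type). The sample data and column names are hypothetical:

```r
set.seed(3)  # reproducible illustration only

# Hypothetical numeric-only samples: 10 data frames of 100 rows x 4 columns
mysamples <- replicate(10,
  data.frame(tensile = rnorm(100), yield = rnorm(100),
             elongation = rnorm(100), hardness = rnorm(100)),
  simplify = FALSE)

# A 3-d array with one 100 x 4 slice per sample
arr <- array(NA_real_, dim = c(10, 100, 4),
             dimnames = list(NULL, NULL, names(mysamples[[1]])))

for (i in seq_along(mysamples)) {
  # Fill a whole slice; arr[i] alone would address a single cell
  arr[i, , ] <- as.matrix(mysamples[[i]])
}

# Mean of each variable within each sample: a 10 x 4 matrix
apply(arr, c(1, 3), mean)
```

Indexing with all three subscripts (arr[i, , ]) is what makes the assignment lengths agree, whereas a single subscript such as mat[i] refers to just one element.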

> I have tried unsuccessfully, by reading many, many helpful r-help emails on 
> this error, to understand my probably obvious mistake.

Sorting out such problems is best done with smaller test objects. I was 
surprised to see that you thought it was necessary to convert dataframes to 
matrices in order to calculate descriptive statistics. Nothing could be farther 
from the truth. Furthermore, if for some other more valid reason you wanted a 
list of matrices, there are perfectly good functions that will convert a 
dataframe to a matrix: as.matrix(), remembering of course that if there is a 
single character variable in the dataframe, the entire matrix will be of 
type character, or data.matrix(), which instead coerces every column to numeric.
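
A small illustration on a made-up three-row data frame (the names are hypothetical). It shows statistics computed directly on data frame columns, and how the two conversion functions treat a character column in current versions of R: as.matrix() produces a character matrix, while data.matrix() coerces to numeric.

```r
# Hypothetical three-row data frame with one numeric and one character column
d <- data.frame(hardness  = c(70.1, 68.4, 72.9),
                indicator = c("Y", "N", "F"),
                stringsAsFactors = FALSE)

# Descriptive statistics work directly on data frame columns
mean(d$hardness)
sapply(d[sapply(d, is.numeric)], sd)

# as.matrix() on mixed columns gives a character matrix
typeof(as.matrix(d))

# data.matrix() returns a numeric matrix instead
is.numeric(data.matrix(d))

# Dropping the character column first keeps the values numeric
m <- as.matrix(d[sapply(d, is.numeric)])
typeof(m)
```

Selecting the numeric columns before converting is one way to get a matrix without losing the numeric type.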
> 
> Based on the small amount that I think I know about R it seems to me that 
> sampling the data frame and containing the samples in a list is likely a 
> pretty inefficient way to do this task. Any help that any of you could 
> provide to assist me in iteratively sampling the data frame, and storing the 
> samples in a form on which I can apply other statistical tests would be 
> greatly appreciated.
> 
> Thank you very much for taking the time to consider my questions.
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
