95% confidence interval
Hello, I am currently taking a first course in statistics, and I was hoping that perhaps someone might be kind enough to answer a question for me. I understand that, while a quantitative variable may not be normally distributed, we may calculate the mean of the sample, and use facts about the Central Limit Theorem, to form a 95% confidence interval for the population mean. As far as I know, this means that in 95/100 samples, the interval will contain the true population mean.
This seems very useful at first, but then something begins to confuse me. Yes, we have an interval that may contain the true population mean, but ... if the distribution is heavily skewed to the right, say like income, why do we want an interval for the population mean, when we are taught that the median is a better measure of central tendency for skewed distributions? This is what confuses me. I hope that I have phrased my question in such a way that people can understand what I am saying, and why I am confused.
There is just one more thing I would like to get off my chest. My textbook talks about simple random sampling, where you can specify the probability of a sample being selected from the population. Yet, there are examples in the book which deal with conceptual populations, such as the set of all cars of a particular model which may be manufactured in the future. Suppose you have a sample of several of these autos, and you want to find a 95% confidence interval for mean miles/gallon. How is this an SRS when you can't specify the probability of a sample being selected, because the population is conceptual? Perhaps I am simply looking at everything the wrong way, but this is very confusing to me. Any help would be greatly appreciated.
___ Send a cool gift with your E-Card http://www.bluemountain.com/giftcenter/ = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
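The "95 out of 100 samples" reading of the interval can be checked by simulation. What follows is only a minimal sketch under stated assumptions: the population is taken to be exponential with mean 1 (a stand-in for a right-skewed variable such as income), and the sample size, number of repetitions, and seed are illustrative choices, not values from any textbook.

```python
import math
import random
import statistics

def ci_covers_mean(rng, n, true_mean):
    """Draw one sample of size n and report whether the usual
    mean +/- 1.96 * s / sqrt(n) interval contains the true mean."""
    # Assumption: exponential population, which is strongly right-skewed.
    sample = [rng.expovariate(1.0 / true_mean) for _ in range(n)]
    xbar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return xbar - 1.96 * se <= true_mean <= xbar + 1.96 * se

def coverage(reps=2000, n=50, true_mean=1.0, seed=0):
    """Fraction of repeated samples whose interval covers the true mean."""
    rng = random.Random(seed)
    hits = sum(ci_covers_mean(rng, n, true_mean) for _ in range(reps))
    return hits / reps

coverage_estimate = coverage()
print("estimated coverage:", coverage_estimate)

# For this population the mean (1.0) and the median (ln 2) differ,
# so an interval for the mean says nothing direct about the median.
population_median = math.log(2.0)
print("population median:", round(population_median, 3))
```

With a skewed population and a moderate n, the estimated coverage is typically somewhat below the nominal 95%, which is one reason the approximation is described as a large-sample result.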
basic stats question
Hello, I have a question regarding basic probability and statistics. If I understand correctly, the definition of independence holds for two events that are subsets of the same sample space. In other cases, we may need to construct a new sample space, such as with the flipping of a coin twice. Here, we construct the new sample space S=S1xS2={HH,HT,TH,TT}, where Si={H,T} for i=1,2. This way, two events that are independent, such as A="heads on first toss" and B="tails on second toss," are subsets of the same sample space. Now, the problem that I have is that, while it is not difficult to construct sample spaces intuitively for textbook problems, it is difficult to do so using these basic definitions of probability. For example, consider a problem where a manufacturer has five seemingly identical computers, though two are really defective and three are good. An order calls for two of the computers, and we want the probability of the event A="order is filled with two good computers." Intuitively, it is obvious that if D1 and D2 are the bad computers, and G1-G3 are the good computers, then S={D1D2,D1G1,D1G2,D1G3,D2G1,D2G2,D2G3,G1G2,G1G3,G2G3}. Thus, P(A)=3/10=0.30. However, I cannot think of any way of constructing the sample space using definitions like the Cartesian product. Perhaps this is because the second computer chosen depends on which computer is chosen first. Yet, another similar problem in my textbook states that the probabilities of a computer being good and defective (from a particular manufacturer) are 0.90 and 0.10, respectively. Then, if we want to test five computers, we may construct the sample space S=S1xS2xS3xS4xS5, where Si={G,D} for i=1,...,5. Hence, if A="all five computers tested are good," P(A)=(0.90)^5. Why is it that we can use the Cartesian product in this case but not in the other case? Is it that in the first case we are not performing an experiment, but just sampling?
Perhaps I am thinking about this too much, but it would be nice to be able to construct these sample spaces for problems using some sort of formulaic method, as opposed to intuition (perhaps this isn't the right way to view this subject?). Any help would be greatly appreciated.
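Both sample spaces above can be enumerated mechanically, which may help when comparing them. This is a sketch of one possible construction, not the textbook's method: the first space is built as 2-element subsets of the five computers (equivalently, ordered pairs from the Cartesian product with repeats removed), and the second as a five-fold Cartesian product of {G, D} with product probabilities. All variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

# Case 1: two of five computers are shipped, without replacement.
computers = ["D1", "D2", "G1", "G2", "G3"]
sample_space = list(combinations(computers, 2))  # 10 equally likely outcomes
event_a = [pair for pair in sample_space if all(c.startswith("G") for c in pair)]
p_two_good = Fraction(len(event_a), len(sample_space))

# Same experiment as ordered pairs: the Cartesian product S x S with the
# impossible "same computer twice" outcomes removed.
ordered_space = [(a, b) for a, b in product(computers, repeat=2) if a != b]
ordered_good = [(a, b) for a, b in ordered_space if a[0] == "G" and b[0] == "G"]
p_two_good_ordered = Fraction(len(ordered_good), len(ordered_space))

# Case 2: five independent tests, each good with probability 0.90.
p_single = {"G": Fraction(9, 10), "D": Fraction(1, 10)}
test_space = list(product("GD", repeat=5))  # 32 outcomes, not equally likely

def outcome_probability(outcome):
    """Probability of one outcome, assuming independent tests."""
    result = Fraction(1)
    for status in outcome:
        result *= p_single[status]
    return result

p_all_good = outcome_probability(("G",) * 5)
total_probability = sum(outcome_probability(o) for o in test_space)

print(p_two_good, p_two_good_ordered)  # 3/10 3/10
print(p_all_good, total_probability)   # 59049/100000 1
```

In the first case the full product S x S contains outcomes that cannot occur, so the sample space is a proper subset of it; in the second case every element of the product is possible and independence supplies the probabilities.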
probability definition
Hello, I have a question regarding the definition of probability. If I understand correctly, probability may be defined using just axioms. However, my textbook also uses a relative frequency definition, in which a probability is defined as being the proportion of times an outcome occurs in repeated trials of an experiment. This makes sense when one flip of the coin is one trial, and in repeated trials, the proportion of heads is 1/2. But what about a situation (an example in my textbook) where the probability of rain tomorrow is 0.70? How do you define this experiment? Perhaps you measure rainfall, temperature, pressure, etc. for each day over a long time period. Then the probability of rain tomorrow is the proportion of times that rain occurred on days with similar values for temperature, humidity, etc.? This seems a bit awkward to me. Also, how many trials of an experiment must one perform before knowing that the proportion converges to a particular fraction? Any help on interpretation of relative frequency probabilities would be greatly appreciated. In many cases, it seems difficult, at least for textbook examples, to define what the actual experiment is.
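For the coin example, the relative-frequency idea can be watched directly by simulating flips and recording the proportion of heads at a few checkpoints. This is a rough sketch only; the checkpoints, the seed, and the function name are arbitrary illustrative choices, and a simulated coin is of course an assumption standing in for a real one.

```python
import random

def running_proportions(checkpoints, p_heads=0.5, seed=1):
    """Simulate coin flips and return {n: proportion of heads after n flips}
    for each n in checkpoints."""
    rng = random.Random(seed)
    targets = set(checkpoints)
    heads = 0
    results = {}
    for n in range(1, max(checkpoints) + 1):
        heads += rng.random() < p_heads  # True counts as 1
        if n in targets:
            results[n] = heads / n
    return results

proportions = running_proportions([10, 100, 1000, 10000, 100000])
for n, prop in sorted(proportions.items()):
    print(n, round(prop, 4))
```

No finite number of trials makes the proportion exactly equal to the limit; the typical deviation shrinks roughly like sqrt(p(1-p)/n), so the number of trials determines how close one can expect to be rather than whether convergence has "finished".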
elementary prob./stats concepts
According to a textbook I have, a random sample of n observations on a random variable X is itself composed of n random variables, namely X1,X2,...,Xn. I am having some difficulties in figuring out how to interpret this. For example, suppose that you are considering the population of adult males in the U.S., and the random variable is weight. If you take a random sample of n individuals, are the elements of the sample random (prior to observing them, of course) because you might observe something different in another sample due to measurement error? Or perhaps you might get something different if you took the sample at a different time when weight has changed? Also, if the elements of a random sample are random variables themselves, do they have their own parameters, such as mean and standard deviation, as well as their own density functions and cumulative distribution functions? Also, if a statistic is a function of random variables, can a statistic take the form of a density function with a random vector representing the n variables? I know, conceptually, that the sampling distribution of a statistic is purely theoretical and that it represents how a statistic varies from one sample to another. Mathematically, however, I do not understand how to represent this, or if the sampling distribution of a statistic is analogous to the distribution of a random variable which may have a density function. I do not know if these questions even make any sense, but the concepts are fairly confusing to me. Any help would be greatly appreciated.
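One way to make the sampling distribution of a statistic concrete is to draw many samples and look at how the statistic varies across them. The sketch below assumes, purely for illustration, a normal population of weights with mean 190 and standard deviation 30 (hypothetical numbers, not real data) and uses the sample mean as the statistic.

```python
import math
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
POP_MEAN = 190.0
POP_SD = 30.0

def sample_mean(rng, n):
    """One realization of the statistic X-bar from a random sample of size n.
    Each draw plays the role of one X_i, with the population's distribution."""
    return statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(n))

rng = random.Random(3)
n = 25
means = [sample_mean(rng, n) for _ in range(5000)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
theoretical_sd = POP_SD / math.sqrt(n)  # sigma / sqrt(n) = 6.0

print("mean of sample means:", round(mean_of_means, 2))
print("sd of sample means:  ", round(sd_of_means, 2))
print("sigma / sqrt(n):     ", theoretical_sd)
```

Here the randomness comes from which individuals happen to be selected, not from measurement error: each X_i has the population's mean and standard deviation, while the statistic X-bar has its own distribution with the same mean and standard deviation sigma/sqrt(n).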
normal approx. to binomial
Hello, I have a question regarding the so-called normal approximation to the binomial distribution. According to most textbooks I have looked at (these are undergraduate stats books), there is some talk of how a binomial random variable is approximately normal for large n, and may be approximated by the normal distribution. My question is, are they saying that the sampling distribution of a binomial rv is approximately normal for large n? Typically, a binomial rv is not thought of as a statistic, at least in these books, but this is the only way that the approximation makes sense to me. Perhaps the sampling distribution of a binomial rv may be normal, much as the sampling distribution of x-bar may be normal? This way, one could calculate a statistic from a sample, like the number of successes, and form a confidence interval. Please tell me if this is way off, but when they say that a binomial rv may be normal for large n, it seems like this would only be true if they were talking about a sampling distribution where repeated samples are selected and the number of successes calculated.
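A binomial count is the sum of n independent 0/1 trials, that is, n times a sample proportion, which is why the Central Limit Theorem applies to it. The quality of the approximation can be checked numerically. The following is a small sketch; the values n = 100, p = 0.3, and k = 35 are arbitrary, and the continuity correction of 0.5 is the usual textbook adjustment rather than anything specific to this thread.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), with continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)

n, p, k = 100, 0.3, 35
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

The two probabilities agree closely for these values; repeating the comparison with a small n or with p near 0 or 1 shows where the approximation starts to break down.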
simple linear regression
I have two questions regarding simple linear regression that I was hoping someone could help me with.
1) According to what I have learned so far, the levels of X are fixed, so that only Y is a random variable (the error is random as well). My question is, what if X is a random variable as well? It seems like this could be the case with some of my textbook examples. Does the simple model y=a+bx+e still hold? Are the assumptions the same, such as the conditional distributions of Y being normal with the same variance, E(Y) being a straight-line function of X, and independence/normality of the error terms? Also, in repeated sampling the sample slope is normal because Y is normal. However, if X also varies from sample to sample, is the sample slope still normally distributed (sampling distribution)?
2) My second question regards the prediction interval. I can compute this on a computer, but it is difficult for me to conceptualize. If you are using Y-hat (the estimated mean response from the fitted regression function) to predict a future response, does this mean that the difference, (Y(future response) - Y-hat), is a statistic that has a sampling distribution, from which you can derive the standard error? It seems like this might be the case, but there is no parameter. I don't even know if what I just said makes any sense. I understand that my questions are long, and perhaps not in any logical order, but I would greatly appreciate any help with these conceptual matters. Thank you
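For question 2), the difference between a future response and Y-hat can be simulated to see that it does have a distribution across repeated samples. The sketch below assumes fixed X levels 1 through 10, a true line y = 2 + 0.5x, normal errors with sigma = 1, and a new observation at x = 12; all of these values are hypothetical. It compares the simulated variance of the prediction error with the standard fixed-X formula sigma^2 * (1 + 1/n + (x_new - x-bar)^2 / Sxx).

```python
import random
import statistics

# Hypothetical setup: fixed X levels, true line, error sd, and new X value.
X_LEVELS = [float(x) for x in range(1, 11)]
A_TRUE, B_TRUE, SIGMA, X_NEW = 2.0, 0.5, 1.0, 12.0

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b * xbar, b

def prediction_errors(reps=5000, seed=2):
    """Simulate (future Y) - (Y-hat at X_NEW) over repeated samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        ys = [A_TRUE + B_TRUE * x + rng.gauss(0.0, SIGMA) for x in X_LEVELS]
        a_hat, b_hat = fit_line(X_LEVELS, ys)
        y_new = A_TRUE + B_TRUE * X_NEW + rng.gauss(0.0, SIGMA)
        errors.append(y_new - (a_hat + b_hat * X_NEW))
    return errors

errors = prediction_errors()
empirical_var = statistics.variance(errors)

n = len(X_LEVELS)
xbar = statistics.fmean(X_LEVELS)
sxx = sum((x - xbar) ** 2 for x in X_LEVELS)
theoretical_var = SIGMA**2 * (1 + 1 / n + (X_NEW - xbar) ** 2 / sxx)

print("simulated variance of prediction error:", round(empirical_var, 3))
print("formula value:", round(theoretical_var, 3))
```

The prediction error is centered at zero and its variance has two sources: the variability of the new observation itself and the sampling variability of the fitted line. In practice sigma^2 is unknown and is replaced by MSE.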
semi-studentized residual
Hello, I have a question regarding the so-called semi-studentized residual, which is of the form (e_i)* = ( e_i - 0 ) / sqrt(MSE). Here, e_i is the ith residual, 0 is the mean of the residuals, and sqrt(MSE) means the square root of MSE. Now, if I understand correctly, the population simple linear regression model assumes that the E_i, the error terms, are independent and identically distributed N(0, sigma^2) random variables. My question is, are semi-studentized residuals not fully studentized because MSE is not the variance of all the residuals? It seems like MSE would be the variance of the residuals, unless of course the residuals from the sample data are not independent and identically distributed random variables. If not, each residual may have its own variance, in which case we would have to find this and studentize each residual by its own standard error? I am not sure if I am thinking about this in the right way. Also, if the E_i are iid random variables, does this mean that the observations Y_i are iid random variables within a particular level of X? (I know that in general the Y_i are not iid r.v. since they have different means depending on the level of X). I hope these questions make sense. Thank you for your help.
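The distinction can be seen numerically. In simple linear regression with fixed X, the ith residual has variance sigma^2 * (1 - h_i), where h_i = 1/n + (x_i - x-bar)^2 / Sxx is the leverage, so the residuals do not share a common variance even though the error terms do. The sketch below uses a small made-up data set (the x and y values are hypothetical) and computes both the semi-studentized and the (internally) studentized residuals.

```python
import math

# Hypothetical data; the last point has a large x value and high leverage.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 9.7]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Least-squares fit of y = a + b*x.
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mse = sum(e * e for e in residuals) / (n - 2)

# Leverage h_i; the estimated variance of e_i is MSE * (1 - h_i).
leverages = [1 / n + (x - xbar) ** 2 / sxx for x in xs]

semi_studentized = [e / math.sqrt(mse) for e in residuals]
studentized = [e / math.sqrt(mse * (1 - h)) for e, h in zip(residuals, leverages)]

for x, h, s, t in zip(xs, leverages, semi_studentized, studentized):
    print(f"x={x:5.1f}  leverage={h:.3f}  semi={s:7.3f}  studentized={t:7.3f}")
```

The two versions are close for points near x-bar and differ most for the high-leverage point, where dividing by sqrt(MSE) alone overstates the residual's standard deviation.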