robert wrote:
> here the bootstrap test will as well tell us, that the confidence intervall
> narrows down by a factor ~sqrt(10) - just the same as if there would be
> 10-fold more of well distributed "new" data. Thus this kind of error
> estimation has no reasonable basis for data which is not very good.
The confidence interval narrows when the amount of independent data increases. If you don't understand why, then you lack a basic understanding of statistics. In particular, it is a fundamental assumption in most statistical models that the data samples are "INDEPENDENTLY AND IDENTICALLY DISTRIBUTED", often abbreviated "i.i.d.", and it certainly is assumed in this case. If you tell the computer (or model) that you have i.i.d. data, it will assume it is i.i.d. data, even when it's not. The fundamental law of computer science also applies to statistics: shit in = shit out. If you nevertheless provide data that are not i.i.d., as you just did, you will simply obtain invalid results.

The confidence interval concerns uncertainty about the value of a population parameter, not about the spread of your data sample. If you collect more INDEPENDENT data, you know more about the population from which the data was sampled. The confidence interval has the property that it will contain the unknown "true correlation" 95% of the times it is generated. Thus if you draw two samples WITH INDEPENDENT DATA from the same population, one small and one large, the large sample will generate a narrower confidence interval. Computer-intensive methods like bootstrapping and asymptotic approximations derived analytically behave similarly in this respect (the first sketch at the end of this message illustrates the point).

However, if you are dumb enough to just provide duplications of your data, the computer is dumb enough to accept that they were obtained independently. In statistical jargon this is called "pseudo-sampling", and it is one of the most common fallacies among uneducated practitioners. Statistical software doesn't prevent the practitioner from shooting himself in the foot; it actually makes it a lot easier. Anyone can paste data from Excel into SPSS and hit "ANOVA" in the menu. Whether the output makes any sense is a whole other story. One can duplicate each sample three or four times, and SPSS would be ignorant of that fact. It cannot guess that you are providing it with crappy data and prevent you from screwing up your analysis. The same goes for NumPy code. The statistical formulas you type in Python have certain assumptions, and when they are violated the output is of no value. The more severe the violation, the less valuable the output (the second sketch at the end shows what duplicating a sample does to a bootstrap interval).

> The interesting task is probably this: to check for linear correlation but
> "weight clumping of data" somehow for the error estimation.

If you have a pathological data sample, then you need to specify your knowledge in greater detail. Can you, for example, formulate a reasonable stochastic model for your data, fit the model parameters using the data, and then derive the correlation analytically?

I am beginning to think your problem is ill-defined because you lack a basic understanding of maths and statistics. For example, it seems you were confusing numerical error (rounding and truncation error) with statistical sampling error; you don't understand why standard errors decrease with sample size; you are testing with pathological data; you don't understand the difference between independent data and data duplications; etc. You really need to pick up a statistics textbook and do some reading. That's my advice.
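To make the first point concrete, here is a minimal sketch (my own illustration, not code from this thread) that bootstraps a 95% percentile confidence interval for Pearson's r on two genuinely i.i.d. samples, one of size 30 and one of size 300. The sample sizes, the assumed true correlation of 0.6, and the helper names bootstrap_ci_r / make_sample are choices I made up for the demo; with ten times more independent data the interval should come out roughly sqrt(10) narrower.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_ci_r(x, y, n_boot=2000, alpha=0.05):
        """Percentile bootstrap CI for the Pearson correlation of paired data."""
        n = len(x)
        rs = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, n)          # resample pairs with replacement
            rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
        return np.percentile(rs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    def make_sample(n, rho=0.6):
        """Draw n i.i.d. pairs whose population correlation is rho (assumed value)."""
        cov = [[1.0, rho], [rho, 1.0]]
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        return x, y

    for n in (30, 300):   # ten-fold more *independent* data, not duplicates
        x, y = make_sample(n)
        lo, hi = bootstrap_ci_r(x, y)
        print(f"n={n:4d}  95% bootstrap CI for r: [{lo:.3f}, {hi:.3f}]  width={hi - lo:.3f}")

The percentile interval is the simplest bootstrap interval; fancier variants (BCa, studentized) behave the same way with respect to sample size.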
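And here is the second sketch, the same kind of bootstrap fed with pseudo-samples: a single sample of 30 pairs pasted in ten times, which is the situation robert described. Again this is my own illustration under assumed values (true correlation 0.6, made-up helper name); the interval shrinks by roughly sqrt(10), not because anything new was learned about the population, but because the i.i.d. assumption was violated.

    import numpy as np

    rng = np.random.default_rng(1)

    def bootstrap_ci_r(x, y, n_boot=2000, alpha=0.05):
        """Percentile bootstrap CI for the Pearson correlation of paired data."""
        n = len(x)
        rs = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, n)          # resampling treats every row as independent
            rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
        return np.percentile(rs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    # one genuine sample of 30 i.i.d. pairs (assumed true correlation 0.6)
    cov = [[1.0, 0.6], [0.6, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=30).T

    lo, hi = bootstrap_ci_r(x, y)
    print(f"original sample (n=30):       CI width = {hi - lo:.3f}")

    # pseudo-sampling: the very same 30 pairs duplicated ten times
    x10, y10 = np.tile(x, 10), np.tile(y, 10)
    lo, hi = bootstrap_ci_r(x10, y10)
    print(f"same sample duplicated 10x:   CI width = {hi - lo:.3f}  (narrower, but meaningless)")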