Hi:

On Mon, Jul 26, 2010 at 11:36 AM, xin wei <xin...@stat.psu.edu> wrote:
> hi, this is more a statistical question than an R question, but I do want
> to know how to implement this in R.
> I have 10,000 data points. Is there any way to generate an empirical
> probability distribution from them? (The problem is that I do not know
> exactly which distribution they follow: normal, beta?) My ultimate goal is
> to generate an additional 20,000 data points from this empirical
> distribution created from the existing 10,000 data points.
> thank you all in advance.

The problem, it seems to me, is the leap of faith you're taking that the
empirical distribution of your observed sample will serve as a useful
data-generating mechanism for the 20,000 future observations you want to
generate. I would think that, if you intend to draw a sample of 20,000 from
ANY distribution, you would want some confidence in the specification of
said distribution.

Even if you don't know exactly what type of population distribution you're
dealing with, there are ways to narrow down the set of possibilities. What
is the domain/support of the distribution? For example, the Normal is
defined on all of R (as in the real numbers, not our favorite statistical
programming language), whereas the lognormal, Gamma and Weibull
distributions, among others, are defined on the positive reals. The beta
distribution is defined on [0, 1]. Therefore, knowledge of the domain is
useful in and of itself.

Is it plausible that the distribution is symmetric, or should it have a
distinct left or right skew? (Similar comments apply to discrete
distributions.) Is censoring or truncation a relevant concern? If there is a
random process that describes well how the data you observe are generated,
that will certainly narrow down the class of potential data-generating
mechanisms/distributions.

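As a rough illustration of the checks above (support and skewness), something along these lines could be a starting point. This is only a sketch: x is simulated here as a hypothetical stand-in for the real 10,000 observations, and the variable names are made up for the example.

```r
set.seed(42)
# Hypothetical stand-in for the observed sample; replace with the real data
x <- rlnorm(10000, meanlog = 0, sdlog = 0.5)

# Support: where do the observations actually live?
print(range(x))
in_unit_interval  <- all(x >= 0 & x <= 1)  # would make a beta plausible
strictly_positive <- all(x > 0)            # lognormal/Gamma/Weibull territory

# Symmetry: moment-based sample skewness (near 0 for symmetric data,
# positive for a right skew, negative for a left skew)
centered <- x - mean(x)
skew <- mean(centered^3) / mean(centered^2)^1.5
print(skew)

# Quartiles and mean give a quick numerical picture of the shape
print(summary(x))
# hist(x, breaks = 50)  # uncomment for a visual check
```

None of this identifies the distribution; it only helps rule out candidates whose support or shape is clearly at odds with the sample.
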
Once you've narrowed down the class of possible distributions as much as
possible, you could look into the fitdistr() function in MASS or the
fitdistrplus package on CRAN to test which candidates seem plausible with
respect to your existing sample and which do not. You are not likely to be
able to narrow it down to one family of distributions, but after going
through this exercise you should have a much better idea about the
characteristics of the distribution that gave rise to your sample of 10,000
(assuming, of course, that it is a *random* sample), which you can apply to
the generation of the next 20,000 observations.

OTOH, if your existing 10,000 observations were not produced by some random
process, all bets are off.

HTH,
Dennis

> --
> View this message in context:
> http://r.789695.n4.nabble.com/how-to-generate-a-random-data-from-a-empirical-distribition-tp2302716p2302716.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
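
P.S. The fitting step mentioned above might look roughly like the sketch below. It assumes the data are strictly positive, uses simulated values as a hypothetical stand-in for the real sample, and compares three candidate families by AIC, which is just one of several reasonable criteria. The object names (x, fits, new_x) are invented for the example; fitdistr() may print warnings while optimizing, which are usually harmless here.

```r
library(MASS)  # provides fitdistr()

set.seed(1)
# Hypothetical stand-in for the 10,000 observed values; replace with real data
x <- rgamma(10000, shape = 2, rate = 0.5)

# Maximum-likelihood fits for candidates defined on the positive reals
fits <- list(
  lognormal = fitdistr(x, "lognormal"),
  gamma     = fitdistr(x, "gamma"),
  weibull   = fitdistr(x, "weibull")
)

# Lower AIC indicates a better fit among these candidates only;
# it does not show that the winner is the true distribution
aic <- sapply(fits, AIC)
print(sort(aic))

best <- names(which.min(aic))
est  <- fits[[best]]$estimate

# Draw 20,000 new values from the selected fitted distribution
new_x <- switch(best,
  lognormal = rlnorm(20000, meanlog = est["meanlog"], sdlog = est["sdlog"]),
  gamma     = rgamma(20000, shape = est["shape"], rate = est["rate"]),
  weibull   = rweibull(20000, shape = est["shape"], scale = est["scale"])
)
print(summary(new_x))
```

A quantile-quantile comparison of x against new_x (for example with qqplot()) would be a sensible sanity check before relying on the generated values.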