On 15-Sep-04 Brian Mac Namee wrote:
> Sorry if this is a rather long post. I have a simple list of single
> feature data points from which I would like to generate a probability
> that an unseen point comes from the same distribution. To do this I am
> trying to estimate the probability density of the list of points and
> use this to generate a probability for the new unseen points. I have
> managed to use the R density function to generate the density estimate
> but have not been able to do anything with this - i.e. generate a
> probability that a new point comes from the same distribution. Is
> there a function to do this, or am I way off the mark using the
> density function at all?

It's not clear what you're really after, but it looks as though you
may be wanting to sample from the distribution estimated by 'density'.
A possible approach, which you could refine, is exemplified by

  x <- rnorm(1000)
  d <- density(x, n=4096)
  # replace=TRUE so that the 1000 draws are independent
  y <- sample(d$x, size=1000, replace=TRUE, prob=d$y)

Check performance with

  hist(y)

Looks OK to me! See "?density" and "?sample".

On an alternative interpretation, perhaps you want to first estimate
the density based on data you already have, and then, when you have
got further data (but these would then be "seen" and not "unseen"),
come to a judgement about whether these new points are compatible
with coming from the distribution you have estimated.

A possible approach to this question (again susceptible to
refinement) would be as follows.

1. Use a fine-grained grid for 'density', i.e. a large value
   for "n".

2. Replace each of the points in the new data by the nearest
   point in this grid. Call these values z1, z2, ... , zk,
   corresponding to index values i1, i2, ... , ik in d$x.

3. Evaluate the probability P(z1,...,zk) from the density as
   the product of d$y[i] where i<-c(i1,...,ik). Better still,
   evaluate the logarithm of this. Call the result L.

4. Now simulate a large number of draws of k values from d
   on the lines of

     sample(d$x, size=k, replace=TRUE, prob=d$y)

   as above, and evaluate L for each of these. Where is the
   value of L from (3) situated in the distribution of these
   values of L from (4)?

If (say) only 1 per cent of the simulated values of L from "d" are
less than the value of L from (3), then you have a basis for a test
that your new data did not come from the distribution you have
estimated from your old data, in that the new data are from the
low-density part of the estimated distribution.

There are of course alternative ways to view this question. The
value of "k" is relevant. In particular, if "k" is small (say 3 or 4)
then the suggestion in (4) is probably the best way to approach it.
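
Steps 1 to 4 above might be sketched in R roughly as follows. This is only an illustration under assumptions: `x_old` and `x_new` are made-up stand-ins for the existing and the new data, the 2000 replicates are an arbitrary choice, and the names `nearest_index`, `L_obs`, `L_sim` and `p_low` are hypothetical. The draws use `replace = TRUE` so that each simulated set of k values is an independent sample from the estimated density.

```r
set.seed(1)                      # only to make this sketch reproducible
x_old <- rnorm(1000)             # assumed stand-in for the data already in hand
x_new <- rnorm(4, mean = 3)      # assumed stand-in for k = 4 new points

# Step 1: density estimate on a fine-grained grid
d <- density(x_old, n = 4096)

# Step 2: index in d$x of the grid point nearest to each new value
nearest_index <- function(z) {
  sapply(z, function(v) which.min(abs(d$x - v)))
}

# Step 3: log of the product of the density values = sum of their logs
L_obs <- sum(log(d$y[nearest_index(x_new)]))

# Step 4: draw k grid points with probability proportional to d$y,
# many times over, and evaluate L for each draw
k <- length(x_new)
L_sim <- replicate(2000, {
  idx <- sample(length(d$x), size = k, replace = TRUE, prob = d$y)
  sum(log(d$y[idx]))
})

# Proportion of simulated L values lying below the observed L
p_low <- mean(L_sim < L_obs)
p_low
```

A small `p_low` says that the new points sit in a lower-density region than most simulated samples of the same size do. Bear in mind that `d` is itself an estimate, so the result should be read as a rough guide.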
However, if "k" is large then you can use a test on the lines of
Kolmogorov-Smirnov, with the reference distribution estimated as the
cumulative distribution of d$y and the distribution being tested as
the empirical cumulative distribution of your new data.

Even sharper focus is available if you are in a position to make a
parametric model for your data, but your description does not suggest
that this is the case.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 167 1972
Date: 15-Sep-04                                       Time: 15:07:33
------------------------------ XFMail ------------------------------
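
For the large-"k" case, the Kolmogorov-Smirnov comparison mentioned above could be sketched along these lines. The data `x_old` and `x_new` are again made-up stand-ins, and `F_hat` is a hypothetical name for a cumulative distribution function built by accumulating `d$y` over the equally spaced grid and interpolating. `ks.test` accepts a function as its reference distribution.

```r
set.seed(2)                      # only to make this sketch reproducible
x_old <- rnorm(1000)             # assumed stand-in for the data already in hand
x_new <- rnorm(200)              # assumed stand-in for a large batch of new data

d <- density(x_old, n = 4096)

# Cumulative distribution of the estimate: running sum of d$y over the
# equally spaced grid, rescaled so that it ends at 1
cdf_vals <- cumsum(d$y) / sum(d$y)

# Interpolate to get a function usable at arbitrary points;
# 0 to the left of the grid, 1 to the right of it
F_hat <- approxfun(d$x, cdf_vals, yleft = 0, yright = 1)

# One-sample Kolmogorov-Smirnov test of the new data against F_hat
ks <- ks.test(x_new, F_hat)
ks$p.value
```

The p-value here treats `F_hat` as a fully known distribution, although it was estimated from `x_old`, so it is approximate at best.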
