There are many ways to discretize data. That's one way of looking at
clustering ("vector quantization"). You might also look into modelling
approaches which don't require it: splines, trees, etc. What sort of data
mining are you trying to do?

Reid Huntsinger

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of WeiWei Shi
Sent: Friday, April 29, 2005 3:22 PM
To: bogdan romocea
Cc: [email protected]
Subject: Re: [R] have to point it out again: a distribution question

discretization from continuous domain to categorical one so that some
data mining algorithm can be applied on it. Maybe there should be more
than 3 categories, I don't know. I googled some papers in financial
field, and any more suggestions or references will be helpful.

Ed

On 4/29/05, bogdan romocea <[EMAIL PROTECTED]> wrote:
> > Then, Reid, or other r-gurus, is there a good way to discretize
> > the sample into 3 categories: 2 tails and the body?
>
> Out of curiosity, how do you plan to use that information? What would
> you do if you knew that the 'body' starts here and ends there?
>
>
> -----Original Message-----
> From: WeiWei Shi [mailto:[EMAIL PROTECTED]
> Sent: Thursday, April 28, 2005 4:18 PM
> To: Huntsinger, Reid
> Cc: [email protected]
> Subject: Re: [R] have to point it out again: a distribution question
>
> Here is summary of
> l<-qqnorm(kk) # kk is my sample
> l$y (which is my sample)
> l$x (which is theoretical quantile)
> diff<-l$y-l$x
>
> and
>
> summary(l$y)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  0.9007  0.9942  0.9998  0.9999  1.0060  1.1070
>
> summary(l$x)
>       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
> -4.145e+00 -6.745e-01  0.000e+00  2.383e-17  6.745e-01  4.145e+00
>
> summary(diff)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> -3.0380  0.3311  0.9998  0.9999  1.6690  5.0460
>
> Comparing diff with l$x, though the 1st Qu. and 3rd Qu. are different,
> diff and l$x seem similar to each other, which are proved by
> qqnorm(l$x) and qqnorm(diff).
>
> running the following codes:
>
> r<-rnorm(1000)+1 # since my sample shift from zero to 1
> qq(r[r>0.9 & r<1.2]) # select the central part
>
> this gives me a straight line now.
>
> Thanks for the good explanation for the phenomena.
>
> Then, Reid, or other r-gurus, is there a good way to discretize the
> sample into 3 categories: 2 tails and the body?
>
> Thanks again,
>
> Weiwei
>
> On 4/28/05, Huntsinger, Reid <[EMAIL PROTECTED]> wrote:
> > Stock returns and other financial data have often been found to be
> > heavy-tailed. Even Cauchy distributions (without even a first absolute
> > moment) have been entertained as models.
> >
> > Your qq function subtracts numbers on the scale of a normal (0,1)
> > distribution from the input data. When the input data are scaled so that
> > they are insignificant compared to 1, say, then you get essentially the
> > "theoretical quantiles", i.e. the "x" component of the list, back from
> > l$x - l$y. l$x is basically a sample from a normal(0,1) distribution so
> > they do line up perfectly in the second qqnorm(). Is that what's happening?
> >
> > Reid Huntsinger
> >
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of WeiWei Shi
> > Sent: Thursday, April 28, 2005 1:38 PM
> > To: Vincent ZOONEKYND
> > Cc: [email protected]
> > Subject: [R] have to point it out again: a distribution question
> >
> > Dear R-helpers:
> > I pointed out my question last time but it is only partially solved.
> > So I would like to point it out again since I think it is very
> > interesting, at least to me.
> > It is a question not about how to use R, instead it is a kind of
> > theoretical plus practical question, represented by R.
> >
> > I came with this question when I built model for some stock returns.
> > That's the reason I cannot post the complete data here. But I would
> > like to attach some plots here (I zipped them since the original ones
> > are too big).
> >
> > The first plot qq1, is qqnorm plot of my sample, giving me some
> > "S"-shape. Since I am not very experienced, I am not sure what kind of
> > distribution my sample follows.
> >
> > The second plot, qq2, is obtained via
> > qqnorm(rt(10000, 4)) since I run
> > fitdistr(kk, 't') and got
> >         m              s              df
> >   9.998789e-01   7.663799e-03   3.759726e+00
> >  (5.332631e-05) (5.411400e-05) (8.684956e-02)
> >
> > The second plot seems to say my sample distr follows t-distr. (not sure of
> > this)
> >
> > BTW, what are the commands for simulating other distr like log-norm,
> > exponential, and so on?
> >
> > The third one was obtained by running the following R code:
> >
> > Suppose my data is read into dataset k from file "f392.txt":
> > k<-read.table("f392.txt", header=F) # read into k
> > kk<-k[[1]]
> > qq(kk)
> >
> > qq function is defined as below:
> > qq<-function(dataset){
> >   l<-qqnorm(dataset, plot.it=F)
> >   diff<-l$y-l$x # difference b/w sample and its theoretical quantile
> >   qqnorm(diff)
> > }
> >
> > The most interesting thing is (if there is not any stupid game here,
> > and if my sample follows some kind of distribution (no matter if such
> > distr has been found or not)), my qq function seems like a way to
> > evaluate it. But what I am worried about, the line is too "perfect",
> > which indicates there is something goofy here, which can be proved via
> > some mathematical inference to get it. However I used
> > qq(rnorm(10000))
> > qq(rt(10000, 3.7))
> > qq(rf(....))
> >
> > None of them gave me this perfect line!
> >
> > Sorry for the long question but I want to make it clear to everybody
> > about my question. I tried my best :)
> >
> > Thanks for your reading,
> >
> > Weiwei (Ed) Shi, Ph.D
> >
> > On 4/23/05, Vincent ZOONEKYND <[EMAIL PROTECTED]> wrote:
> > > If I understand your problem, you are computing the difference between
> > > your data and the quantiles of a standard gaussian variable -- in
> > > other words, the difference between the data and the red line, in the
> > > following picture.
> > >
> > > N <- 100 # Sample size
> > > m <- 1 # Mean
> > > s <- 2 # dispersion
> > > x <- m + s * rt(N, df=2) # Non-gaussian data
> > >
> > > qqnorm(x)
> > > abline(0,1, col="red")
> > >
> > > And you get
> > >
> > > y <- sort(x) - qnorm(ppoints(N))
> > > hist(y)
> > >
> > > This is probably not the right line (not only because your mean is 1,
> > > the slope is wrong as well -- if the data were gaussian, you could
> > > estimate it with the standard deviation).
> > >
> > > You can use the "qqline" function to get the line passing through the
> > > first and third quartiles, which is probably closer to what you have
> > > in mind.
> > >
> > > qqnorm(x)
> > > abline(0,1, col="red")
> > > qqline(x, col="blue")
> > >
> > > The differences are
> > >
> > > x1 <- quantile(x, .25)
> > > x2 <- quantile(x, .75)
> > > b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
> > > a <- x1 - b * qnorm(.25)
> > > y <- sort(x) - (a + b * qnorm(ppoints(N)))
> > > hist(y)
> > >
> > > And you want to know when the differences cease to be "significantly"
> > > different from zero.
> > >
> > > plot(y)
> > > abline(h=0, lty=3)
> > >
> > > You can use the plot to fix a threshold, but unless you have a model
> > > describing how non-gaussian your data are, this will be empirical.
> > >
> > > You will note that, in those simulations, the differences (either
> > > yours or those from the lines through the first and third quartiles)
> > > are not gaussian at all.
> > >
> > > -- Vincent
> > >
> > >
> > > On 4/22/05, WeiWei Shi <[EMAIL PROTECTED]> wrote:
> > > > hope it is not b/c some central limit theory, otherwise my initial
> > > > plan will fail :)
> > > >
> > > > On 4/22/05, WeiWei Shi <[EMAIL PROTECTED]> wrote:
> > > > > Hi, r-gurus:
> > > > >
> > > > > I happened to have a question in my work:
> > > > >
> > > > > I have a dataset, which has only one dimension, like
> > > > > 0.99037297527605
> > > > > 0.991179836732708
> > > > > 0.995635340631367
> > > > > 0.997186769599305
> > > > > 0.991632565640424
> > > > > 0.984047197106486
> > > > > 0.99225943762649
> > > > > 1.00555642128421
> > > > > 0.993725402926564
> > > > > ....
> > > > >
> > > > > the data is saved in a file called f392.txt.
> > > > >
> > > > > I used the following codes to play around :)
> > > > >
> > > > > k<-read.table("f392.txt", header=F) # read into k
> > > > > kk<-k[[1]]
> > > > > l<-qqnorm(kk)
> > > > > diff=c()
> > > > > lenk<-length(kk)
> > > > > i=1
> > > > > while (i<=lenk){
> > > > >   # save the difference of theoretical quantile and sample quantile
> > > > >   diff[i]=l$y[i]-l$x[i]
> > > > >   # remember, my sample mean is around 1 while the theoretical one, 0
> > > > >   i<-i+1
> > > > > }
> > > > > hist(diff, breaks=300) # analyze the distr of such diff
> > > > > qqnorm(diff)
> > > > >
> > > > > my question is:
> > > > > from l<-qqnorm(kk), I wanted to know, from which point (or cut), the
> > > > > sample points start to become away from theoretical ones. That's the
> > > > > reason I played around the "diff" list, which gives me the difference.
> > > > > To my surprise, the diff is perfectly normal. I tried to use some
> > > > > kk<-c(1, 2, 5, -1 , ...) to test, I concluded it must be some
> > > > > distribution my sample follows gives this finding.
> > > > >
> > > > > So, any suggestion on the distribution of my sample?
> > > > > I think there might be some mathematical inference which can lead to
> > > > > this observation, but not quite sure.
> > > > >
> > > > > btw,
> > > > > > fitdistr(kk, 't')
> > > > >         m              s              df
> > > > >   9.999965e-01   7.630770e-03   3.742244e+00
> > > > >  (5.317674e-05) (5.373884e-05) (8.584725e-02)
> > > > >
> > > > > btw2, can anyone suggest a way to find the "cut" or "threshold" from
> > > > > my sample to discretize them into 3 groups: two tail-group and one
> > > > > main group.--------- my focus.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Ed
> > > > >
> > > >
> > > > ______________________________________________
> > > > [email protected] mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide!
> > > > http://www.R-project.org/posting-guide.html
> >
> > ------------------------------------------------------------------------------
> > Notice: This e-mail message, together with any attachment...{{dropped}}
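
Reid's explanation of the "too perfect" line in the thread can be checked
with a small simulation. The sketch below is only illustrative: kk_sim is a
made-up stand-in for the real sample (its scale and degrees of freedom are
loosely borrowed from the fitdistr() output, which is an assumption), and
qq_diff() is a hypothetical helper that mirrors the thread's qq() function
but returns the differences instead of plotting them.

```r
set.seed(1)

# Hypothetical stand-in for the real sample: t-distributed noise with a
# tiny scale around 1 (values loosely based on the fitdistr() output).
kk_sim <- 1 + 0.0077 * rt(10000, df = 3.76)

# Same computation as the qq() function in the thread, but returning the
# theoretical quantiles and the differences rather than plotting them.
qq_diff <- function(dataset) {
  l <- qqnorm(dataset, plot.it = FALSE)
  list(x = l$x, diff = l$y - l$x)
}

res <- qq_diff(kk_sim)

# The spread of kk_sim is tiny next to the N(0,1) quantiles, so diff is
# approximately 1 - x: a reflected, shifted copy of the normal quantiles.
cor_with_x <- cor(res$diff, res$x)
print(cor_with_x)

# For data on the same scale as N(0,1) this cancellation does not happen.
res_t <- qq_diff(rt(10000, df = 3.7))
print(cor(res_t$diff, res_t$x))
```

If the first correlation is close to -1, diff is essentially 1 - l$x, so
qqnorm(diff) can hardly show anything but a straight line, whatever the
shape of the sample. That would also be consistent with qq(rnorm(10000))
and qq(rt(10000, 3.7)) not producing the same picture, since those samples
are not small compared to the normal quantiles.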

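
For the discretization question itself, one simple option among the many
ways mentioned at the top is to cut at empirical quantiles. This is a
minimal sketch, assuming a 5% cut-off for each tail; that value is an
arbitrary choice and not something the thread settles. kk_sim is a
simulated stand-in, not the data from the thread.

```r
set.seed(2)

# Hypothetical stand-in for the sample (heavy-tailed, centred near 1).
kk_sim <- 1 + 0.0077 * rt(5000, df = 3.76)

# Assumed cut points: the lower and upper 5% form the two tails.
# These probabilities are arbitrary and should be tuned to the application.
tail_prob <- 0.05
cuts <- unname(quantile(kk_sim, probs = c(tail_prob, 1 - tail_prob)))

# cut() maps each value to one of three categories.
groups <- cut(kk_sim,
              breaks = c(-Inf, cuts, Inf),
              labels = c("lower tail", "body", "upper tail"))

print(table(groups))
```

Replacing the two quantiles with thresholds read off a qqline-based
difference plot, as Vincent suggests, only requires changing how cuts is
computed; the call to cut() stays the same.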