CRIMCOORD transformation in QUEST
Hi, thank you for reading this message. I am having trouble getting the correct CRIMCOORD transformation of categorical variables in the QUEST decision tree algorithm. Your help will be greatly appreciated.

Q1: In Loh and Shih's paper (Split Selection Methods for Classification Trees, Statistica Sinica, 1997, vol. 7, pp. 815-840), they describe the mapping from a categorical variable to an ordered variable via CRIMCOORD. But their explanation, in particular step 5 of Algorithm 2, is not clear. For example, step 5 says: Perform a singular value decomposition of the matrix GFU and let a (vector) be the eigenvector (of what?) associated with the largest eigenvalue. Does this mean that a (vector) is the eigenvector of transpose(GFU)*GFU?

Q2: I tried to verify the data sets in Table 1. Data sets I-III are OK, but the result for data set IV seems to be incorrect. Could anyone help me verify that?

Thank you very much for your help!

David

= Instructions for joining and leaving this list, remarks about the problem of INAPPROPRIATE MESSAGES, and archives are available at http://jse.stat.ncsu.edu/ =
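On the linear algebra behind Q1, independent of what the paper intends: for any real matrix A with singular value decomposition A = U D V', the right singular vectors (columns of V) are eigenvectors of A'A and the left singular vectors (columns of U) are eigenvectors of AA', in both cases with eigenvalues equal to the squared singular values. The sketch below checks this numerically with power iteration in plain Python. The matrix `A` and the helper names are made up for illustration; it is not the GFU matrix from the paper, and which of the two vectors step 5 calls a is not settled by this check.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def top_singular_triplet(A, iters=500):
    """Power iteration on A'A: returns (sigma, u, v) for the largest singular value."""
    At = transpose(A)
    v = [1.0] * len(A[0])              # arbitrary nonzero starting vector
    for _ in range(iters):
        w = matvec(At, matvec(A, v))   # w = A'A v
        n = norm(w)
        v = [wi / n for wi in w]
    Av = matvec(A, v)
    sigma = norm(Av)                   # largest singular value
    u = [x / sigma for x in Av]        # matching left singular vector
    return sigma, u, v

# Hypothetical 3 x 2 matrix, used only to illustrate the identity.
A = [[3.0, 1.0],
     [1.0, 2.0],
     [0.0, 1.0]]

sigma, u, v = top_singular_triplet(A)
At = transpose(A)
AtA_v = matvec(At, matvec(A, v))       # should equal sigma^2 * v
AAt_u = matvec(A, matvec(At, u))       # should equal sigma^2 * u

print("largest singular value:", sigma)
print("A'A v - sigma^2 v:", [a - sigma**2 * b for a, b in zip(AtA_v, v)])
print("AA' u - sigma^2 u:", [a - sigma**2 * b for a, b in zip(AAt_u, u)])
```

So "the eigenvector associated with the largest eigenvalue" is well defined once one says whether the left or the right singular vector is meant; the two live in different spaces unless the matrix is square.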
Re: Cauchy PDF + Parameter Estimate
In article [EMAIL PROTECTED], Glen Barnett [EMAIL PROTECTED] wrote:
Herman Rubin wrote:
In article a5daqb$72k$[EMAIL PROTECTED], Chia C Chong [EMAIL PROTECTED] wrote:

Hi! Has anyone come across some Matlab code to estimate the parameters for the Cauchy PDF? Or some other sources about the method to estimate their parameters?

What is so difficult about maximum likelihood? Start with a reasonable estimator, and use Newton's method.

There are difficulties with Newton's method (and many other hill-climbing techniques) because the Cauchy likelihood function is generally multimodal. You can end up somewhere other than the MLE unless you use a somewhat more sophisticated starting point than a reasonable estimator. There are good estimators that can start you off very close to the true maximum, but it's a long time since I've seen that literature, so I can't name names right now.

The Cauchy likelihood function is frequently multimodal; for large samples, when estimating the center with known spread, the probability that it is unimodal is about 0.13. However, for reasonable sample sizes, the other modes will be way out, and will be small.

For squared error loss, the best translation invariant estimator (the Pitman estimator) can be computed by a closed formula, but I would be concerned about the numerical error if it is not done using considerably higher precision. It can also be done by numerical integration, which is not that difficult.

However, I believe that the MLE will be rather good for moderate samples. The local MLE starting with quantile estimates should work quite well. Also, if one knows it is Cauchy, there are estimators using a few quantiles which are close to efficient.

--
This address is for information only. I do not claim that these views are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054 FAX: (765)494-0558
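The "local MLE starting with quantile estimates" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the thread: the start values are the sample median (location) and half the interquartile range (scale, since the Cauchy quartiles sit one scale unit either side of the center), and the climb uses the standard EM-style reweighting for a t distribution with one degree of freedom rather than Newton's method, because it is simpler to write safely. Like any local method it finds the mode nearest its start. The function name `cauchy_mle` and the simulated data are hypothetical.

```python
import math
import random
import statistics

def cauchy_mle(xs, tol=1e-12, max_iter=1000):
    """Local MLE of Cauchy location m and scale s, started from sample quantiles."""
    q1, med, q3 = statistics.quantiles(xs, n=4)
    m, s = med, (q3 - q1) / 2.0        # quantile-based starting point
    n = len(xs)
    for _ in range(max_iter):
        # Weights downweight observations far from the current center.
        w = [2.0 / (1.0 + ((x - m) / s) ** 2) for x in xs]
        m_new = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        s_new = math.sqrt(sum(wi * (x - m_new) ** 2 for wi, x in zip(w, xs)) / n)
        done = abs(m_new - m) < tol and abs(s_new - s) < tol
        m, s = m_new, s_new
        if done:
            break
    return m, s

# Simulated Cauchy data with location 3 and scale 2 (values chosen for illustration).
random.seed(0)
xs = [3.0 + 2.0 * math.tan(math.pi * (random.random() - 0.5)) for _ in range(2000)]

m_hat, s_hat = cauchy_mle(xs)
print("location estimate:", m_hat)
print("scale estimate:", s_hat)
```

A fixed point of this iteration satisfies both likelihood equations, so the result is a stationary point of the likelihood near the quantile start.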
Re: detecting outliers in NON normal data ?
In article 00f301c1be68$13413000$fde9e3c8@oemcomputer, Voltolini [EMAIL PROTECTED] wrote:

Hi, I would like to know if methods for detecting outliers using interquartile ranges are indicated for data with a NON normal distribution. The software Statistica presents this method:

data point value > UBV + o.c.*(UBV - LBV)
data point value < LBV - o.c.*(UBV - LBV)

where UBV is the 75th percentile, LBV is the 25th percentile, and o.c. is the outlier coefficient.

In the biological world many data are not normally distributed, and tests like Rosner, Dixon and Grubbs (if I am right!) are good just for normally distributed data.

Nothing is normally distributed; some may come close. But are they even good for normally distributed data?

Why should anyone be concerned about outliers? If there are observations produced under the assumed model, they should be included, no matter how far out they are. The only legitimate justification for excluding some data points is that errors of some kind have occurred in producing them, whether they are outliers or inliers.

--
This address is for information only. I do not claim that these views are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054 FAX: (765)494-0558
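For reference, the interquartile-range rule quoted in the question can be sketched as below. The rule itself uses only percentiles and makes no normality assumption; whether a flagged point should be excluded is the separate question raised in the reply. The function name `iqr_outliers`, the default coefficient of 1.5, the choice of quantile method, and the sample data are all assumptions made for illustration, not taken from Statistica.

```python
import statistics

def iqr_outliers(values, coef=1.5):
    """Return the values lying outside LBV - coef*IQR and UBV + coef*IQR."""
    # LBV and UBV are the 25th and 75th percentiles; other quantile
    # definitions exist and can move the fences slightly.
    lbv, _, ubv = statistics.quantiles(values, n=4, method="inclusive")
    spread = ubv - lbv
    low, high = lbv - coef * spread, ubv + coef * spread
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up data with one extreme value
print(iqr_outliers(data))
```

The function only flags points; it does not say why they are extreme or whether they came from a recording error.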
Workshop on regression
MAA PREP Workshop in Statistics for Summer 2002

An Introduction to Statistical Methods Based on Regression

Dates: June 2 through 7, 2002
Location: Oberlin College

Presenters:

Richard L. Scheaffer
Department of Statistics
University of Florida
Gainesville, FL 32611
Phone: 352-378-1996
Fax: 352-392-5175
Email: [EMAIL PROTECTED]

Jeffrey A. Witmer
Mathematics Department
Oberlin College
King Bldg 205
Oberlin, OH 44074-1019
Phone: (440) 775-8381
Fax: (440) 775-6638
Email: [EMAIL PROTECTED]

Overview: Regression, in its many facets, is probably the most widely used statistical methodology in existence. It is the basis of modeling, whether the modeling is directed toward searching for associations among variables in observational studies or establishing treatment differences in designed experiments. The workshop will cover the data analytic techniques appropriate to modern use of regression analysis, as well as the inferential procedures most widely used with this methodology. Beginning with establishing principles and concepts through simple linear regression, the course will build to discussions of multiple regression, including models involving categorical response variables. Regression is an appropriate topic to serve as the basis of a second course in statistics, for those who have taken or taught an introductory course. It need not be calculus based but does rely heavily on statistical software.

Outline of workshop content:

Regression basics
Simple linear regression and correlation
Multiple regression
Regression diagnostics (residuals, influence, leverage)
Partial correlation
Inference for regression coefficients, individually and in subsets
Model selection
Logistic regression
Analysis of variance models
Completely randomized design
Randomized block design
Repeated measures design
Factorial treatment arrangements
Sample survey design and analysis
Ratio and regression estimators

Main reference text for the course: Ramsey, F. and Schafer, D. (2002). The Statistical Sleuth, 2nd ed. Belmont, CA: Duxbury Press.

Technology: Regression software will be demonstrated and used by participants. Many illustrative statistics applets and data sets from various sources on the web will be introduced, including those referenced in the American Statistical Association's electronic Journal of Statistics Education. Participants will have access to computers running standard statistical software and will have access to the web.

Instructional Format: Believing in active learning, the presenters will provide many opportunities for participants to engage in hands-on activities, both with and without the aid of technology, during the workshop. A reading list will be supplied to participants in advance of the workshop, and time will be devoted to discussions of this material. Although the emphasis will be on content, there will be opportunities for demonstrating and discussing various pedagogical approaches to teaching regression analysis.

Cost: Room and board are provided for all participants through a grant from NSF. Participants must fund their own transportation to and from Oberlin.

Applying for Participation: Applications must be made through MAA at http://www.maa.org/pfdev/prep/prep.html
Applications should be sent in by March 31, 2002.

--
Richard L. Scheaffer [EMAIL PROTECTED]
Department of Statistics, phone 352-392-1941 (#224)
Box 118545, fax 352-392-5175
University of Florida
Gainesville, FL 32611

907 NW 21 Terrace, 352-378-1996
Gainesville, FL 32603
Re: Means of semantic differential scales
J. Williams wrote:
On Mon, 25 Feb 2002 15:17:55 -0800, Jay Tanzman [EMAIL PROTECTED] wrote:

I just got chewed out by my boss for modelling the means of some 7-point semantic differential scales. The scales were part of a written, self-administered questionnaire, and were laid out like this:

Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful

So, why or why not is it kosher to model the means of scales like this?

-Jay

You can check it out by reading the pioneers of the semantic differential scale. Osgood, Suci, and Tannenbaum are the authors of Measurement of Meaning, which is now published in paperback by the University of Illinois Press, Oct. 1990.

Thanks. I'll do that.

I think one of the above authors also has a website, though yesterday it crashed my browser. Then again, my browser was Netscape...

It may be your boss is a stickler on what constitutes a true interval scale.

Yes, that is it. See my response to Jay Warner for the details.

-Jay
Re: Means of semantic differential scales
2. Perhaps more likely, your boss may have learned (wrongly?) that parametric stats should not be done unless scales of measurement are at least interval in quality.

I don't know if his objection was to parametric statistics per se, but he did object to calculating means on these data, which he believes are only ordinal.

Search on Google for people like John? Gaito and S.S. Stevens and for phrases like "scales of measurement" and "parametric statistics".

Thanks. Will do.

Or, do an Advanced search with groups.google among the sci.stat.* groups for "Stevens, measurement". I think that would find earlier discussions and some references.

As I recall it, no one who pretended to know much would have sided with your boss. The firmness of Stevens's categories was strongly challenged by the early 1950s. In particular, there was Frederick Lord's ridiculing parable of the football jerseys. (Naturally, psychology departments taught the subject otherwise, for quite a while longer.)

Conover, et al., took a lot of the glory out of 'nonparametric tests' by showing that you can't gain much from rank-order transformations, compared to any decent scaling. That was in an article of 1980 or thereabouts.

I may have seen a 'research manual' dated as recently as 1985 that still favored using rank statistics with Likert-scaled items. I am curious as to what more recent endorsements might exist, in any textbooks at all, or in papers by statisticians.

--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
What is a qualitative ordinal variable?
Hi, I have a question about ordinal variables!

I understand that months (Jan., Feb., Mar.) and level of aggression (low, medium, high) can be accepted as qualitative ordinal variables, but my question is: what about variables like seed size when using categories like small, medium and large, or level of mutation as rare and frequent?

Are these variables qualitative? May I use these cases as examples of qualitative ordinal variables? I am unsure because the size of seeds and the frequency of mutations are measurements and counts!

Thanks for any help.

V.

Prof. J. C. Voltolini
Grupo de Estudos em Ecologia de Mamiferos - ECOMAM
Universidade de Taubate - Depto. Biologia
Praca Marcellino Monteiro 63, Bom Conselho, Taubate, SP - BRASIL. 12030-010
TEL: 0XX12-2254165 (lab.), 2254277 (depto.)
FAX: 0XX12-2322947
E-Mail: [EMAIL PROTECTED]
http://www.mundobio.rg3.net/
Re: Means of semantic differential scales
Jay Tanzman wrote:
Jay Warner wrote:
Jay Tanzman wrote:

I just got chewed out by my boss for modelling the means of some 7-point semantic differential scales. The scales were part of a written, self-administered questionnaire, and were laid out like this:

Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful

So, why or why not is it kosher to model the means of scales like this?

-Jay

My boss's objection was that he believes categorically (sorry) that semantic differential scales are ordinal.

1) Why do you think the scale is interval data, and not ordinal or categorical?

Why would anyone think it is ordinal and not interval? Most of the scales were measuring abstract, subjective constructs, such as empathy and satisfaction, for which there is no underlying physical or biological measurement. Why not, then, _define_ degree of empathy as the subjects' rating on a 1-to-7 scale?

Why not indeed?! Of course you can do this - and in fact you are doing this. The question is really - what properties should this variable possess in order that it is meaningful - that is, that it reflects 'reality' meaningfully. If it does not do this, then whatever conclusions you come to about your variable are of no use whatsoever.

It is certainly true that your variable is ordinal. Is it more than this? It is extremely unlikely that it is fully numeric (that is, 'interval') because the difference between 1 and 2 is unlikely to have the same meaning as the difference between 4 and 5. You cannot simply define these differences to be equal - you need your variable to reflect reality!

However, it is probable that the scale is 'reasonably numeric', so the assumption that the variable is interval may be reasonable. But this will be a model, using a number of assumptions - as all these things are. It is important that you recognise this modelling aspect of your data definition.

Regards,
Alan

--
Alan McLean ([EMAIL PROTECTED])
Department of Econometrics and Business Statistics
Monash University, Caulfield Campus, Melbourne
Tel: +61 03 9903 2102 Fax: +61 03 9903 2007
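The modelling point can be made concrete with a small sketch. If only the order of the seven categories is taken as given, then any order-preserving relabelling of 1..7 is an equally valid coding, and a comparison of group means can depend on which coding is used, whereas a comparison of medians cannot. The two groups of ratings and the recoding below are invented purely for illustration.

```python
from statistics import mean, median

# Hypothetical ratings on the 1-to-7 scale for two groups of respondents.
group_a = [3, 3, 3, 3, 3]
group_b = [1, 1, 1, 1, 7]

# An order-preserving recoding: same ranking of categories, but the top
# category is placed much further from the others.
recode = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 20}

a_recoded = [recode[r] for r in group_a]
b_recoded = [recode[r] for r in group_b]

print("means, equal spacing: ", mean(group_a), mean(group_b))
print("means, recoded:       ", mean(a_recoded), mean(b_recoded))
print("medians, equal spacing:", median(group_a), median(group_b))
print("medians, recoded:      ", median(a_recoded), median(b_recoded))
```

With equal spacing group A has the higher mean; with the recoding group B does. Treating the 1-to-7 codes as interval is therefore an assumption about spacing, which is the sense in which it is a model.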
Re: Find PDF of RV with a given mean value
Chia C Chong [EMAIL PROTECTED] wrote in message news:a5g27d$e57$[EMAIL PROTECTED]...

Hi! I have a set of random numbers and if I know their expectation/mean, would it be possible to deduce a PDF to describe the distribution of them?

Knowing the mean tells you (almost) nothing about the form of the PDF. However, if you are considering a particular family of PDFs (for whatever reason), it should usually be possible to specify the mean (in some cases fixing a parameter, in other cases introducing an equation relating the parameters, so that you can reduce the dimension of the parameter vector by 1).

How do I make sure that when I generate these random numbers using the PDF I obtained, it will give me the correct mean/expectation value?

It depends on what you mean here - you must be careful to distinguish between the population mean (which you say is known) and the sample mean.

If you mean "make it so you are generating from a distribution which has the correct population mean", that's taken care of above.

If you mean "generate so the sample mean is equal to the population mean", why would you want to do that? Consider the mean from n rolls of a (hypothetical) fair six-sided die numbered 1 to 6. If it really is fair, I *know* the population mean is 3.5. Yet the sample mean is almost never 3.5, even though I know the population mean exactly. If I wanted to simulate rolls from this die, I would not try to make the sample mean 3.5.

Think on this: Let's assume I want a sample of size 1. To make it have the known mean I have to set it equal to the known mean. Does it come from the right distribution? Not at all! It comes from a distribution with all the probability at the known mean. Now I want to enlarge the sample by adding a second observation. What value will that have? As I keep adding to my sample, I have to keep generating the same value over and over.

(There may be some reason you want to generate in such a way that the sample mean is constant, but I doubt it - and you won't be able to have independent observations if you do.)

Glen
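The die example can be checked exactly rather than by simulation. The sketch below builds the distribution of the sum of n fair rolls by repeated convolution and reads off the probability that the sample mean equals the population mean of 3.5. The function name `prob_sample_mean_equals` is made up for this illustration.

```python
from fractions import Fraction

def prob_sample_mean_equals(n, target=Fraction(7, 2), faces=(1, 2, 3, 4, 5, 6)):
    """Exact P(sample mean of n fair rolls == target)."""
    p = Fraction(1, len(faces))
    dist = {0: Fraction(1)}                 # distribution of the running sum
    for _ in range(n):
        nxt = {}
        for total, prob in dist.items():
            for f in faces:
                nxt[total + f] = nxt.get(total + f, Fraction(0)) + prob * p
        dist = nxt
    # The mean equals target exactly when the sum equals target * n.
    return dist.get(target * n, Fraction(0))

for n in (1, 2, 3, 4, 10, 100):
    print(n, float(prob_sample_mean_equals(n)))
```

For odd n the probability is exactly zero, because the sum would have to be a non-integer, and for even n it shrinks as n grows, even though every sample is drawn from a distribution whose mean is exactly 3.5.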