CRIMCOORD transformation in QUEST

2002-02-26 Thread David Chang

Hi, thank you for reading this message. I am having trouble
getting the correct CRIMCOORD transformation of categorical variables
in the QUEST decision tree algorithm. Your help will be greatly appreciated.

Q1: In Loh & Shih's paper (Split Selection Methods for Classification
Trees, Statistica Sinica, 1997, vol 7, p815-840), they describe
the mapping from a categorical variable to an ordered variable via CRIMCOORD.
But their explanation, in particular step 5 of Algorithm 2, is not
clear. For example, in step 5 they write "Perform a singular value
decomposition of the matrix GFU and let a (vector) be the eigenvector
(of what?) associated with the largest eigenvalue". Does this mean
a (vector) is the eigenvector of transpose(GFU)*GFU?
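
As a general linear-algebra fact (not a reading of what Loh and Shih intend in step 5): for any matrix M, the right singular vector belonging to the largest singular value is the eigenvector of transpose(M)*M with the largest eigenvalue, and that eigenvalue is the square of the singular value. A minimal pure-Python sketch using power iteration follows. The matrix M and all function names are made up for illustration; M is only a stand-in, not the GFU matrix from the paper.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def leading_eigenpair(sym, iters=200):
    """Power iteration on a symmetric positive semidefinite matrix."""
    # Arbitrary nonzero start; it must not be orthogonal to the leading eigenvector.
    v = [float(i + 1) for i in range(len(sym))]
    for _ in range(iters):
        w = matvec(sym, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, matvec(sym, v)))
    return eigenvalue, v

# Hypothetical 3x2 matrix standing in for GFU.
M = [[2.0, 0.0],
     [1.0, 2.0],
     [0.0, 1.0]]

MtM = matmul(transpose(M), M)            # transpose(M)*M, here [[5, 2], [2, 5]]
eigenvalue, a = leading_eigenpair(MtM)   # leading eigenpair of transpose(M)*M

# Length of M*a equals the largest singular value of M.
largest_singular_value = math.sqrt(sum(x * x for x in matvec(M, a)))
```

Whether the paper's vector is this right singular vector or the corresponding left one (an eigenvector of GFU*transpose(GFU)) is exactly the ambiguity raised in Q1; the sketch only shows what the transpose(GFU)*GFU reading would compute.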

Q2: I tried to verify the data sets in Table 1. Data sets I-III are OK, but
the result for data set IV seems to be incorrect. Could any one of you
help me verify that?

Thank you very much for your help !!

David



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Cauchy PDF + Parameter Estimate

2002-02-26 Thread Herman Rubin

In article [EMAIL PROTECTED],
Glen Barnett  [EMAIL PROTECTED] wrote:
> Herman Rubin wrote:
>
>> In article a5daqb$72k$[EMAIL PROTECTED],
>> Chia C Chong [EMAIL PROTECTED] wrote:
>>> Hi!
>>>
>>> Does anyone come across some Matlab code to estimate the parameters for the
>>> Cauchy PDF?? Or some other sources about the method to estimate their
>>> parameters??
>>
>> What is so difficult about maximum likelihood?  Start with a
>> reasonable estimator, and use Newton's method.
>
> There are difficulties with Newton's method (and many other hill-climbing
> techniques) because the Cauchy likelihood function is generally multimodal.
>
> You can end up somewhere other than the MLE unless you use a somewhat more
> sophisticated starting point than a reasonable estimator. There are good
> estimators that can start you off very close to the true maximum, but it's
> a long time since I've seen that literature, so I can't name names right
> now.

The Cauchy likelihood function is frequently multimodal; for
large samples, when estimating the center with known spread,
the probability that it is unimodal is about .13.  However, for
reasonable sample sizes, the other modes will be way out,
and will be small.

For squared error loss, the best translation invariant
estimator (the Pitman estimator) can be computed by a
closed formula, but I would be concerned about the 
numerical error if it is not done using considerably
higher precision.  It can also be done by numerical
integration, which is not that difficult.

However, I believe that the MLE will be rather good
for moderate samples.  The local MLE starting with
quantile estimates should work quite well.  Also, if
one knows it is Cauchy, there are estimators using a
few quantiles which are close to efficient.
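
A minimal sketch of the "local MLE starting with quantile estimates" idea, under simplifying assumptions that are not in the post: only the center is estimated, the spread is treated as known (scale 1), the starting value is the sample median, and the data are a synthetic deterministic sample. The function names are hypothetical, and the concavity guard is an added precaution rather than part of the suggestion above.

```python
import math
import statistics

def cauchy_score(theta, xs, scale=1.0):
    """First derivative of the Cauchy log-likelihood with respect to the center."""
    return sum(2.0 * (x - theta) / (scale ** 2 + (x - theta) ** 2) for x in xs)

def cauchy_score_slope(theta, xs, scale=1.0):
    """Second derivative of the Cauchy log-likelihood with respect to the center."""
    return sum(2.0 * ((x - theta) ** 2 - scale ** 2)
               / (scale ** 2 + (x - theta) ** 2) ** 2 for x in xs)

def cauchy_location_mle(xs, scale=1.0, tol=1e-12, max_iter=50):
    """Newton's method for the center, started from the sample median."""
    theta = statistics.median(xs)
    for _ in range(max_iter):
        slope = cauchy_score_slope(theta, xs, scale)
        if slope >= 0:
            # Log-likelihood is not locally concave here; stop rather than
            # take a Newton step that could move toward another mode.
            break
        step = cauchy_score(theta, xs, scale) / slope
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Synthetic sample: 50 evenly spaced Cauchy quantiles centered at 3,
# plus two extra points so the median is not already the answer.
n = 50
sample = [3.0 + math.tan(math.pi * ((i + 0.5) / n - 0.5)) for i in range(n)]
sample += [4.0, 10.0]

estimate = cauchy_location_mle(sample)
```

Because the likelihood can be multimodal, the result is only the local maximum nearest the starting value, which is why the choice of starting estimator matters in the discussion above.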


-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558





Re: detecting outliers in NON normal data ?

2002-02-26 Thread Herman Rubin

In article 00f301c1be68$13413000$fde9e3c8@oemcomputer,
Voltolini [EMAIL PROTECTED] wrote:
> Hi,
>
> I would like to know if methods for detecting outliers
> using interquartile ranges are indicated for data with
> NON normal distribution.
>
> The software Statistica presents this method:
> data point value > UBV + o.c.*(UBV - LBV)
> data point value < LBV - o.c.*(UBV - LBV)
>
> where: UBV is the 75th percentile and LBV is the 25th percentile.  o.c. is
> the outlier coefficient.
>
> In the biological world many data are not normally distributed and tests
> like Rosner, Dixon and Grubbs (if I am right!) are good just for normally
> distributed data.

Nothing is normally distributed; some may come close.

But are they even good for normally distributed data?  
Why should anyone be concerned about outliers?  If there
are observations produced under the assumed model, they
should be included, no matter how far out they are.  The
only legitimate justification for excluding some data
points is that errors of some kind have occurred in 
producing them, whether they are outliers or inliers.
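
For reference, the interquartile rule quoted at the top of this message can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, the coefficient 1.5 is a common convention rather than something stated in the thread, and Python's `statistics.quantiles` may compute percentiles differently from Statistica. In line with the reply above, the result is a list of points to inspect for recording errors, not a list of points to discard.

```python
import statistics

def iqr_flagged(values, coefficient=1.5):
    """Return the values lying outside the interquartile fences."""
    # LBV = 25th percentile, UBV = 75th percentile (names from the quoted rule).
    lbv, _, ubv = statistics.quantiles(values, n=4)
    spread = ubv - lbv
    lower_fence = lbv - coefficient * spread
    upper_fence = ubv + coefficient * spread
    return [v for v in values if v < lower_fence or v > upper_fence]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = iqr_flagged(data)
```

The rule uses only quartiles, so it makes no explicit normality assumption, but how many points it flags under a correct model depends strongly on how heavy the tails of that model are.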
-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
[EMAIL PROTECTED] Phone: (765)494-6054   FAX: (765)494-0558





Workshop on regression

2002-02-26 Thread Richard L. Scheaffer

MAA PREP Workshop in Statistics for Summer 2002

An Introduction to Statistical Methods Based on Regression

Dates:  June 2 through 7, 2002

Location:  Oberlin College

Presenters:
Richard L. Scheaffer
Department of Statistics
University of Florida
Gainesville, FL 32611
Phone: 352-378-1996
Fax: 352-392-5175
Email: [EMAIL PROTECTED]

Jeffrey A. Witmer 
Mathematics Department
Oberlin College 
King Bldg 205 
Oberlin, OH 44074-1019 
Phone: (440) 775-8381 
Fax: (440) 775-6638 
Email: [EMAIL PROTECTED]

Overview:
Regression, in its many facets, is probably the most widely used statistical
methodology in existence.  It is the basis of modeling, whether the modeling
is directed toward searching for associations among variables in observational
studies or establishing treatment differences in designed experiments. The
workshop will cover the data analytic techniques appropriate to modern use of
regression analysis, as well as the inferential procedures most widely used
with this methodology.  Beginning with establishing principles and concepts
through simple linear regression, the course will build to discussions of
multiple regression, including models involving categorical response
variables.  Regression is an appropriate topic to serve as the basis of a
second course in statistics, for those who have taken or taught an
introductory course.  It need not be calculus based but does rely heavily on
statistical software. 

Outline of workshop content

Regression basics
  Simple linear regression and correlation
  Multiple regression
  Regression diagnostics (residuals, influence, leverage)
  Partial correlation
  Inference for regression coefficients, individually and in subsets
  Model selection
  Logistic regression
Analysis of variance models
  Completely randomized design
  Randomized block design
  Repeated measures design
  Factorial treatment arrangements
Sample survey design and analysis
  Ratio and regression estimators

Main reference text for the course:
Ramsey, F. and Schafer, D. (2002). The Statistical Sleuth, 2nd ed. Belmont,
CA: Duxbury Press.

Technology: 
Regression software will be demonstrated and used by participants.  Many
illustrative statistics applets and data sets from various sources on the web
will be introduced, including those referenced on the American Statistical
Association’s electronic Journal of Statistics Education.  Participants will
have access to computers running standard statistical software and will have
access to the web.

Instructional Format:
Believing in active learning, the presenters will provide many opportunities
for participants to engage in hands-on activities, both with and without the
aid of technology, during the workshop.  A reading list will be supplied to
the participant in advance of the workshop, and time will be devoted to
discussions of this material.  Although the emphasis will be on content, there
will be opportunities for demonstrating and discussing various pedagogical
approaches to teaching regression analysis.   

Cost:
Room and board are provided for all participants through a grant from NSF. 
Participants must fund their own transportation to and from Oberlin.

Applying for Participation:
Applications must be made through MAA at 
http://www.maa.org/pfdev/prep/prep.html
Applications should be sent in by March 31, 2002.



-- 
*
* Richard L. Scheaffer        [EMAIL PROTECTED]
* Department of Statistics    phone 352-392-1941 (#224)
* Box 118545                  fax 352-392-5175
* University of Florida
* Gainesville, FL 32611
*
* 907 NW 21 Terrace           352-378-1996
* Gainesville, FL  32603
*





Re: Means of semantic differential scales

2002-02-26 Thread Jay Tanzman



J. Williams wrote:
 
 On Mon, 25 Feb 2002 15:17:55 -0800, Jay Tanzman [EMAIL PROTECTED]
 wrote:
 
 I just got chewed out by my boss for modelling the means of some 7-point
 semantic differential scales.  The scales were part of a written,
 self-administered questionnaire, and were laid out like this:
 
 Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
 
 So, why or why not is it kosher to model the means of scales like this?
 
 -Jay
 
 You can check it out by reading the pioneers of the semantic
 differential scale.  Osgood, Suci, and Tannenbaum are the authors of
 Measurement of Meaning  which now is published in paperback by the
 University of Illinois Press, Oct. 1990.

Thanks.  I'll do that.  I think one of the above authors also has a website,
though yesterday it crashed my browser.  Then again, my browser was Netscape...

 It may be your boss is a
 stickler on what constitutes a true interval scale. 

Yes, that is it.  See my response to Jay Warner for the details.

-Jay





Re: Means of semantic differential scales

2002-02-26 Thread Rich Ulrich

 
  2. Perhaps more likely, your boss may have learned
  (wrongly?) that parametric stats should not be done unless scales
  of measurement are at least interval in quality.
 
 I don't know if his objection was to parametric statistics per se, but he did
 object to calculating means on these data, which he believes are only ordinal.
 
  Search on google
  for people like John? Gaito and S.S. Stevens and for phrases like
  "scales of measurement" and "parametric statistics".
 
 Thanks.  Will do.
 

Or, do an Advanced search with groups.google
among the sci.stat.* groups for "Stevens, measurement".
I think that would find earlier discussions and some references.
As I recall it, no one who pretended to know much would have
sided with your boss.

The firmness of Stevens's categories was strongly challenged
by the early 1950s.  In particular, there was Frederic Lord's
ridiculing parable of the football jerseys.  (Naturally, psychology
departments taught the subject otherwise, for quite a while longer.)

Conover, et al., took a lot of the glory out of 'nonparametric tests'
by showing that you can't gain much from rank-order 
transformations, compared to any decent scaling.  That was 
in an article of 1980 or thereabouts.

I may have seen a 'research manual' dated as recently as 1985
that still favored using rank statistics with Likert-scaled items.
I am curious as to what more recent endorsements might exist,
in any textbooks at all, or in papers by statisticians.

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





What is a qualitative ordinal variable?

2002-02-26 Thread Voltolini

Hi,

I have a question about ordinal variables!

I understand that months (Jan., Feb., Mar.) and level of aggression (low,
medium, high) can be accepted as qualitative ordinal variables, but
my question is this:

What about variables like seed size when using categories like small, medium
and large, or level of mutation as rare and frequent? Are these variables
qualitative? May I use these cases as examples of qualitative and ordinal
variables?

I am in doubt because the size of seeds and the frequency of mutations are
measurements and counts!

Thanks for any help.

V.



_
Prof. J. C. Voltolini
Grupo de Estudos em Ecologia de Mamiferos - ECOMAM
Universidade de Taubate - Depto. Biologia
Praca Marcellino Monteiro 63, Bom Conselho,
Taubate, SP - BRASIL. 12030-010

TEL: 0XX12-2254165 (lab.), 2254277 (depto.)
FAX: 0XX12-2322947
E-Mail: [EMAIL PROTECTED]
http://www.mundobio.rg3.net/







Re: Means of semantic differential scales

2002-02-26 Thread Alan McLean



Jay Tanzman wrote:
 
 Jay Warner wrote:
 
  Jay Tanzman wrote:
 
   I just got chewed out by my boss for modelling the means of some 7-point
   semantic differential scales.  The scales were part of a written,
   self-administered questionnaire, and were laid out like this:
  
   Not stressful 1__ 2__ 3__ 4__ 5__ 6__ 7__ Very stressful
  
   So, why or why not is it kosher to model the means of scales like this?
  
   -Jay
 
 My boss's objection was that he believes categorically (sorry) that semantic
 differential scales are ordinal.
 
  1)Why do you think the scale is interval data, and not ordinal or
  categorical?
 
 Why would anyone think it is ordinal and not interval?  Most of the scales were
 measuring abstract, subjective constructs, such as empathy and satisfaction, for
 which there is no underlying physical or biological measurement.  Why not, then,
 _define_ degree of empathy as the subjects' rating on a 1-to-7 scale?
 

Why not indeed?! Of course you can do this - and in fact you are doing
this. The question is really - what properties should this variable
possess in order that it is meaningful - that is, that it reflects
'reality' meaningfully. If it does not do this, then whatever
conclusions you come to about your variable are of no use whatsoever.

It is certainly true that your variable is ordinal. Is it more than
this? It is extremely unlikely that it is fully numeric (that is,
'interval') because the difference between 1 and 2 is unlikely to have
the same meaning as the difference between 4 and 5. You cannot simply
define these differences to be equal - you need your variable to reflect
reality! However, it is probable that the scale is 'reasonably numeric',
so the assumption that the variable is interval may be reasonable. But
this will be a model, using a number of assumptions - as all these
things are. 

It is important that you recognise this modelling aspect of your data
definition.

Regards,
Alan





-- 
Alan McLean ([EMAIL PROTECTED])
Department of Econometrics and Business Statistics
Monash University, Caulfield Campus, Melbourne
Tel:  +61 03 9903 2102Fax: +61 03 9903 2007






Re: Find PDF of RV with a given mean value

2002-02-26 Thread Glen

Chia C Chong [EMAIL PROTECTED] wrote in message 
news:a5g27d$e57$[EMAIL PROTECTED]...
 Hi!
 
 I have a set of random numbers and if I know their expectation/mean, would
 it be possible to deduce a PDF to describe the distribution of them? 

Knowing the mean tells you (almost) nothing about the form of the PDF.

However, if you are considering a particular family of PDFs (for
whatever reason), it should usually be possible to specify the mean
(in some cases fixing a parameter, in other cases introducing an
equation relating the parameters, so that you can reduce the dimension
of the parameter vector by 1).
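
A small sketch of the two cases just described, using distribution families chosen purely for illustration (the thread does not name any). For the exponential family, fixing the mean fixes the single parameter; for the gamma family, the relation mean = shape * scale removes one of the two free parameters. The function names, target mean, shape, and seed are all assumptions.

```python
import random

def draw_exponential(mean, rng):
    # One-parameter family: the mean determines the rate, rate = 1 / mean.
    return rng.expovariate(1.0 / mean)

def draw_gamma(mean, shape, rng):
    # Two-parameter family: mean = shape * scale, so scale = mean / shape.
    # The shape remains free; the mean constraint removes one dimension.
    return rng.gammavariate(shape, mean / shape)

rng = random.Random(12345)
target_mean = 2.0

exp_draws = [draw_exponential(target_mean, rng) for _ in range(20000)]
gamma_draws = [draw_gamma(target_mean, 3.0, rng) for _ in range(20000)]

exp_sample_mean = sum(exp_draws) / len(exp_draws)
gamma_sample_mean = sum(gamma_draws) / len(gamma_draws)
```

Both sets of draws come from distributions with population mean 2, yet the two distributions have quite different shapes, which is the sense in which the mean alone says almost nothing about the form of the PDF.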

 How do
 I make sure that when I generate these random numbers using the PDF I
 obtained, it will give me the correct mean/expectation value?

It depends on what you mean here - you must be careful to distinguish
between the population mean (which you say is known) and the sample
mean.

If you mean "make it so you are generating from a distribution which
has the correct population mean", that's taken care of above.

If you mean "generate so the sample mean is equal to the population
mean", why would you want to do that?

Consider the mean from n rolls of a (hypothetical) fair six-sided die
numbered 1 to 6. If it really is fair, I *know* the population mean is
3.5. Yet the sample mean is almost never 3.5, even though I know the
population mean exactly. If I wanted to simulate rolls from this die,
I would not try to make the sample mean 3.5.

Think on this: Let's assume I want a sample of size 1. To make it have
the known mean I have to set it equal to the known mean. Does it come
from the right distribution? Not at all! It comes from a distribution
with all the probability at the known mean. Now I want to enlarge the
sample by adding a second observation. What value will that have? As I
keep adding to my sample, I have to keep generating the same value
over and over.

(There may be some reason you want to generate in such a way that the
sample mean is constant, but I doubt it - and you won't be able to
have independent observations if you do.)
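
The die example above can be checked with a short simulation. This is only a sketch: the function name, the sample sizes, and the seed are arbitrary choices made for illustration.

```python
import random

def sample_means(n_rolls, n_samples, seed=0):
    """Sample means of n_samples independent sets of n_rolls fair die rolls."""
    rng = random.Random(seed)
    return [sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls
            for _ in range(n_samples)]

means = sample_means(n_rolls=10, n_samples=2000)

# Share of samples whose mean is exactly the population mean 3.5.
share_exactly_3_5 = sum(m == 3.5 for m in means) / len(means)

# The sample means scatter around 3.5 even though few of them equal it.
grand_mean = sum(means) / len(means)
```

With an odd number of rolls the sample mean can never equal 3.5 at all, since the total would have to be a half-integer, yet the rolls still come from a distribution whose population mean is exactly 3.5.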

Glen

