On 15 Oct 2003 10:36:01 -0700, [EMAIL PROTECTED] (Richard Hake) wrote:

> In a recent post Dennis Roberts (2003a) wrote (slightly edited):
[ snip, citations]
> I note that many education papers [even those in physics-education
> research (PER)!] continue to employ null-hypothesis testing with its
> "p" values, while eschewing the more widely accepted "effect size" (d),
"... more widely accepted" -- You wish. When the CI for the effect
size *almost* includes zero, you don't have much to say about the
effect size. (1) It is mostly determined by the adequacy of the
design. (2) It is confused, at times, by the distinction between
effect sizes that are 'within' and sizes that are 'between'. (3) The
users are too shy about saying that, in fact, the upper limit (or
sometimes, the point estimate) is not a reasonable estimate. For
instance, I have seen the CI for an odds ratio that ran from 1.04 to
25. The strongest partisans of the effect size are willing to rely on
it entirely, even when the CI *does* include zero. That position, I
think, does not have much support.

> and (would you believe?) even ignoring the half-century-old
> "average normalized gain" <g> [Hovland et al. (1949), Gery (1972),
> Hake (1998a,b; 2002a,b)].
[ snip, various]
> Regarding the half-century-old average normalized gain <g>, in Hake
> (2003b) I wrote [see that article for the references, bracketed by
> lines "HHHHHHH. . . ."]:
>
> HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
> The normalized gain "g" for a treatment is defined [Hovland et al.
> (1949), Gery (1972), Hake (1998a)] as g = Gain/[Gain(maximum
> possible)]. Thus, e.g., if a class averaged 40% on the pretest and
> 60% on the posttest, then the class-average normalized gain
> <g> = (60% - 40%)/(100% - 40%) = 20%/60% = 0.33. Ever since the
> work of Hovland et al. (1949) it's been known by the pre/post
> cognoscenti (up until about 1998, probably fewer than 100 people
> worldwide) that <g> IS A MUCH BETTER INDICATOR OF THE EXTENT TO
> WHICH A TREATMENT IS EFFECTIVE THAN IS EITHER GAIN OR POSTTEST. For
> example, if the treatment yields <g> > 0.3 for a mechanics course,
> then the course could be considered as in the
> "interactive-engagement zone" (Hake 1998a, Meltzer 2002b).
>
> Regrettably, the psychology/education/psychometric PEP community
> [see e.g., Pelligrino et al. (2001); Shavelson & Towne (2001);
> Fox & Hackermann (2002); Feuer et al. (2002)] remains largely
> oblivious of PER and the normalized gain. Paraphrasing Lee
> Schulman, as quoted by
[ snip, rest]

Some of us consider that there are serious questions of scaling, and
that the challenges are not met by our data. I tried your sort of
scaling on my own, 20-odd years ago, on symptom data (the Hamilton
Rating Scale for Depression, especially). I sort of figured, going
into my exploration, that the data would be too noisy; so I was
impressed by the fact that I could judge, eventually, that *points*
were a more consistent criterion than "percent improvement." One
patient might have twice the symptoms of another, but the rates of
improvement were comparable in *points*.

Are there areas where that will work better? Probably. Are there
areas where it will not work as well? I am sure. There is a huge
number of problems where there is no absolute maximum or minimum, or
where the arbitrary rating scale being used does not extend to that
limit.

Another set of data that impressed me *negatively* about fractional
scoring involved symptoms collected on the IMPS (inpatient, serious
symptoms). In the midst of big variances which I had not yet
explained, I could see some big 'fractional' differences, one group
twice the other. That sort of difference does concern me. Then I
checked the standardization of the test and discovered that the
rating scale was 'bottoming out' -- there was almost no patient in
the sample who scored in the range of pathology. *I* don't want the
groups to test 'different' when the comparison comes down to
two-patients-here versus one-patient-there.

For justification of p-values and testing, I will again recommend
Robert P. Abelson, "Statistics as Principled Argument."

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
"Taxes are the price we pay for civilization."
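[Editor's note: the quoted Hake definition, and the scaling concern raised in
this reply, can both be made concrete with a short sketch. The function name
and the ceiling check below are illustrative additions, not from the post;
only the 40% -> 60% example and the formula itself come from the quoted text.]

```python
# Class-average normalized gain <g> as defined in the quoted Hake passage:
#   <g> = Gain / Gain(maximum possible) = (post% - pre%) / (100% - pre%).

def normalized_gain(pre_pct: float, post_pct: float) -> float:
    """Return <g> for class-average pre/post scores given in percent."""
    if pre_pct >= 100.0:
        # Illustrative guard: at ceiling, the maximum possible gain is zero.
        raise ValueError("pretest at ceiling: maximum possible gain is zero")
    return (post_pct - pre_pct) / (100.0 - pre_pct)

# The 40% -> 60% example from the quoted passage:
print(round(normalized_gain(40.0, 60.0), 2))  # 0.33

# The scaling concern in miniature: a 5-point change near the ceiling
# yields the same <g> as a 50-point change from the floor.
print(normalized_gain(90.0, 95.0))  # 0.5
print(normalized_gain(0.0, 50.0))   # 0.5
```

Note how the denominator shrinks as the pretest score rises, which is one
way the "fractional" differences discussed above can be magnified when a
scale is near its limit.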
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
http://jse.stat.ncsu.edu/
=================================================================
