In responding to Rich, I'll intersperse selected comments with
selected portions of his text and append his entire post below.
To begin with, Rich doesn't say why he wants to do a
significance test. He just knows he likes doing such a test.
To wit:
>Given group A and group B, I can do a t-test. Or something.
>That will give me a quantification that I did not have before.
The implication seems to be that any kind of "quantification"
that you did not "have before" is good.
Such a quantification, Rich seems to think, is better than
simply looking at the fundamental characteristics of the numbers,
in a situation where the numbers are representative of the
entire population of interest.
It seems that he is bothered by the notion that we (Hausman and
I) feel that citation counts of 7000 are more impressive than
citation counts of 1500. This is a question of the "utility
scale" of citation counts as indicators of academic quality.
He fails to say how computing a p-level from a test
comparing male and female biologists tells us anything
about this.
But he does have a very special, very personal "minitheory
of measurement" that he is now going to elucidate.
>I am pointing out: Jim claimed that the productivity of the Men was
>impressively greater than that of the women; and that was an act of
>inference on his part. So, his act is screwed up, twice: He does a
>bad deduction / wrong inference (by ignoring p-level -- in this
>instance, apparently ignoring the strong impact of an outlier), and
>then he wrongly claims immunity from the standards of inference.
>That is, he ought NOT to use means when there are huge outliers that
>mess up the t-test; and he ought to find a way to use a p-level for
>support.
>
>I have said this a number of times: if you extract meaning, if you
>make inferences, then you are treating the population as a sample.
Rich claims that my "act is screwed up..."
Why?
Because I didn't compute a p-level!
Sophisticated readers (and even some unsophisticated ones) will
recognize that this is an entirely
circular attack, with a little ad hominem thrown in
for good measure. Here is the form of the argument:
a. S should have done a t-test and gotten a p-level.
b. Why? Because his act is screwed up, because he didn't get a
p-level.
But Ulrich's not finished. After this entirely circular "argument,"
he then provides a "mini-theory" of measurement.
"I have said this a number of times: if you extract meaning, if you
make inferences, then you are treating the population as a sample."
Here it is not clear what he means by inferences. But the statement
is clearly false! One CAN look at a small, fixed population. One can
make valid and useful inferences about the status of that population
without hypothesis testing or any other formal inferential
procedure.
Example: "40% of the residents of Puerto Visto died in an avalanche."
"This was a tragedy."
These statements involve going to the village, gathering data, and
recording a result and an interpretation. Now, Rich might say that,
prior to deciding this is a tragedy, I must compute the standard error
of the proportion and test the hypothesis that the percentage of dead
people exceeds that normally found in some reference population. This
would make about as much sense as computing a t-test in the MIT
example.
Now, in some loose sense, my judgment that this is a tragedy
does depend on my knowledge that such things are unusual with
reference to a population. But this is little more than saying
that all human judgment is based on experience.
>That is what we do in science, and what we do in almost any occasion
>where we are publishing for people who are not 'administration.'
>And that is why we seldom use the set of statistics for 'finite
>populations' and why we do use tests of inference.
That is what we do if we don't know any better.
I presented a very simple example of the
"Gorks", a mythical kingdom in which there are only
6 workers, and productivity can be assessed perfectly
on a ratio scale via something called "blurk production".
The productivity figures in my original example are:
Female    Male
--------------
93        89.5
92        90
91        90.5
In this kingdom, the women all produce more blurks
than any man, yet they are paid the same. They think they
should be paid more. And of course, under the stipulations
of the problem, they should. But a randomization test
would declare the differences "nonsignificant" and
deny them their raise.
This would seem simple and straightforward. Notice how
Rich manages to make the situation hopelessly complex
by dragging in all sorts of aspects that are both irrelevant
and immaterial.
>In detail: I think that it is a wretched example.
You will note, he offers no reason for calling the example
"wretched."
>I still can't figure out what it is supposed to exemplify.
It is supposed to exemplify the fact that a randomization test
of whether classification (M, F) and blurk productivity are randomly
related is irrelevant to the question of whether, in this
group, Women Gorks are more productive than Male Gorks!
How many times do I have to say this?
>But I can comment on the problem.
>
>===== problem summarized
>Productivity,
>Females: (91, 92, 93)
>Males: (89.5, 90, 90.5)
> Why should not Females be paid more, if that's what matters?
>======
He's summarized the problem. Now, ignoring all substantive
aspects of the problem, and the undeniable fact that these
uniformly more productive Gork women should be paid more,
Ulrich returns to his original premise, and gets lost
in the technicalities.
>Based on a t-test, Females might test as having a higher mean.
>With a few more cases, that difference would be 'significant' with
>either parametric or rank-testing.
Yes, we know that, but it is irrelevant.
>
>But if the natural meaning of production is being used,
>then there would be a natural zero, and one should OBSERVE:
>all of these scores are confined to a tiny per-cent range.
Doesn't matter. The women are more productive, and should
be paid more. How MUCH more can be argued. It would,
in a rational Gork world, depend on several factors,
related to the utility of blurks.
>In fact, the range seems too tiny to be real. Eventually, I
>conclude that I don't understand the mechanism of generating the
>scores, and/ or someone has been 'cooking the books' or faking
>the numbers.
Rich cannot deal with the example! His sole critique
involves (a) the fact that I didn't do what he wants, and
(b) the fact that the example is artificial.
Simple artificial examples
often help people who are willing to learn to see fallacies
of reasoning. They can be an excellent tool.
>
>If there were a few more subjects added to each Sex, in
>the same narrow range and pattern, I would conclude that there
>DEFINITELY was something phony going on.
I've already said, as a stipulation of the problem, that this
is the entire population of the Gorks.
>
>If pay is to be meritocratic, that would seem to justify a TINY
>difference in wages. Nothing about quality. Piece work, I assume.
Yes, if you assume a perfectly linear utility function, the female
Gorks would, indeed, receive a small pay raise, and be making a
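slightly higher salary than the male Gorks. But the randomization test
would fail to be significant at the .01 level, and is borderline
at the .05 level! With 3 women and 3 men there are only 20 ways to
split the six Gorks into two groups of three, so the one-sided p-level
can never fall below 1/20 = .05, and the two-sided p-level is .10.
So a two-sided randomization test would say that you *fail to reject*
the null hypothesis. Note that this would still be true if the
production were as follows: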
Gork example 2. Blurk Production.
Female    Male
--------------
100       2
99        1
98        0
--------------
In this example, Female Gorks produce, on average, 99 blurks for every
one produced by the men. AND, this is the entire Gork population.
But Rich's "logic" [if he restricts himself to a randomization
test] would state that the Females cannot yet be
declared "significantly different from the men" even though these
are *all the Gorks*. So they cannot get a raise. Rich suggests
that, had we a few more Gorks, we could take a leap and give the
ladies a raise. This will, of course, never happen because
the problem stipulated that there are no more Gorks.
Rich invokes his "immutable law of hypothesis testing" and
denies the women a raise.
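
For anyone who wants to check the arithmetic, here is a minimal
sketch of an exact randomization test applied to Gork examples 1
and 2. The function name randomization_p_levels and the choice of
the difference in group means as the test statistic are my own
assumptions for illustration; nothing here is taken from Rich's
post, and other test statistics could be substituted.

```python
from itertools import combinations


def randomization_p_levels(group_a, group_b):
    """Exact randomization test on the difference in group means.

    Enumerates every way of relabelling the pooled scores into groups of
    the original sizes. Returns (one_sided_p, two_sided_p), where the
    one-sided alternative is "group_a has the larger mean".
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    tol = 1e-12  # guards against floating-point noise in the comparisons
    n_splits = one_sided = two_sided = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = a_sum / n_a - (total - a_sum) / n_b
        n_splits += 1
        if diff >= observed - tol:
            one_sided += 1
        if abs(diff) >= abs(observed) - tol:
            two_sided += 1
    return one_sided / n_splits, two_sided / n_splits


# Gork example 1 and Gork example 2 (Female, Male)
example_1 = randomization_p_levels([93, 92, 91], [89.5, 90, 90.5])
example_2 = randomization_p_levels([100, 99, 98], [2, 1, 0])
print(example_1)
print(example_2)
```

Both examples give the same pair of p-levels, because the
randomization test sees only the ordering of the six Gorks across
the 20 possible splits, not the size of the gap between the groups.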
Now, Rich might retort that I am wrong, that he would do a t-test,
and thereby make the right decision in this case.
But the fundamental error of Rich's conviction about hypothesis
testing would be made most clear if, in fact, there were only 2
Gorks in existence.
Gork example 3. Blurk Production
Female    Male
--------------
1000      0
In this case, it seems, Rich would have to deny the woman
a raise since the randomization test doesn't reject, and
the t-test cannot be done. In this two-person society,
women do all the production, and men split the income,
courtesy of statistical rigidity.
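
The reason no data, however lopsided, can rescue the randomization
test in these tiny populations is simple counting. A rough sketch,
assuming a one-sided test and no ties at the extreme: the observed
split is always one of the C(N, n) equally likely relabellings, so
the p-level can never be smaller than 1 / C(N, n). The function name
smallest_p_level is hypothetical, chosen only for this illustration.

```python
from math import comb


def smallest_p_level(n_female, n_male):
    """Smallest one-sided p-level an exact randomization test can reach.

    The observed assignment is itself one of comb(N, n_female) possible
    relabellings, so it always counts as "at least as extreme" once.
    """
    return 1 / comb(n_female + n_male, n_female)


print(smallest_p_level(3, 3))  # Gork examples 1 and 2: 0.05
print(smallest_p_level(1, 1))  # Gork example 3: 0.5
print(smallest_p_level(6, 6))  # a hypothetical 6 vs 6 population
```

So in the two-Gork society the test cannot do better than a coin
flip, no matter how many blurks the woman produces.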
--Jim Steiger
--------------------
James H. Steiger, Professor
Department of Psychology
University of British Columbia
Vancouver, B.C., Canada V6T 1Z4
----------------------
Note: I urge all members of this list to read
the following and inform themselves carefully
of the truth about the MIT Report on the Status
of Women Faculty.
Patricia Hausman and James Steiger Article,
"Confession Without Guilt?" :
http://www.iwf.org/news/mitfinal.pdf
Judith Kleinfeld's Article Critiquing the MIT Report:
http://www.uaf.edu/northern/mitstudy/#note9back
Original MIT Report on the Status of Women Faculty:
http://mindit.netmind.com/proxy/http://web.mit.edu/fnl/
On Sun, 18 Feb 2001 19:03:24 -0500, Rich Ulrich <[EMAIL PROTECTED]>
wrote:
>I am going to try to stick to the statistics-related parts, in
>replying to Jim Steiger.
>With a fake user-name, JS wrote on Thu, 15 Feb 2001 17:34:15 GMT,
>[EMAIL PROTECTED] (Irving Scheffe):
>
>JS > "Rich:
> "To be blunt, although your comments in this forum are often
>valuable, you fell far short of two cents worth this time.
> "This is not a popularity contest, it is a statistical argument. "
>
> - I say, if your 'statistical argument' about 'populations' is
>rejected by large (and growing) fraction of all statisticians, then I
>think you do have to go back to defend your textbook, or show how your
>argument differs from what I think it is. That's what I was getting
>at by mentioning textbooks.
>
>
>< snip, verbiage; Jim cited me, RU >
>> > - and if you want to know something about how unlikely it was to
>> >get means that extreme, you can randomize. Do the test.
>JS >
> "a. You do *have* means 'that extreme.'
> "b. There is no 'likelihood' to be considered, because the entire
>population is available. We were assessing the original MIT conjecture
>that to imply there were important performance differences between
>male and female biologists AT MIT would be 'the last refuge of the
>bigot.'"
>
>Given group A and group B, I can do a t-test. Or something.
>That will give me a quantification that I did not have before.
>
>Is such a test interesting? - If I am really in a 'population'
>circumstance, that question can hardly arise; I would know that
>the test tells me nothing. It has nothing to do with taking a vote,
>or providing services to a fixed population.
>
>Why does Jim call some means 'extreme'? - in a theoretical
>'population', you have means that *exist*. Right now, I think that
>it is difficult to justify applying any such adjectives, if you regard
>the set of numbers as a 'population.'
>
>I am pointing out: Jim claimed that the productivity of the Men was
>impressively greater than that of the women; and that was an act of
>inference on his part. So, his act is screwed up, twice: He does a
>bad deduction / wrong inference (by ignoring p-level -- in this
>instance, apparently ignoring the strong impact of an outlier), and
>then he wrongly claims immunity from the standards of inference.
>That is, he ought NOT to use means when there are huge outliers that
>mess up the t-test; and he ought to find a way to use a p-level for
>support.
>
>I have said this a number of times: if you extract meaning, if you
>make inferences, then you are treating the population as a sample.
>That is what we do in science, and what we do in almost any occasion
>where we are publishing for people who are not 'administration.'
>And that is why we seldom use the set of statistics for 'finite
>populations' and why we do use tests of inference.
>
>JS >
> "So, my countercomments to you are:
> "1. Rather than snipping the Gork example, deal with it. Explain,
>in detail, why the Gork women shouldn't be paid more than the men.
>My prediction: you can't, and you won't."
>
>In detail: I think that it is a wretched example.
>I still can't figure out what it is supposed to exemplify.
>But I can comment on the problem.
>
>===== problem summarized
>Productivity,
>Females: (91, 92, 93)
>Males: (89.5, 90, 90.5)
> Why should not Females be paid more, if that's what matters?
>======
>Based on a t-test, Females might test as having a higher mean.
>With a few more cases, that difference would be 'significant' with
>either parametric or rank-testing.
>
>But if the natural meaning of production is being used,
>then there would be a natural zero, and one should OBSERVE:
>all of these scores are confined to a tiny per-cent range.
>In fact, the range seems too tiny to be real. Eventually, I
>conclude that I don't understand the mechanism of generating the
>scores, and/ or someone has been 'cooking the books' or faking
>the numbers.
>
>If there were a few more subjects added to each Sex, in
>the same narrow range and pattern, I would conclude that there
>DEFINITELY was something phony going on.
>
>If pay is to be meritocratic, that would seem to justify a TINY
>difference in wages. Nothing about quality. Piece work, I assume.
>
>Sampling of 3 versus 3 is small N; it is far worse than 6 vs 6.
>
>If this is supposed to be about 'statistical power':
>In the MIT citation data, the "large difference" between M and F
>*would* be significant if there weren't something fishy.
>
>
>< snip, rest. #2 and #3 - (2) seems to have been answered, and
>(3) seems to be a contentious followup to the artificial example that
>I scarcely understand in the first place. >