On 12 Mar 2003 05:28:02 -0800, [EMAIL PROTECTED] (Robert J. MacG.
Dawson) wrote:

> Rich Ulrich wrote:
> 
> 
> > google statistics -
> > heteroscedastic  7420    homoscedastic 2900
> > heteroskedastic  7500    homoskedastic 2140
> 
> 
> Sample      X      N  Sample p
> 1        7420  14920  0.497319
> 2        2900   5040  0.575397
> 
> Estimate for p(1) - p(2):  -0.0780778
> 95% CI for p(1) - p(2):  (-0.0939076, -0.0622480)
> Test for p(1) - p(2) = 0 (vs not = 0):  Z = -9.67  P-Value = 0.000
> 
>       Showing a difference in mean usage of between 6% and 9%, statistically
> significant at any p-value you care to name.
> I wonder why?  My best guess  [ break]

Let's call that "nominally, statistically significant."
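
For anyone who wants to check the arithmetic, here is a minimal
Python sketch of that two-proportion z-test (scipy is an assumed
dependency; the unpooled standard error is my guess at what the
quoted output used, since it reproduces Z = -9.67):

  from math import sqrt
  from scipy.stats import norm

  # Google hit counts quoted above; "-sc-" spellings count as successes
  x1, n1 = 7420, 7420 + 7500   # hetero- words
  x2, n2 = 2900, 2900 + 2140   # homo- words

  p1, p2 = x1 / n1, x2 / n2
  diff = p1 - p2                               # about -0.0780778
  # unpooled (Wald) standard error of the difference
  se = sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
  z = diff / se                                # about -9.67
  pval = 2 * norm.sf(abs(z))                   # two-sided p-value
  ci = (diff - 1.96*se, diff + 1.96*se)        # about (-0.0939, -0.0622)

  print(f"Estimate for p(1) - p(2): {diff:.7f}")
  print(f"95% CI: ({ci[0]:.7f}, {ci[1]:.7f})")
  print(f"Z = {z:.2f}  P-Value = {pval:.3g}")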

At a first pass in reviewing any Google data, we detect 
enormous redundancies.  The same site shows up
with multiple pages - different versions, apparently, but 
what Google shows when it cites three lines can be exactly 
the same words.  The number reported as "about" is
not a count of what Google considers unique.  If you follow 
to the end of a Google report that says "about 62 items 
found," it will say something like, "33 items shown; 
the rest are very similar, and you can have them shown 
if you click here for a new search."  (I have done that 
before, in order to get a "text" version of data that 
I couldn't copy readily from the HTML version at the
main site.)

And then there is further dependency: the same text can be 
quoted at literally hundreds of sites.  

(These particular words don't seem to invoke the foreign-
language problems of some comparisons, where one
word in the comparison is a name, or some other legitimate
word, in German or French or another language having
a moderate number of web sites scanned by Google.)

Those are practical arguments about the failure of the 
p-value.  I don't know how heavily they should weigh
for these words.  I do know that I have seen quite a
few Google comparisons, and I would not trust the 
test that is reported above.

Look at the other numbers reported by Robert.
No reason leaps to my mind why you would get
counts that are "significantly" different
when you merely add  -ity  to each word.


> > heteroscedastic  7420    homoscedastic 2900
> > heteroskedastic  7500    homoskedastic 2140

 heteroscedasticity:  24,900   homoscedasticity:  4,170
 heteroskedasticity:  19,800   homoskedasticity:  2,110

I think that the amount of difference between these is
a better guide to how reliable these counts are. 
For these N's, the effective standard error is probably 5
percentage points, rather than the something-less-than-1.0
implied by the z-test above.
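
To put a number on that, here is a small Python sketch (counts as
quoted above) of the "-sc-" share of hits under each suffix; merely
adding -ity shifts the share by roughly 6 and 9 percentage points,
far more than the sub-1-point standard error the formal test assumes:

  # "-sc-" share of Google hits for each word pair (counts quoted above)
  pairs = {
      "hetero -ic":  (7420, 7500),    # heteroscedastic,    heteroskedastic
      "hetero -ity": (24900, 19800),  # heteroscedasticity, heteroskedasticity
      "homo -ic":    (2900, 2140),    # homoscedastic,      homoskedastic
      "homo -ity":   (4170, 2110),    # homoscedasticity,   homoskedasticity
  }
  share = {k: sc / (sc + sk) for k, (sc, sk) in pairs.items()}
  for k, p in share.items():
      print(f"{k:12s} -sc- share = {p:.3f}")

  # the suffix alone moves the share by about 6 and 9 points:
  print(f"hetero shift: {share['hetero -ity'] - share['hetero -ic']:+.3f}")
  print(f"homo   shift: {share['homo -ity'] - share['homo -ic']:+.3f}")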


=== more Google trivia
I used groups.google for the latter (-ity) words, since their
counts were higher, and it returned the biggest margin yet
in favor of hetero-sk, and the biggest margin for homo-sc:

 heteroscedasticity:  500   homoscedasticity:  275
 heteroskedasticity:  585   homoskedasticity:   62

These were small enough numbers that I skipped to Google's
last page for each word, to see how many hits were left after 
Google "omitted entries very similar to the above."
The ratios are similar this time, though shrinking by 50%
is quite a bit more than what I have usually seen:

 heteroscedasticity:  236   homoscedasticity:  111
 heteroskedasticity:  299   homoskedasticity:   33
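
As a rough check, a short Python sketch of how much of each count
survives the de-duplication, and of the sc-vs-sk margins before and
after it:

  # groups.google counts before/after omitting "very similar" entries
  before = {"het-sc": 500, "hom-sc": 275, "het-sk": 585, "hom-sk": 62}
  after  = {"het-sc": 236, "hom-sc": 111, "het-sk": 299, "hom-sk": 33}

  for word in before:
      print(f"{word}: {after[word] / before[word]:.0%} of hits kept")
  # roughly 40% to 53% kept, i.e. shrinking by about half

  # the relative margins are broadly similar before and after:
  print(f"hetero sk/sc: {585/500:.2f} -> {299/236:.2f}")   # 1.17 -> 1.27
  print(f"homo   sc/sk: {275/62:.2f} -> {111/33:.2f}")     # 4.44 -> 3.36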

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html