On 12 Mar 2003 05:28:02 -0800, [EMAIL PROTECTED] (Robert J. MacG. Dawson) wrote:
> Rich Ulrich wrote:
> >
> > google statistics -
> >    heteroscedastic   7420      homoscedastic   2900
> >    heteroskedastic   7500      homoskedastic   2140
>
>    Sample     X       N   Sample p
>    1       7420   14920   0.497319
>    2       2900    5040   0.575397
>
>    Estimate for p(1) - p(2):  -0.0780778
>    95% CI for p(1) - p(2):  (-0.0939076, -0.0622480)
>    Test for p(1) - p(2) = 0 (vs not = 0):  Z = -9.67  P-Value = 0.000
>
>    Showing a difference in mean usage of between 6% and 9%,
>    statistically significant at any p-value you care to name.
>
>    I wonder why?  My best guess [ break]

Let's call that "nominally statistically significant."

At first pass in reviewing any google data, we detect enormous
redundancies.  The same site shows up with multiple pages: different
versions, apparently, though the three lines of context that google
cites can be exactly the same words.  And the number reported as
"about" is not a count of what google itself thinks is unique.  If
you follow a google report that says "about 62 items found" to its
end, it will say something like, "33 items shown; the rest are very
similar, and you can have them shown if you click here for a new
search."  (I have done that before, in order to get a "text" version
of data that I couldn't copy readily from the HTML version at the
main site.)  And there is a further dependency: the same text can be
quoted at literally hundreds of sites.

(These particular words don't seem to invoke the foreign-language
problems of some comparisons, where one word in the comparison is a
name, or some other legitimate word, in German or French or another
language with a moderate number of web sites scanned by google.)

Those are practical arguments about the failure of the p-value.  I
don't know how heavily they should weigh for these particular words.
I do know that, having seen quite a few google comparisons, I would
not trust the test reported above.

Look at the other numbers.  No reason leaps to my mind why the counts
should be "significantly" different when you merely add -ity to each
word, yet compare what Robert reported with the counts for the -ity
forms:

> >    heteroscedastic   7420      homoscedastic   2900
> >    heteroskedastic   7500      homoskedastic   2140

     heteroscedasticity:  24,900     homoscedasticity:  4170
     heteroskedasticity:  19,800     homoskedasticity:  2110

I think the amount of difference between these is a better guide to
how reliable the counts are.  For these N's, the effective standard
error is probably 5 percentage points, rather than something under 1
point, as the z-test above implies.

=== more google trivia

I used groups.google for the latter (-ity) words, since their counts
were higher, and it returned the biggest margin yet in favor of
hetero-sk, and the biggest margin for homo-sc:

     heteroscedasticity:  500     homoscedasticity:  275
     heteroskedasticity:  585     homoskedasticity:   62

These numbers were small enough that I skipped to google's last page
for each word, to see how many hits were left after google "omitted
entries very similar to the above."  The ratios are similar this
time, but shrinking by 50% is quite a bit more than what I have
usually seen:

     heteroscedasticity:  236     homoscedasticity:  111
     heteroskedasticity:  299     homoskedasticity:   33
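For anyone who wants to check the arithmetic, here is a short Python
sketch.  It is mine, not Robert's Minitab session, and the function
names are made up for illustration; it reproduces the unpooled
two-proportion z-test from the raw counts, redoes the comparison with
an effective standard error of 5 percentage points, and then
recomputes the -sc shares from the groups.google counts before and
after the "very similar" entries are omitted.

from math import sqrt, erf

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_prop_test(x1, n1, x2, n2):
    # Unpooled two-proportion z-test, the form Minitab reports.
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = diff / se
    p_two_sided = 2.0 * (1.0 - norm_cdf(abs(z)))
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, se, z, p_two_sided, ci

# The -sc spellings as a share of all (-sc plus -sk) hits:
# hetero, 7420 of 14920; homo, 2900 of 5040.
diff, se, z, p, ci = two_prop_test(7420, 14920, 2900, 5040)
print(f"diff = {diff:.4f}, SE = {se:.4f}, Z = {z:.2f}, p = {p:.3f}")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
# Prints Z = -9.67 and the same CI as the Minitab output above.

# If duplication means the effective standard error is really about
# 5 percentage points, the same difference is unremarkable:
print(f"Z with a 5-point SE: {diff / 0.05:.2f}")   # about -1.6

# groups.google counts as (sc, sk) pairs, before and after google
# "omitted entries very similar to the above": the raw counts shrink
# by roughly half, but the -sc share within each stem stays roughly
# similar.
raw   = {"hetero": (500, 585), "homo": (275, 62)}
dedup = {"hetero": (236, 299), "homo": (111, 33)}
for stem in ("hetero", "homo"):
    for label, (sc, sk) in (("raw", raw[stem]), ("dedup", dedup[stem])):
        print(f"{stem:6s} {label:5s}: sc share = {sc / (sc + sk):.2f}")

With the nominal standard error the difference sits more than nine
standard errors from zero; with a 5-point effective SE it is about a
standard error and a half, which is nothing worth a headline.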
-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html