On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton <[email protected]> wrote:
> 2009/8/21 Anthony <[email protected]>:
> > "Is this article vandalized?" is a yes/no question...
>
> True, but that isn't actually the question that this research tried to
> answer. It tried to answer "How much time has this article spent in a
> vandalised state?".

"When one downloads a dump file, what percentage of the pages are actually in a vandalized state?"

"This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?"

That's the question I was referring to.

> If we are only interested in whether the most
> recent revision is vandalised then that is a simpler problem but would
> require a much larger sample to get the same quality of data.

How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much less than most people suspect. That much I know.

I found one problem with my use of http://www.raosoft.com/samplesize.html: I was specifying a margin of error of 5%. But that's an absolute margin of error. So if it were 0.2% vandalism, that'd be 0.2% plus or minus 5%. Obviously unacceptable. However, the response distribution would then be 0.2%. This still would require 7649 samples for a 95% confidence plus or minus 0.1%.

If the vandalism turned out to be more prevalent though, and I suspect it would, we could for instance be 95% confident plus or minus 0.5% if the response distribution was 0.5% and we had 765 samples.

_______________________________________________
foundation-l mailing list
[email protected]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
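
For anyone who wants to check the sample sizes quoted in the message, here is a minimal sketch. It assumes the calculator uses the usual normal-approximation formula for a proportion with a finite population correction, z = 1.96 for 95% confidence, and a population of 3,000,000 pages. The population figure and the function name `sample_size` are assumptions for illustration; the message does not state either.

```python
import math


def sample_size(p, margin, population, z=1.96):
    """Samples needed to estimate a proportion p to within +/- margin.

    p          -- expected proportion (the "response distribution"), e.g. 0.002
    margin     -- absolute margin of error, e.g. 0.001 for +/- 0.1%
    population -- total number of pages (assumed value, not from the message)
    z          -- normal critical value; 1.96 corresponds to 95% confidence
    """
    # Variance term of the normal approximation, scaled by z squared.
    x = z ** 2 * p * (1 - p)
    # Finite population correction; for a huge population this tends to x / margin**2.
    n = population * x / ((population - 1) * margin ** 2 + x)
    return math.ceil(n)


PAGES = 3_000_000  # assumed population size

# 0.2% vandalism, +/- 0.1% absolute margin
print(sample_size(0.002, 0.001, PAGES))

# 0.5% vandalism, +/- 0.5% absolute margin
print(sample_size(0.005, 0.005, PAGES))
```

Under those assumptions the two calls give 7649 and 765, matching the figures in the message. Note that `margin` is absolute, which is exactly the pitfall described: a 5% margin on a 0.2% rate says almost nothing.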
