Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/21 Anthony wikim...@inbox.org:

If we are only interested in whether the most recent revision is vandalised then that is a simpler problem, but it would require a much larger sample to get the same quality of data.

How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much less than most people suspect. That much I know.

I have a Masters degree in Mathematics, so I know a little about the subject. (I didn't study much statistics, but you can't do 4 years of Maths at Uni without getting some basic understanding of it.) You say it requires 7649 articles, which sounds about right to me. If we looked through the entire history (or just the last year or 6 months or something, if you want just recent data) then we could do it with significantly fewer articles. I'm not sure how many we would need, though.

I think we need to know the distribution of how long a randomly chosen article spends in a vandalised state before we can work out what the distribution of the average would be. My statistics isn't good enough to even work out what kind of distribution it is likely to be, and I certainly can't guess at the parameters. It obviously ranges between 0% and 100%, with the mean somewhere close to 0% (0.4% seems like a good estimate), and will presumably have a long tail (truncated at 100%). There are articles that spend their entire life in a vandalised state (attack pages, for example), and there is a chance we'll completely miss such a page and it will last the entire length of the survey period, so the probability density at 100% won't be 0. I'm sure there is a distribution that satisfies those requirements, but I don't know what it is.

___ foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
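The sample-size question being debated here can be made concrete with the standard formula for estimating a proportion. The numbers below are purely illustrative; the 7649 figure quoted in the thread presumably came from a similar calculation whose exact confidence level and margin are not stated, so the inputs here are assumptions.

```python
import math

def sample_size(p=0.5, margin=0.01, z=1.96):
    """Minimum sample size to estimate a proportion p to within
    +/- margin, at the confidence level implied by z (1.96 ~ 95%)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Worst case (p = 0.5), 95% confidence, +/- 1 percentage point:
print(sample_size())                        # 9604
# For a rare event near 0.4%, a much tighter absolute margin is
# needed before the estimate means anything:
print(sample_size(p=0.004, margin=0.0015))  # 6803
```

Note the sample size depends on the margin and confidence chosen, not on the population size, which is why a few thousand pages suffice even for millions of articles.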
[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
I am supposed to be taking a wiki-vacation to finish my PhD thesis and find a job for next year. However, this afternoon I decided to take a break and consider an interesting question recently suggested to me by someone else: When one downloads a dump file, what percentage of the pages are actually in a vandalized state? This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?

Understanding what fraction of Wikipedia is vandalized at any given instant is obviously of both practical and public relations interest. In addition, it bears on the motivation for certain development projects like flagged revisions. So, I decided to generate a rough estimate.

For the purposes of making an estimate I used the main namespace of the English Wikipedia and adopted the following operational approximations: I considered that vandalism is that thing which gets reverted, and that reverts are those edits tagged with revert, rv, undo, undid, etc. in the edit summary line. Obviously, not all vandalism is cleanly reverted, and not all reverts are cleanly tagged. In addition, some things flagged as reverts aren't really addressing what we would conventionally consider to be vandalism. Such caveats notwithstanding, I have had some reasonable success with using a revert heuristic in the past. With the right keywords one can easily catch the standardized comments created by admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like rv, rvv, etc. It won't be perfect, but it is a quick way of getting an automated estimate. I would usually expect the answer I get in this way to be correct within an order of magnitude, and perhaps within a factor of a few, though it is still just a crude estimate.

I analyzed the edit history up to the mid-June dump for a sample of 29,999 main namespace pages (sampling from everything in main, including redirects).
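A keyword heuristic of the sort described above can be sketched in a few lines. The post does not give the actual pattern used, so this regular expression is only an illustration of the approach, built from the summary phrases the post mentions.

```python
import re

# Illustrative keyword list based on the phrases mentioned in the post
# (revert, rv, rvv, undo, undid, rollback); not the actual set used.
REVERT_RE = re.compile(r"\b(revert(ed|ing)?|rvv?|undo|undid|rollback)\b",
                       re.IGNORECASE)

def looks_like_revert(edit_summary):
    """True if an edit summary matches one of the revert keywords."""
    return bool(REVERT_RE.search(edit_summary or ""))

print(looks_like_revert("rvv"))                                        # True
print(looks_like_revert("Undid revision 160400298 by 75.133.82.218"))  # True
print(looks_like_revert("expand history section"))                     # False
```

As the post notes, such a matcher misses untagged manual reverts and flags some non-vandalism reverts, so it is only a crude estimator.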
The sample included 1,333,829 edits, from which I identified 102,926 episodes of reverted vandalism. As a further approximation, I assumed that whenever a revert occurred, it applied to the immediately preceding edit and any additional consecutive changes by the same editor (this is how admin rollback operates, but it is not necessarily true of tools like undo). With those assumptions, I then used the timestamps on my identified intervals of vandalism to figure out how much time each page had spent in a vandalized state.

Over the entire history of Wikipedia, this sample of pages was vandalized during 0.28% of its existence. Or, more relevantly, focusing on just this year, vandalism was present 0.21% of the time, which suggests that one should expect 0.21% of mainspace pages in any recent enwiki dump to be in a vandalized state (i.e. 1 in 480). (Note that since redirects represent 55% of the main namespace and are rarely vandalized, one could argue that 0.37% [1 in 270] would be a better estimate for the portion of actual articles that are in a vandalized condition at any given moment.)

I also took a look at the time distribution of vandalism. Not surprisingly, it has a very long tail. The median time to revert over the entire history is 6.7 minutes, but the mean time to revert is 18.2 hours, and my sample included one revert going back 45 months (though such very long lags also imply the page had gone years without any edits, which would imply an obscure topic that was also almost never visited). In the recent period these figures become 5.2 minutes and 14.4 hours for the median and mean respectively. The observation that nearly 50% of reverts occur in 5 minutes or less is a testament to the efficient work of recent changes reviewers and watchlists. Unfortunately, the 5% of vandalism that persists longer than 35 hours is responsible for 90% of the actual vandalism a visitor is likely to encounter at random.
Hence, as one might guess, it is the vandalism that slips through and persists the longest that has the largest practical effect. It is also worth noting that the prevalence figures for February-May of this year are slightly lower than at any time since 2006. There is also a drop in the mean duration of vandalism coupled to a slight increase in the median duration. However, these effects mostly disappear if we limit our considerations to only vandalism events lasting 1 month or shorter. Hence those changes may be in significant part linked to cut-off biasing from longer-term vandalism events that have yet to be identified. The ambiguity in the change from earlier in the year is somewhat surprising as the AbuseFilter was launched in March and was intended to decrease the burden of vandalism. One might speculate that the simple vandalism amenable to the AbuseFilter was already being addressed quickly in nearly all cases and hence its impact on the persistence of vandalism may already have been fairly limited. I've
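The fraction-of-time statistic Rohde describes (turning vandalized/reverted timestamp pairs into a percentage of a page's lifetime) can be computed roughly as follows. This is a simplified sketch, not his actual code, and it assumes the vandalism intervals have already been identified.

```python
from datetime import datetime

def vandalized_fraction(spells, start, end):
    """Fraction of the window [start, end] a page spent vandalized,
    given non-overlapping (vandalized_at, reverted_at) datetime pairs."""
    window = (end - start).total_seconds()
    bad = sum(
        (min(reverted, end) - max(vandalized, start)).total_seconds()
        for vandalized, reverted in spells
        if reverted > start and vandalized < end)
    return bad / window

start, end = datetime(2009, 1, 1), datetime(2009, 7, 1)
# One hypothetical 10-minute vandalism spell in a six-month window:
spells = [(datetime(2009, 3, 1, 12, 0), datetime(2009, 3, 1, 12, 10))]
print(vandalized_fraction(spells, start, end))
```

Averaging this fraction over a page sample gives a figure comparable to the 0.21%/0.28% numbers above, subject to the same caveats about how the intervals were detected.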
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Robert, thanks for this. I have long wanted that number: it is really interesting.

-Original Message- From: Robert Rohde raro...@gmail.com Date: Thu, 20 Aug 2009 03:06:06 To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org; English Wikipedia wikie...@lists.wikimedia.org Cc: Sean Moss-Pultz s...@openmoko.com; s...@parc.com Subject: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

[snip]
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 6:06 AM, Robert Rohde raro...@gmail.com wrote: [snip] When one downloads a dump file, what percentage of the pages are actually in a vandalized state?

You don't actually answer that question, though; you answer a different one: [snip] approximations: I considered that vandalism is that thing which gets reverted, and that reverts are those edits tagged with revert, rv, undo, undid, etc. in the edit summary line. Obviously, not all vandalism is cleanly reverted, and not all reverts are cleanly tagged.

Which is interesting too, but part of the problem with calling this a measure of vandalism is that it isn't really one, and we don't really have a good handle on how solid an approximation it is beyond gut feelings and arm-waving.

The study of Wikipedia activity is a new area of research, not something that has been studied for decades. Not only do we not know many things about Wikipedia, but we don't know many things about how to know things about Wikipedia. There must be ways to get a better understanding, but we may not know of them, and the ones we do know of are not always used. For example, we could increase our confidence in this type of proxy-measure by taking a random subset of the data and having humans classify it based on agreed-upon criteria. By performing the review process many times we could get a handle on the typical error of both the proxy-metric and the meta-review.

The risk here is that people will misunderstand these shorthand metrics as the real deal, and the risk is increased when we encourage it by using language which suggests that the simplistic understanding is the correct one. IMO, highly uncertain and/or outright wrong information is worse than not knowing, when you aren't aware of the reliability of the information.
We can't control how the press chooses to report on research, but when we actively encourage misunderstandings by playing up the significance or generality of our research, our behaviour is unethical. Vigilance is required.

This risk of misinformation is increased many-fold in comparative analysis, where factors like time are plotted against indicators, because we often miss confounding variables (http://en.wikipedia.org/wiki/Confounding). Stepping away from your review for a moment, because it wasn't primarily a comparative one, I'd like to make some general points. For example, if research finds that edits are more frequently reverted over time, is this because there has been a change in the revision decision process, or is it because articles have become better and more complete over time and edits to long, high-quality articles have always been more likely to be reverted? Both are probably true, but how does the contribution break down? There are many other possibly significant confounding variables, probably many more than any of us have thought of yet.

I've always been of the school of thought that we do research to produce understanding, not just to generate numbers. "Wikipedia becomes more complete over time, so there is less work for new people to do" and "Wikipedia is increasingly hostile towards new contributors" are pretty different understandings, but both may be supported by the same data, at least until you've controlled for many factors.

Another example: because of the scale of Wikipedia we must resort to proxy-metrics. We can't directly measure vandalism, but we can measure how often someone adds "is gay" over time. Proxy-metrics are powerful tools but can be misleading.
If we're trying to automatically identify vandalism for a study (either to include it or exclude it), we run the risk that the vandals are adapting to automatic identification: if you were using "is gay" as a measure of vandalism over time, you might conclude that vandalism is decreasing when in reality cluebot is performing the same kind of analysis for its automatic vandalism suppression, and the vandals have responded by vandalizing in forms that can't be automatically identified, such as by changing dates to incorrect values.

Or, keeping the goal of understanding in mind, sometimes the measurements can all be right but a lack of care and consideration can still cause people to draw the wrong conclusions. For example, English Wikipedia has adopted a much stronger policy about citations in articles about living people than it once had. It is *intentionally* more difficult to contribute to those articles, especially for new contributors who do not know the rules, than it once was.

Going back to your simple study now: the analysis of vandalism duration and its impact on readers makes an assumption about readership which we know to be invalid. You're assuming a uniform distribution of readership: that readers are just as likely to read any random article. But we know that the actual readership follows a power-law (long-tail) distribution. Because of the failure to consider traffic levels we can't draw conclusions on how
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 12:06 PM, Robert Rohde raro...@gmail.com wrote: Given the nature of the approximations I made in doing this analysis, I suspect it is more likely that I have somewhat underestimated the vandalism problem rather than overestimated it, but as I said in the beginning I'd like to believe I am in the right ballpark. If that's true, I personally think that having less than 0.5% of Wikipedia be vandalized at any given instant is actually rather comforting. It's not a perfect number, but it would suggest that nearly everyone still gets to see Wikipedia as intended rather than in a vandalized state. (Though to be fair I didn't try to figure out if the vandalism occurred in more frequently visited parts or not.)

Thanks for the excellent analysis, Robert. Just to give an idea of what 0.4% means in practice, you can think in terms of one country, 12 US counties, 33 Italian municipalities, 147 French municipalities or 1 Pope.

Cruccone
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Robert Rohde wrote: When one downloads a dump file, what percentage of the pages are actually in a vandalized state? This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision?

Is there a possibility of re-running the numbers to include traffic weightings? I would hypothesize from experience that if we adjust the random page selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results. I think we would see a lot less (percentagewise) vandalism that persists for a really long time, for precisely the reason you identified: most vandalism that lasts a long time, lasts a long time because it is on obscure pages that no one is visiting. That doesn't mean it is not a problem, but it does change some thinking about what kinds of tools are needed to deal with that problem. I'm not sure what else would change.
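The re-weighting proposed here amounts to replacing a per-page average with a page-view-weighted one. A toy sketch, with entirely made-up numbers, shows why the weighting matters:

```python
def traffic_weighted_rate(pages):
    """pages: (daily_views, vandalized_fraction) pairs.
    Returns the probability that a random page *view* (rather than a
    random page) lands on a vandalized revision."""
    total_views = sum(views for views, _ in pages)
    return sum(views * frac for views, frac in pages) / total_views

# Hypothetical: a heavily watched page that is rarely left vandalized,
# and an obscure page that sat vandalized 20% of the time.
pages = [(100_000, 0.0005), (50, 0.20)]
print(traffic_weighted_rate(pages))           # ~0.0006, dominated by the busy page
print(sum(f for _, f in pages) / len(pages))  # ~0.1, the unweighted per-page average
```

Under these made-up numbers the long-lived vandalism on the obscure page barely moves the view-weighted rate, which is the point being made in the thread.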
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Gregory Maxwell wrote: If you were using "is gay" as a measure of vandalism over time you might conclude that vandalism is decreasing when in reality cluebot is performing the same kind of analysis for its automatic vandalism suppression and the vandals have responded by vandalizing in forms that can't be automatically identified, such as by changing dates to incorrect values.

And if that's true, that's on net a bad thing. Most "is gay" vandalism (not all) is just stupid and embarrassing, and it will be obvious to the reader as vandalism; lots of people get how Wikipedia works and are reasonably tolerant of seeing that sort of thing from time to time. But people expect that we should get the dates right, and they are right to ask that of us. I understand that you're just making up a hypothetical, not saying that this is what is actually happening. I'm just agreeing with this line of thinking that says, in essence, when we think about measuring vandalism, which is already hard enough, we also have to think about how damaging different kinds of vandalism actually are.

Greg, I think your email sounded a little negative at the start, but not so much further down. I think you would join me heartily in being super grateful for people doing this kind of analysis. Yes, some of it will be primitive and will suffer from the many difficulties. But data-driven decisionmaking is a great thing, particularly when we are cognizant of the limitations of the data we're using. I just didn't want anyone to get the idea (and I'm sure I'm reading you right) that you were opposed to people doing research. :-)

--Jimbo
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
While the time and effort that went into Robert Rohde's analysis is certainly extensive, the outcomes are based on so many flawed assumptions about the nature of vandalism and vandalism reversion that one publicizes the key finding of a 0.4% vandalism rate at one's peril.

http://en.wikipedia.org/w/index.php?title=John_McCain&diff=169808394&oldid=169720853 11 hours. Reverted with no tags.

http://en.wikipedia.org/w/index.php?title=Maria_Cantwell&diff=prev&oldid=160400298 46 days. Reverted with note: Undid revision 160400298 by 75.133.82.218. By the way, there was a two-minute vandalism in the interim, so in many cases, just because an analyst finds a recent and short incident, he or she may be completely missing a longer-term incident.

http://en.wikipedia.org/w/index.php?title=Ted_Stevens&diff=prev&oldid=170850508 There goes your rvv theory. In this case, rvv was a flag for even more preposterous vandalism.

The notion that these are lightly-watched or lightly-edited articles is a bit difficult to swallow, since they are the biographical articles about three United States senators. These articles were analyzed by an independent team of volunteers, and we found that the 100 senatorial articles were in deliberate disrepair about 6.8% of the time, which would vastly differ from Rohde's analysis. Certainly, one could argue that articles about political figures may be vandalized more often, but one might also counter that argument with the assumption that more eyes ought to be watching these articles and repairing them. More detail here: http://www.mywikibiz.com/Wikipedia_Vandalism_Study

Admittedly, there were some minor flaws with our study's methodology, too. These are reviewed on the Discussion page. But, as with Rohde's assessment, if anything, we may have underrepresented the problem at 6.8%. I remain unimpressed with Wikipedia's accuracy rate, and I am bewildered why flagged revisions have not been implemented yet.
Greg
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 12:59 PM, Gregory Kohs thekoh...@gmail.com wrote: While the time and effort that went into Robert Rohde's analysis is certainly extensive, the outcomes are based on so many flawed assumptions about the nature of vandalism and vandalism reversion that one publicizes the key finding of a 0.4% vandalism rate at one's peril. http://en.wikipedia.org/w/index.php?title=John_McCain&diff=169808394&oldid=169720853 11 hours. Reverted with no tags.

The best part about that little exchange is http://en.wikipedia.org/w/index.php?title=John_McCain&diff=next&oldid=169906715 wherein a revert was made returning the vandalism, followed by another when the editor noticed his error.

I don't think Robert made any firm conclusions on the meaning of his data; he notes all the caveats that others have since emphasized, and admits to likely underestimating vandalism. I read the 0.4% as representing the approximate number of articles containing vandalism in an English Wikipedia snapshot; that is quite different from the amount of time specific articles stay in a vandalized state. Given the difficulty of accurately analyzing this sort of data, no firm conclusions can be drawn; but certainly its more informative than a Wikipedia Review analysis of a relatively small group of articles in a specific topic area.

Nathan
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 12:46 PM, Jimmy Wales jwa...@wikia-inc.com wrote: [snip] Greg, I think your email sounded a little negative at the start, but not so much further down. I think you would join me heartily in being super grateful for people doing this kind of analysis. Yes, some of it will be primitive and will suffer from the many difficulties. But data-driven decisionmaking is a great thing, particularly when we are cognizant of the limitations of the data we're using. I just didn't want anyone to get the idea (and I'm sure I'm reading you right) that you were opposed to people doing research. :-)

Absolutely. No one who has done this kind of analysis could fail to appreciate the enormous amount of work that goes into making even a couple of simple, seemingly off-the-cuff numbers out of the mountain of data that is Wikipedia. Making sure the numbers are accurate and meaningful, while also clearly explaining the process of generating them, is in and of itself a large amount of work, and my gratitude is extended to anyone who contributes to those processes.

I've long been a loud proponent of data-driven decision making. So I'm absolutely not opposed to people doing research, but just as you said, we need to be acutely aware of the limitations of the research. Weak data is clearly better than no data, but only when you are aware of the strength of the data. Or, in other words, knowing what you don't know is often *the most critical* piece of information in any decision making process. In our eagerness to establish what we can and do know, it can be easy to forget how much we don't know. Some of the limitations which are all too obvious to researchers are less than obvious to people who've never personally done quantitative analysis on Wikipedia data, yet many of these people are the decision makers that must do something useful with the data. The casual language used when researchers write for researchers can magnify misunderstandings.
It was merely my intent to caution against the related risks.

I think the most impactful contributions available to researchers today lie less in the direct research itself and more in advancing the art of researching Wikipedia. But the two go hand in hand; we can't advance the art if we don't do the research. The methodological work is less sexy and not prone to generating headlines, but it is work that will last and generate citations for a long time. Measurements of X today will soon be forgotten as they are replaced by later analysis of the historical data using superior techniques.

That my tone was somewhat negative is only due to my extreme disappointment that our own discussion of recent measurements has been almost entirely devoid of critical analysis. Contributors patting themselves on the back and saying I told you so! seem to be outnumbering suggestions that the research might mean something else entirely, though perhaps that is my own bias speaking. To the extent that I'm wrong about that, I hope that my comments were merely redundant; to the extent that I'm right, I hope my points will invite nuanced understanding of the research and encourage people to seek out and expose potentially confounding variables and bad proxies so that all our knowledge can be advanced.

If this stuff were easy it would all be done already. Wikipedia research is interesting because it is both hard and potentially meaningful. There is room and need for contributions from everyone. Cheers!
[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
There is another way to detect 100% reverts. It won't catch manual reverts that are not 100% accurate, but most vandal patrollers will use undo and the like. For every revision, calculate the md5 checksum of the content. Then you can easily look back, say, 100 revisions to see whether this checksum occurred earlier. It is efficient and unambiguous. This will work for any Wikipedia for which a full archive dump is available.

Erik Zachte
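Erik's checksum idea might look something like the following sketch (hypothetical code, operating on a list of revision texts in chronological order):

```python
import hashlib

def identity_reverts(revisions, window=100):
    """Find revisions whose text is byte-identical to an earlier revision
    within `window` steps back; each hit is reported as a
    (reverting_index, restored_index) pair. Consecutive duplicates
    (null edits) are ignored."""
    hashes = []
    hits = []
    for i, text in enumerate(revisions):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        for j in range(max(0, i - window), i - 1):
            if hashes[j] == digest:
                hits.append((i, j))
                break
        hashes.append(digest)
    return hits

history = ["stable text", "STABLE TEXT IS GAY", "stable text"]
print(identity_reverts(history))  # [(2, 0)]
```

As noted later in the thread, an exact-match detector like this also fires on edit wars fought with undo/rollback, so identical-content matches are not always vandalism reverts.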
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Nathan said: ...but certainly its (sic) more informative than a Wikipedia Review analysis of a relatively small group of articles in a specific topic area. And you are certainly entitled to a flawed opinion based on incorrect assumptions, such as ours being a Wikipedia Review analysis. But, nice try at a red herring argument.

Greg
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/20 Erik Zachte erikzac...@infodisiac.com: There is another way to detect 100% reverts. It won't catch manual reverts that are not 100% accurate, but most vandal patrollers will use undo and the like. For every revision, calculate the md5 checksum of the content. Then you can easily look back, say, 100 revisions to see whether this checksum occurred earlier. It is efficient and unambiguous.

A slightly less effective method would be to use the page size in bytes; this won't give the precise one-to-one matching, but as I believe it's already calculated in the data it might well be quicker. One other false positive here: edit warring where one or both sides is using undo/rollback. You'll get the impression of a lot of vandalism without there necessarily being any.

-- Andrew Gray andrew.g...@dunelm.org.uk
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 11:23 AM, Erik Zachte erikzac...@infodisiac.com wrote: There is another way to detect 100% reverts. It won't catch manual reverts that are not 100% accurate, but most vandal patrollers will use undo and the like. For every revision, calculate the md5 checksum of the content. Then you can easily look back, say, 100 revisions to see whether this checksum occurred earlier. It is efficient and unambiguous. This will work for any Wikipedia for which a full archive dump is available. Erik Zachte

Luca's WikiTrust could easily reveal this info.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 1:30 PM, Gregory Kohs thekoh...@gmail.com wrote: Nathan said: ...but certainly its (sic) more informative than a Wikipedia Review analysis of a relatively small group of articles in a specific topic area. And you are certainly entitled to a flawed opinion based on incorrect assumptions, such as ours being a Wikipedia Review analysis. But, nice try at a red herring argument. Greg

Well, you can understand where I would get that idea, since the URL you provided had Wikipedia Review members performing the research until you changed it a few minutes ago: http://www.mywikibiz.com/index.php?title=Wikipedia_Vandalism_Study&diff=90806&oldid=89479

My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators.

Nathan
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Apologies to Nathan regarding the Wikipedia Review description. The analysis team was, indeed, recruited via Wikipedia Review; however, almost all of the participants in the research have now departed or reduced their participation in Wikipedia Review to such a degree that I don't personally consider it to have been a Wikipedia Review effort at all. I allowed my personal opinions to interfere with my recollection of the facts, though, and that's not kosher. Again, I hope you'll accept my apology. I still maintain, however, that any study of the accuracy or the vandalized nature of Wikipedia content will be far more reliable and meaningful if human assessment is the underlying mechanism of analysis, rather than a bot or script that will simply tally things up. I think that Rohde's design was inherently flawed, and I'm happy that Greg Maxwell and I both immediately recognized the danger of running off and reporting the good news, as Sue Gardner was apparently ready to do immediately. As I said, I feel that Rohde proceeded with research based on several highly questionable assumptions, while the 100 Senators research rather carefully outlined a research plan that carried very few assumptions, other than that you trust the analysts to intelligently recognize vandalism or not. Nathan, by praising Rohde's work and disparaging my own, you seem to be suggesting that you would prefer to live inside a giant mountain of sticks and twigs rather than in a small, pleasantly furnished 12' x 12' room. I just don't understand that line of thinking. I'd rather have a small amount of reliable data based on a stable premise than a giant pile of data based on an unstable premise. Greg
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawr...@gmail.com wrote: My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators. Any automated method of finding vandalism is doomed to failure. I'd say its informativeness was precisely zero. Greg's analysis, on the other hand, was informative, but it was targeted at a much different question than Robert's. if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of vandalism, though. On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales jwa...@wikia-inc.com wrote: Is there a possibility of re-running the numbers to include traffic weightings? definitely should be done I would hypothesize from experience that if we adjust the random page selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results. I think we'd see drastically different results. I think we would see a lot less (percentagewise) vandalism that persists for a really long time for precisely the reason you identified: most vandalism that lasts a long time, lasts a long time because it is on obscure pages that no one is visiting. Agreed. On the other hand, I think we'd also see that pages with more traffic are more likely to be vandalized. Of course, this assumes a valid methodology. Using admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like rv, rvv, etc. 
to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/20 Gregory Maxwell gmaxw...@gmail.com: Going back to your simple study now: The analysis of vandalism duration and its impact on readers makes an assumption about readership which we know to be invalid. You're assuming a uniform distribution of readership: that readers are just as likely to read any random article. But we know that the actual readership follows a power-law (long-tail) distribution. Because of the failure to consider traffic levels we can't draw conclusions on how much vandalism readers are actually exposed to. We're also assuming a uniform distribution of vandalism, as it were. There are a number of different types of vandalism: obscene defacement, malicious alteration of factual content, meaningless test edits of a character or two, schoolkids leaving messages for each other... ...and it all has a different impact on the reader. This has two implications: a) It seems safe to assume that replacing the entire article with john is gay is going to get spotted and reverted faster, on average, than an edit providing a plausible-sounding but entirely fictional history for a small town in Kansas. So any changes in the pattern of the *content* of vandalism are going to lead to changes in the duration and thus the overall frequency of it, even if the number of vandal edits is constant. b) We can easily compare the difference in effect of vandalism being left on differently trafficked pages for various times - roughly speaking, time * traffic = number of readers affected. If some vandalism is worse than others, we could thus also calculate some kind of intensity metric - one hundred people viewing enormous genital piercing images on [[Kitten]] is probably worse than ten thousand people viewing asdfdfggfh at the end of a paragraph in the same article. I'm not sure how we'd go ahead with the second one, but it's an interesting thing to think about.
-- Andrew Gray andrew.g...@dunelm.org.uk
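Andrew's point (b) reduces to arithmetic. A toy version of the intensity metric might look like the following; the severity weights and all the numbers are pure assumptions for illustration:

```python
def readers_affected(duration_hours, views_per_hour):
    # Andrew's rough estimate: time * traffic = number of readers affected
    return duration_hours * views_per_hour

def intensity(duration_hours, views_per_hour, severity):
    # severity is a hypothetical per-incident weight (say, 100 for obscene
    # defacement, 0.5 for stray gibberish); the scale is purely illustrative
    return severity * readers_affected(duration_hours, views_per_hour)

# the two hypothetical [[Kitten]] cases: 100 readers of a shock image
# versus 10,000 readers of "asdfdfggfh"
shock_image = intensity(duration_hours=1, views_per_hour=100, severity=100)
gibberish = intensity(duration_hours=10, views_per_hour=1000, severity=0.5)
```

With these made-up weights the shock image scores higher despite reaching a hundredth as many readers, which is the intuition the message describes; the hard, unsolved part is choosing defensible severity weights.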
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 2:10 PM, Anthony wikim...@inbox.org wrote: On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawr...@gmail.com wrote: My point (which might still be incorrect, of course) was that an analysis based on 30,000 randomly selected pages was more informative about the English Wikipedia than 100 articles about serving United States Senators. Any automated method of finding vandalism is doomed to failure. I'd say its informativeness was precisely zero. Greg's analysis, on the other hand, was informative, but it was targeted at a much different question than Robert's. if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of vandalism, though. Only in dreadfully obvious cases can you look at a revision by itself and know it contains vandalism. If the goal is really to characterize whether any vandalism has persisted in an article from any time in the past, then one really needs to look at the full edit history to see what has been changed / removed over time. Even at the level of randomly sampling 1000 revisions, doing a real evaluation of the full history is likely to be impractical for any manual process. If, however, you restrict yourself to asking whether 1000 edits contributed vandalism, then you have a relatively manageable task, and one that is more closely analogous to the technical program I set up. If it helps, one can think of what I did as trying to characterize reverts and detect the persistence of new vandalism rather than vandalism in general. And of course, only new vandalism could be fixed by an immediate rollback / revert anyway.
Qualitatively I tend to think that vandalism that has persisted through many intervening revisions is in a rather different category than new vandalism. Since people rarely look at or are aware of an article's ancient past, such persistent vandalism is at that point little different than any other error in an article. It is something to be fixed, but you won't usually be able to recognize it as a malicious act. On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales jwa...@wikia-inc.com wrote: Is there a possibility of re-running the numbers to include traffic weightings? definitely should be done Does anyone have a nice comprehensive set of page traffic aggregated at, say, a month level? The raw data used by stats.grok.se, etc. is binned hourly, which opens one up to issues of short-term fluctuations, but I'm not at all interested in downloading 35 GB of hourly files just to construct my own long-term averages. I would hypothesize from experience that if we adjust the random page selection to account for traffic (to get a better view of what people are actually seeing) we would see slightly different results. I think we'd see drastically different results. If I had to make a prediction, I'd expect one might see numerically higher rates of vandalism and shorter average durations, but otherwise qualitatively similar results given the same methodology. I agree though that it would be worth doing the experiment. I think we would see a lot less (percentagewise) vandalism that persists for a really long time for precisely the reason you identified: most vandalism that lasts a long time, lasts a long time because it is on obscure pages that no one is visiting. Agreed. On the other hand, I think we'd also see that pages with more traffic are more likely to be vandalized. Of course, this assumes a valid methodology. Using admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like rv, rvv, etc.
to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless. Yes, as I acknowledged above, new vandalism. My personal interest is also skewed in that direction. If you don't like it and don't find it useful, feel free to ignore me and/or do your own analysis. Vandalism that has persisted through many revisions is a qualitatively different critter than most new vandalism. It's usually hard to identify, even by a manual process, and is unlikely to be fixed except through the normal editorial process of review, fact-checking, and revision. When vandalism is new people are at least paying attention to it in particular, and all vandalism starts out that way. Perhaps it would be more useful if you think of this work as a characterization of revert statistics? Anyway, I provided my data point and described what I did so others could judge it for themselves. Regardless of your opinion, it addressed an issue of interest to me, and I would hope others also find some useful insight in it.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/20 Jimmy Wales jwa...@wikia-inc.com: Robert Rohde wrote: When one downloads a dump file, what percentage of the pages are actually in a vandalized state? This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision? Is there a possibility of re-running the numbers to include traffic weightings? I'd like to see that data too. I'm sure you are right that vandalism doesn't last as long on popular pages, but it would be very interesting to see how much quicker it is reverted and how popular a page needs to be for that to apply (or if it is a gradual improvement).
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Robert Rohde wrote: Does anyone have a nice comprehensive set of page traffic aggregated at say a month level? The raw data used by stats.grok.se, etc. is binned hourly which opens one up to issues of short-term fluctuations, but I'm not at all interested in downloading 35 GB of hourly files just to construct my own long-term averages. I don't have every article, but I have the data for July 09 for ~600,000 pages on enwiki (mostly articles). It also has the hit counts for redirects aggregated with the article; not sure if that would be more or less useful for you. Let me know if you want it; it's in a MySQL table on the toolserver right now. -- Alex (wikipedia:en:User:Mr.Z-man)
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 6:36 PM, Robert Rohde raro...@gmail.com wrote: On Thu, Aug 20, 2009 at 2:10 PM, Anthonywikim...@inbox.org wrote: if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision The best way to answer that question would be with a manually processed random sample taken from a pre-chosen moment in time. As few as 1000 revisions would probably be sufficient, if I know anything about statistics, but I'll let someone with more knowledge of statistics verify or refute that. The results will depend heavily on one's definition of vandalism, though. Only in dreadfully obvious cases can you look at a revision by itself and know it contains vandalism. If the goal is really to characterize whether any vandalism has persisted in an article from any time in the past, then one really needs to look at the full edit history to see what has been changed / removed over time. I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. Of course, this assumes a valid methodology. Using admin rollback, the undo function, the revert bots, various editing tools, and commonly used phrases like rv, rvv, etc. to find vandalism is heavily skewed toward vandalism that doesn't last very long (or at least doesn't last very many edits). It's basically useless. Yes, as I acknowledged above, new vandalism. New vandalism which has not yet been reverted wouldn't be included. My personal interest is also skewed in that direction. If you don't like it and don't find it useful, feel free to ignore me and/or do your own analysis. I do. I also feel free to criticize your methods publicly, since you decided to share them publicly. Anyway, I provided my data point and described what I did so others could judge it for themselves. 
Regardless of your opinion, it addressed an issue of interest to me, and I would hope others also find some useful insight in it. And I presented my criticism, which hopefully others will find some useful insight in as well.
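Anthony's procedure — check only the latest revision, and dig into the history only when vandalism is found — can use a binary search over the history for the "when was it added" step, provided the vandalism persists once introduced (if it was removed and re-added, a linear scan is needed instead). A sketch, with hypothetical names:

```python
def first_vandalized(revisions, is_vandalized):
    """Binary-search a page history (oldest revision first) for the first
    revision where `is_vandalized` returns True. Both names here are
    hypothetical. Assumes the vandalism, once introduced, persists in
    every later revision; if it was removed and re-added, fall back to
    a linear scan."""
    if not is_vandalized(revisions[-1]):
        return None  # current revision is clean: nothing to trace back
    lo, hi = 0, len(revisions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_vandalized(revisions[mid]):
            hi = mid      # already vandalized here: look earlier
        else:
            lo = mid + 1  # still clean here: it was added later
    return lo  # index of the revision that introduced the vandalism
```

The binary search needs only O(log n) revision checks per vandalized page, which matters when the check is a human reading a diff rather than a script.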
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/20 Anthony wikim...@inbox.org: I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/20 Anthony wikim...@inbox.org: I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year. No need for the article to be referenced at all, but yes, it would be time consuming, or at least person-time consuming. On the other hand, it'd answer the question in a way that an automated process never could (assuming I've got my statistical analysis right, anyway: http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence level for 664 random samples out of 3 million, but I'm not sure what "response distribution" means).
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/20 Anthony wikim...@inbox.org: I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year. It's not just facts. There are many ways to degrade the quality of an article (such as removing entire sections) that would be invisible if one looks at only one revision. Anthony seems to be talking about a question of article accuracy (unless I am misreading him). That is an overlapping issue with addressing vandalism, but there are a significant number of ways to commit vandalism that nonetheless have nothing to do with impairing the resulting article's accuracy. -Robert Rohde
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/21 Anthony wikim...@inbox.org: On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/20 Anthony wikim...@inbox.org: I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year. No need for the article to be referenced at all, but yes, it would be time consuming, or at least person-time consuming. You mean you could go and find references for the information yourself? I suppose you could, but that is completely impractical. On the other hand, it'd answer the question in a way that an automated process never could (assuming I've got my statistical analysis right, anyway: http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence level for 664 random samples out of 3 million, but I'm not sure what "response distribution" means). The site looks like it is for surveys made up of yes/no questions; I don't think it is going to apply to this.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 7:13 PM, Robert Rohde raro...@gmail.com wrote: On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/20 Anthony wikim...@inbox.org: I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year. It's not just facts. There are many ways to degrade the quality of an article (such as removing entire sections) that would be invisible if one looks at only one revision. I guess that's true. People could be removing facts, for instance, which wouldn't be apparent by looking at one revision. So such an analysis would potentially understate actual vandalism. But at least we'd know in which direction the percentage is potentially wrong. And anecdotally, I don't think the understatement would be significant. There's also the question of whether or not we want to count an article which had a fact removed a few years ago and never re-added to be a vandalized revision. Anthony seems to be talking about a question of article accuracy (unless I am misreading him). I'm attempting to best answer the question if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision, which I take to have nothing whatsoever to do with the number of reverts. That is an overlapping issue with addressing vandalism, but there are a significant number of ways to commit vandalism that nonetheless have nothing to do with impairing the resulting article's accuracy. Significant number? I can only think of a handful.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 7:20 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/21 Anthony wikim...@inbox.org: On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/20 Anthony wikim...@inbox.org: I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year. No need for the article to be referenced at all, but yes, it would be time consuming, or at least person-time consuming. You mean you could go and find references for the information yourself? I suppose you could, but that is completely impractical. My God. If a few dozen people couldn't easily determine to a relatively high degree of certainty what portion of a mere 0.03% of Wikipedia's articles are *vandalized*, how useless is Wikipedia? On the other hand, it'd answer the question in a way that an automated process never could (assuming I've got my statistical analysis right, anyway: http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence level for 664 random samples out of 3 million, but I'm not sure what "response distribution" means). The site looks like it is for surveys made up of yes/no questions; I don't think it is going to apply to this. "Is this article vandalized?" is a yes/no question...
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/21 Anthony wikim...@inbox.org: My God. If a few dozen people couldn't easily determine to a relatively high degree of certainty what portion of a mere 0.03% of Wikipedia's articles are *vandalized*, how useless is Wikipedia? I never said they couldn't. I said they couldn't do it by just looking at the most recent revision.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
2009/8/21 Anthony wikim...@inbox.org: "Is this article vandalized?" is a yes/no question... True, but that isn't actually the question that this research tried to answer. It tried to answer "How much time has this article spent in a vandalised state?" If we are only interested in whether the most recent revision is vandalised, then that is a simpler problem, but it would require a much larger sample to get the same quality of data.
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 4:37 PM, Anthony wikim...@inbox.org wrote: On Thu, Aug 20, 2009 at 7:13 PM, Robert Rohde raro...@gmail.com wrote: On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/20 Anthony wikim...@inbox.org: I wouldn't suggest looking at the edit history at all, just the most recent revision as of whatever moment in time is chosen. If vandalism is found, then and only then would one look through the edit history to find out when it was added. That only works if the article is very well referenced and you have all the references and are willing to fact-check everything. Otherwise you will miss subtle vandalism like changing the date of birth by a year. It's not just facts. There are many ways to degrade the quality of an article (such as removing entire sections) that would be invisible if one looks at only one revision. I guess that's true. People could be removing facts, for instance, which wouldn't be apparent by looking at one revision. So such an analysis would potentially understate actual vandalism. But at least we'd know in which direction the percentage is potentially wrong. And anecdotally, I don't think the understatement would be significant. You seem to be identifying all errors with vandalism. Sometimes factual errors are simply unintentional mistakes. I agree that accuracy is important, but I think you are thinking about the question somewhat differently than I am. snip I'm attempting to best answer the question if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision, which I take to have nothing whatsoever to do with the number of reverts. Let me describe the issue differently. The practical issue I am concerned with might be better expressed as the following: For any given article, what is the probability that the current revision is not the best available revision (i.e. most accurate, most complete, etc.)
Vandalism, in general, takes a page and makes it worse. I am interested in the problem of characterizing how often this happens with an eye to being able to go back to that prior better version. (This also explains why I am less interested in vandalism that persists through many revisions. Once that occurs, it makes less sense to try and go back to the pre-vandalized revision.) Your concern for establishing overall article accuracy is a good one, but it is largely orthogonal to my interest in figuring out whether the current revision is likely to be better or worse than the revisions that came before it. -Robert Rohde
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton thomas.dal...@gmail.com wrote: 2009/8/21 Anthony wikim...@inbox.org: "Is this article vandalized?" is a yes/no question... True, but that isn't actually the question that this research tried to answer. It tried to answer "How much time has this article spent in a vandalised state?" When one downloads a dump file, what percentage of the pages are actually in a vandalized state? This is equivalent to asking, if one chooses a random page from Wikipedia right now, what is the probability of receiving a vandalized revision? That's the question I was referring to. If we are only interested in whether the most recent revision is vandalised then that is a simpler problem but would require a much larger sample to get the same quality of data. How much larger? Do you know anything about this, or are you just guessing? The number of random samples needed for a high degree of confidence tends to be much, much less than most people suspect. That much I know. I found one problem with my use of http://www.raosoft.com/samplesize.html: I was specifying a margin of error of 5%. But that's an absolute margin of error. So if it were 0.2% vandalism, that'd be 0.2% plus or minus 5%. Obviously unacceptable. However, the response distribution would then be 0.2%. This would still require 7649 samples for 95% confidence plus or minus 0.1%. If the vandalism turned out to be more prevalent, though, and I suspect it would, we could for instance be 95% confident plus or minus 0.5% if the response distribution was 0.5% and we had 765 samples.
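The arithmetic behind these figures can be reproduced with the standard sample-size formula for estimating a proportion, plus a finite-population correction. (That this is what the Raosoft calculator computes is an assumption on my part, but it matches the numbers quoted.)

```python
import math

def sample_size(population, response_pct, margin_pct, z=1.96):
    """Sample size for estimating a proportion. `response_pct` is the
    expected proportion (Raosoft's "response distribution") and
    `margin_pct` the absolute margin of error, both in percent;
    z = 1.96 corresponds to 95% confidence."""
    p = response_pct / 100.0
    e = margin_pct / 100.0
    n0 = z * z * p * (1 - p) / (e * e)    # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)  # finite-population correction
    return math.ceil(n)

# the figures discussed above, for ~3 million articles:
n_strict = sample_size(3_000_000, response_pct=0.2, margin_pct=0.1)  # 7649
n_loose = sample_size(3_000_000, response_pct=0.5, margin_pct=0.5)   # 765
```

Note Anthony's point in the code: shrinking the expected proportion from 0.5% to 0.2% while tightening the margin from ±0.5% to ±0.1% multiplies the required sample tenfold.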
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 7:58 PM, Robert Rohde raro...@gmail.com wrote: You seem to be identifying all errors with vandalism. How so? Sometimes factual errors are simply unintentional mistakes. Obviously we can't know the intent of the person for sure, but after a mistake is found it's relatively simple to find where it was added and decide whether or not we are going to call it vandalism. This is an inherent problem with answering the question. If you can't determine it manually, you sure as hell won't be able to determine it using automated methods. Let me describe the issue differently. The practical issue I am concerned with might be better expressed as the following: For any given article, what is the probability that the current revision is not the best available revision (i.e. most accurate, most complete, etc.)? Vandalism, in general, takes a page and makes it worse. I am interested in the problem of characterizing how often this happens with an eye to being able to go back to that prior better version. (This also explains why I am less interested in vandalism that persists through many revisions. Once that occurs, it makes less sense to try and go back to the pre-vandalized revision.) *nod*. Yes, we certainly have different things we're interested in measuring. If someone vandalizes an article, say to change the population of a country from 3 million to 2.9 million, and then 20 other people improve the article without fixing that fact, I'd still count that as vandalized. On the other hand, are you sure you don't want to add an "indisputably" before "not the best available revision"? I mean, I'd say Wikipedia is probably in the double digit percentages, at least in terms of popular articles, if you don't add "indisputably".
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Riddle me this... Is the edit below vandalism? http://en.wikipedia.org/w/index.php?title=Arch_Coal&diff=255482597&oldid=255480884 Did the edit take a page and make it worse? Or did it make the page a better available revision than the version immediately prior to it? Methinks the Wikipedia community has a long way to go in learning to differentiate between a better encyclopedia and a worse encyclopedia before we take the step of trying to define vandalism. Then, after we've done all that, there might be some remaining value in trying to quantify vandalism as we've defined it. Until then, for God's sake, Sue Gardner, do not gleefully run off publicizing that only 0.4% of Wikipedia's articles are vandalized. Greg
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 14:10, Anthony wikim...@inbox.org wrote:
> On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawr...@gmail.com wrote:
>> My point (which might still be incorrect, of course) was that an
>> analysis based on 30,000 randomly selected pages was more informative
>> about the English Wikipedia than 100 articles about serving United
>> States Senators.
>
> Any automated method of finding vandalism is doomed to failure. I'd say
> its informativeness was precisely zero. Greg's analysis, on the other
> hand, was informative, but it was targeted at a much different question
> than Robert's:
>
>> if one chooses a random page from Wikipedia right now, what is the
>> probability of receiving a vandalized revision
>
> The best way to answer that question would be with a manually processed
> random sample taken from a pre-chosen moment in time. As few as 1000
> revisions would probably be sufficient, if I know anything about
> statistics, but I'll let someone with more knowledge of statistics
> verify or refute that. The results will depend heavily on one's
> definition of vandalism, though.

I did this in an informal fashion in 2005 during my hundred-article surveys. Of the 503 pages I looked at, only one was clearly vandalized the first time I looked at it, so I'd say a thousand samples is probably too small to get any sort of precision on the vandalism rate.

-- Mark Wagner [[User:Carnildo]]
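Mark's point about precision can be made concrete. If the true rate is around 0.4% (the figure from the thread title; the rest of the setup below is my own back-of-the-envelope sketch, not anything from the thread), a 1000-article sample yields only about four vandalized pages, so the relative standard error of the estimated rate is roughly 50% and the estimate could easily be off by a factor of two:

```python
import math

# Assumed true vandalism rate: the 0.4% figure from the thread title.
# The sample sizes are the ones discussed in this thread.

def relative_se(p: float, n: int) -> float:
    """Standard error of an estimated proportion, relative to the
    proportion itself (simple binomial sampling, no finite-population
    correction)."""
    return math.sqrt(p * (1 - p) / n) / p

p = 0.004
for n in (1000, 7649, 30000):
    print(f"n={n:>6}: expect ~{n * p:.0f} vandalized pages, "
          f"relative standard error ~{relative_se(p, n):.0%}")
```

At n = 1000 the relative error is about 50%; even at n = 30000 it is still around 9%, which is why estimating a rare event's rate precisely takes far more samples than merely detecting it.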
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 9:30 PM, Mark Wagner carni...@gmail.com wrote:
> On Thu, Aug 20, 2009 at 14:10, Anthony wikim...@inbox.org wrote:
>> if one chooses a random page from Wikipedia right now, what is the
>> probability of receiving a vandalized revision
>>
>> The best way to answer that question would be with a manually processed
>> random sample taken from a pre-chosen moment in time. As few as 1000
>> revisions would probably be sufficient, if I know anything about
>> statistics, but I'll let someone with more knowledge of statistics
>> verify or refute that. The results will depend heavily on one's
>> definition of vandalism, though.
>
> I did this in an informal fashion in 2005 during my hundred-article
> surveys. Of the 503 pages I looked at, only one was clearly vandalized
> the first time I looked at it, so I'd say a thousand samples is probably
> too small to get any sort of precision on the vandalism rate.

Why? My understanding is that, if your methodology was correct, you can say with 96% confidence that the percentage of vandalized articles is less than 0.6%. That's useful.

With 1000 samples, if you found two instances of vandalism, you'd have a 97% confidence that the percentage of vandalized articles is less than 0.5%. I don't think it's that low, but if you publish the details of your hundred-article surveys, I might be persuaded that it is.

If we really do have that figure to that level of assurance, a more useful statistic would be the percentage of pageviews that result in a vandalized article. That could be arrived at by weighting by pageviews while choosing the random sample.

One flaw I found in my proposed methodology is that the moment in time needs to be randomized, since certain times of the day/week/year might very well experience higher vandalism than others.

___ foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
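Confidence claims like these can be checked numerically. The sketch below (function names and tolerances are mine) computes an exact one-sided upper confidence bound in the Clopper-Pearson style, by bisection on the binomial CDF. The exact bounds land in the same ballpark as, though not precisely on, the rounded figures quoted in the thread: zero hits in 503 samples gives an upper bound of about 0.64% at 96% confidence, and the one observed hit pushes it to roughly 1%.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def upper_bound(k: int, n: int, alpha: float = 0.04) -> float:
    """One-sided exact upper confidence bound for the true rate: the
    largest p still consistent (at confidence 1 - alpha) with seeing at
    most k hits in n trials.  Found by bisection, since binom_cdf is
    strictly decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # 60 halvings is plenty for double precision
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid  # p too small: seeing <= k hits is still likely
        else:
            hi = mid
    return (lo + hi) / 2

# Mark's survey: 1 clearly vandalized page out of 503 sampled.
print(upper_bound(0, 503))  # bound if the survey had found nothing
print(upper_bound(1, 503))  # bound given the one observed hit
```

The same function answers the hypothetical in the message above: `upper_bound(2, 1000, 0.03)` gives the 97%-confidence bound for two hits in a thousand samples.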
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
Phil Nash wrote:
> Many editors undo and revert on the basis of felicity of language and
> emphasis, and unless it becomes an issue it is an epiphenomenon of the
> encyclopedia that anyone can edit, so I can't see how this is a good
> example of anything in particular.

And, with point proven, I rest my case.

Greg
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
And here is where many of the flaws of the University of Minnesota study were exposed:

http://chance.dartmouth.edu/chancewiki/index.php/Chance_News_31#The_Unbreakable_Wikipedia.3F

Their methodology of tracking the persistence of words was questionable, to say the least. And here was my favorite part:

*We exclude anonymous editors from some analyses, because IPs are not stable: multiple edits by the same human might be recorded under different IPs, and multiple humans can share an IP.*

So, in a study evaluating the damaged views within 34 trillion edits, they excluded the 9 trillion edits by IP addresses? If you're not laughing right now, then you must be new to Wikipedia.

Greg
Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles
On Thu, Aug 20, 2009 at 11:02 PM, Gregory Kohs thekoh...@gmail.com wrote:
> And here was my favorite part: *We exclude anonymous editors from some
> analyses, because IPs are not stable: multiple edits by the same human
> might be recorded under different IPs, and multiple humans can share an
> IP.*

I have to say that this one was better: "We believe it is reasonable to assume that essentially all damage is repaired within 15 revisions."

Talk about begging the question.