Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-21 Thread Thomas Dalton
2009/8/21 Anthony wikim...@inbox.org:
 If we are only interested in whether the most
 recent revision is vandalised then that is a simpler problem but would
 require a much larger sample to get the same quality of data.


 How much larger?  Do you know anything about this, or you're just guessing?
  The number of random samples needed for a high degree of confidence tends
 to be much much less than most people suspect.  That much I know.

I have a Master's degree in Mathematics, so I know a little about the
subject. (I didn't study much statistics, but you can't do 4 years of
Maths at Uni without getting some basic understanding of it.)

You say it requires 7649 articles, which sounds about right to me. If
we looked through the entire history (or just the last year or 6
months or something if you want just recent data) then we could do it
with significantly fewer articles. I'm not sure how many we would
need, though. I think we need to know what the distribution is for how
long a randomly chosen article spends in a vandalised state before we
can work out what the distribution of the average would be. My
statistics isn't good enough even to work out what kind of
distribution it is likely to be, let alone guess at the
parameters. It obviously ranges between 0% and 100% with the mean
somewhere close to 0% (0.4% seems like a good estimate) and will
presumably have a long tail (truncated at 100%) - there are articles
that spend their entire life in a vandalised state (attack pages, for
example) and there is a chance we'll completely miss such a page and
it will last the entire length of the survey period, so the
probability density at 100% won't be 0. I'm sure there is a
distribution that satisfies those requirements, but I don't know what
it is.
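
One family that would satisfy those constraints is a Beta distribution
with added point masses at 0 and 1 (a zero-one-inflated Beta). As a
minimal simulation sketch, with every parameter invented purely for
illustration rather than fitted to any Wikipedia data:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical mixture: most articles are never vandalised (mass at 0),
    # a tiny fraction spend their whole life vandalised (mass at 1, e.g.
    # attack pages), and the rest follow a Beta skewed hard toward 0.
    p_never, p_always = 0.90, 0.001
    a, b = 0.3, 9.7  # Beta mean a/(a+b) ~ 3%, so the overall mean is ~0.4%

    def sample_fraction(n):
        """Fraction of its life each of n simulated articles spends vandalised."""
        u = rng.uniform(size=n)
        x = rng.beta(a, b, size=n)
        x[u < p_never] = 0.0          # never vandalised
        x[u >= 1.0 - p_always] = 1.0  # vandalised for its entire life
        return x

    print(sample_fraction(100_000).mean())  # lands near 0.004, i.e. ~0.4%

The distribution of the sample average would then follow from the
central limit theorem once the sample is large enough, since the
underlying quantity is bounded.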



[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Robert Rohde
I am supposed to be taking a wiki-vacation to finish my PhD thesis and
find a job for next year.  However, this afternoon I decided to take a
break and consider an interesting question recently suggested to me by
someone else:

When one downloads a dump file, what percentage of the pages are
actually in a vandalized state?

This is equivalent to asking, if one chooses a random page from
Wikipedia right now, what is the probability of receiving a vandalized
revision?

Understanding what fraction of Wikipedia is vandalized at any given
instant is obviously of both practical and public relations interest.
In addition it bears on the motivation for certain development
projects like flagged revisions.  So, I decided to generate a rough
estimate.

For the purposes of making an estimate I used the main namespace of
the English Wikipedia and adopted the following operational
approximations:  I considered that "vandalism" is that thing which
gets reverted, and that "reverts" are those edits tagged with "revert",
"rv", "undo", "undid", etc. in the edit summary line.  Obviously, not all
vandalism is cleanly reverted, and not all reverts are cleanly tagged.
 In addition, some things flagged as reverts aren't really addressing
what we would conventionally consider to be vandalism.  Such caveats
notwithstanding, I have had some reasonable success with using a
revert heuristic in the past.  With the right keywords one can easily
catch the standardized comments created by admin rollback, the undo
function, the revert bots, various editing tools, and commonly used
phrases like "rv", "rvv", etc.  It won't be perfect, but it is a quick
way of getting an automated estimate.  I would usually expect the
answer I get in this way to be correct within an order of magnitude,
and perhaps within a factor of a few, though it is still just a crude
estimate.
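
As a minimal sketch of that kind of keyword matching on edit summaries
(the regular expression here is illustrative; it is not the exact
keyword set used for the estimate):

    import re

    # Illustrative keyword set; the actual list used in the study is not
    # reproduced here.
    REVERT_RE = re.compile(
        r"\b(revert(ed|ing)?|rv[vt]?|undid|undo|rollback|rolled back)\b",
        re.IGNORECASE)

    def looks_like_revert(edit_summary):
        """Heuristically flag an edit as a revert from its summary line."""
        return bool(REVERT_RE.search(edit_summary or ""))

    # looks_like_revert("Reverted edits by 1.2.3.4 to last version") -> True
    # looks_like_revert("rvv")                                       -> True
    # looks_like_revert("added references")                          -> False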

I analyzed the edit history up to the mid-June dump for a sample of
29,999 main namespace pages (sampling from everything in main
including redirects).  This included 1,333,829 edits, from which I
identified 102,926 episodes of reverted vandalism.  As a further
approximation, I assumed that whenever a revert occurred, it applied
to the immediately preceding edit and any additional consecutive
changes by the same editor (this is how admin rollback operates, but
is not necessarily true of tools like undo).

With those assumptions, I then used the timestamps on my identified
intervals of vandalism to figure out how much time each page had spent
in a vandalized state.  Over the entire history of Wikipedia, this
sample of pages was vandalized during 0.28% of its existence.  Or,
more relevantly, focusing on just this year vandalism was present
0.21% of the time, which suggests that one should expect 0.21% of
mainspace pages in any recent enwiki dump will be in a vandalized
state (i.e. 1 in 480).
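
As a sketch of the bookkeeping this implies (a reconstruction under the
stated assumptions, not the script actually used): sum the lengths of
the identified vandalised intervals and divide by the total
page-existence time.

    from datetime import datetime, timedelta

    def vandalized_fraction(pages):
        """pages: iterable of (created, end_of_history, intervals) tuples,
        where intervals is a list of (vandalised_from, reverted_at) pairs."""
        vandal, alive = timedelta(0), timedelta(0)
        for created, end, intervals in pages:
            alive += end - created
            for start, stop in intervals:
                vandal += stop - start
        return vandal / alive  # dividing timedeltas yields a plain float

    # Hypothetical page: 100 days old, vandalised once for 9.6 hours.
    page = (datetime(2009, 3, 1), datetime(2009, 6, 9),
            [(datetime(2009, 4, 1), datetime(2009, 4, 1, 9, 36))])
    print(vandalized_fraction([page]))  # -> 0.004, i.e. 0.4%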

(Note that since redirects represent 55% of the main namespace and are
rarely vandalized, one could argue that 0.37% [1 in 270] would be a
better estimate for the portion of actual articles that are in a
vandalized condition at any given moment.)

I also took a look at the time distribution of vandalism.  Not
surprisingly, it has a very long tail.  The median time to revert over
the entire history is 6.7 minutes, but the mean time to revert is 18.2
hours, and my sample included one revert going back 45 months (though
examples of such very long lags also imply the page had gone years
without any edits, which would imply an obscure topic that was also
almost never visited).  In the recent period these figures become 5.2
minutes and 14.4 hours for the median and mean respectively.  The
observation that nearly 50% of reverts occur in 5 minutes or
less is a testament to the efficient work of recent changes reviewers
and watchlists.

Unfortunately the 5% of vandalism that persists longer than 35 hours
is responsible for 90% of the actual vandalism a visitor is likely to
encounter at random.  Hence, as one might guess, it is the vandalism
that slips through and persists the longest that has the largest
practical effect.

It is also worth noting that the prevalence figures for February-May
of this year are slightly lower than at any time since 2006.  There is
also a drop in the mean duration of vandalism coupled with a slight
increase in the median duration.  However, these effects mostly
disappear if we limit our considerations to only vandalism events
lasting 1 month or shorter.  Hence those changes may be in significant
part linked to cut-off biasing from longer-term vandalism events that
have yet to be identified.  The ambiguity in the change from earlier
in the year is somewhat surprising as the AbuseFilter was launched in
March and was intended to decrease the burden of vandalism.  One might
speculate that the simple vandalism amenable to the AbuseFilter was
already being addressed quickly in nearly all cases and hence its
impact on the persistence of vandalism may already have been fairly
limited.

I've 

Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Sue Gardner
Robert, thanks for this.  I have long wanted that number: it is really 
interesting.

-Original Message-
From: Robert Rohde raro...@gmail.com

Date: Thu, 20 Aug 2009 03:06:06 
To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org;
English Wikipedia wikie...@lists.wikimedia.org
Cc: Sean Moss-Pultz s...@openmoko.com; s...@parc.com
Subject: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Maxwell
On Thu, Aug 20, 2009 at 6:06 AM, Robert Rohde raro...@gmail.com wrote:
[snip]
 When one downloads a dump file, what percentage of the pages are
 actually in a vandalized state?

Although you don't actually answer that question, you answer a
different question:

[snip]
 approximations:  I considered that vandalism is that thing which
 gets reverted, and that reverts are those edits tagged with revert,
 rv, undo, undid, etc. in the edit summary line.  Obviously, not all
 vandalism is cleanly reverted, and not all reverts are cleanly tagged.


Which is interesting too, but part of the problem with calling this a
measure of vandalism is that it isn't really, and we don't really have
a good handle on how solid an approximation it is beyond gut feelings
and arm-waving.

The study of Wikipedia activity is a new area of research, not
something that has been studied for decades. Not only do we not know
many things about Wikipedia, but we don't know many things about how
to know things about Wikipedia.


There must be ways to get a better understanding, but we may not know
of them and the ones we do know of are not always used. For example,
we could increase our confidence in this type of proxy-measure by
taking a random subset of that data and having humans classify it
based on some agreed-upon criteria. By performing the review
process many times we could get a handle on the typical error of both
the proxy-metric and the meta-review.
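
A sketch of what that could look like (the names and the sample size
are assumptions for illustration): draw a random subsample, have humans
label it, and report the proxy's precision and recall against those
labels.

    import random

    def validate_proxy(edits, proxy_label, human_label, k=500, seed=0):
        """Estimate how well a proxy metric tracks human judgment.

        edits: list of edit records; proxy_label, human_label: edit -> bool.
        Returns (precision, recall) of the proxy on k randomly chosen edits."""
        sample = random.Random(seed).sample(edits, k)
        tp = sum(1 for e in sample if proxy_label(e) and human_label(e))
        fp = sum(1 for e in sample if proxy_label(e) and not human_label(e))
        fn = sum(1 for e in sample if not proxy_label(e) and human_label(e))
        precision = tp / (tp + fp) if tp + fp else float("nan")
        recall = tp / (tp + fn) if tp + fn else float("nan")
        return precision, recall

Repeating the human pass with different reviewers would likewise expose
how much the meta-review itself disagrees.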

The risk here is that people will misunderstand these shorthand
metrics as the real deal, and the risk is increased when we encourage
it by using language which suggests that the simplistic understanding
is the correct one.  IMO, highly uncertain and/or outright wrong
information is worse than not knowing at all, when you aren't aware of
the reliability of the information.

We can't control how the press chooses to report on research, but when
we actively encourage misunderstandings by playing up the significance
or generality of our research our behaviour is unethical. Vigilance is
required.

This risk of misinformation is increased many-fold in comparative
analysis, where factors like time are plotted against indicators
because we often miss confounding variables
(http://en.wikipedia.org/wiki/Confounding).

Stepping away from your review for a moment, because it wasn't
primarily a comparative one, I'd like to point out some general
points:

For example, if research finds that edits are more frequently reverted
over time, is this because there has been a change in the revision
decision process, or have articles become better and more complete over
time, with edits to long and high-quality articles having always been
more likely to be reverted?   Both are probably true, but how does the
contribution break down?

There are many other possibly significant confounding variables.
Probably many more than any of us have thought of yet.

I've always been of the school of thought that we do research to
produce understanding, not just generate numbers. "Wikipedia becomes
more complete over time, leaving less work for new people to do" and
"Wikipedia is increasingly hostile towards new contributors" are
pretty different understandings, but both may be supported by the same
data, at least until you've controlled for many factors.

Another example— because of the scale of Wikipedia we must resort to
proxy-metrics. We can't directly measure vandalism, but we can measure
how often someone adds "is gay" over time. Proxy-metrics are powerful
tools but can be misleading.  If we're trying to automatically
identify vandalism for a study (either to include it or exclude it) we
have the risk that the vandals are adapting to automatic
identification:  If you were using "is gay" as a measure of vandalism
over time you might conclude that vandalism is decreasing when in
reality cluebot is performing the same kind of analysis for its
automatic vandalism suppression and the vandals have responded by
vandalizing in forms that can't be automatically identified, such as
by changing dates to incorrect values.

Or, keeping the goal of understanding in mind, sometimes the
measurements can all be right but a lack of care and consideration can
still cause people to draw the wrong conclusions.  For example,
English Wikipedia has adopted a much stronger policy about citations
in articles about living people than it once had. It is
*intentionally* more difficult to contribute to those articles,
especially for new contributors who do not know the rules, than it
once was.

Going back to your simple study now:  The analysis of vandalism
duration and its impact on readers makes an assumption about
readership which we know to be invalid. You're assuming a uniform
distribution of readership: That readers are just as likely to read
any random article. But we know that the actual readership follows a
power-law (long-tail) distribution. Because of the failure to consider
traffic levels we can't draw conclusions on how much vandalism readers
are actually exposed to.

Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Marco Chiesa
On Thu, Aug 20, 2009 at 12:06 PM, Robert Rohde raro...@gmail.com wrote:

 Given the nature of the approximations I made in doing this analysis I
 suspect it is more likely that I have somewhat underestimated the
 vandalism problem rather than overestimated it, but as I said in the
 beginning I'd like to believe I am in the right ballpark.  If that's
 true, I personally think that having less than 0.5% of Wikipedia be
 vandalized at any given instant is actually rather comforting.  It's
 not a perfect number, but it would suggest that nearly everyone still
 gets to see Wikipedia as intended rather than in a vandalized state.
 (Though to be fair I didn't try to figure out if the vandalism
 occurred in more frequently visited parts or not.)

Thanks for the excellent analysis, Robert. Just to give an idea of
what 0.4% means in practice, you can think in terms of one country, 12
US counties, 33 Italian municipalities, 147 French municipalities or 1
Pope.

Cruccone



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Jimmy Wales
Robert Rohde wrote:
 When one downloads a dump file, what percentage of the pages are
 actually in a vandalized state?
 
 This is equivalent to asking, if one chooses a random page from
 Wikipedia right now, what is the probability of receiving a vandalized
 revision?

Is there a possibility of re-running the numbers to include traffic 
weightings?

I would hypothesize from experience that if we adjust the random page 
selection to account for traffic (to get a better view of what people 
are actually seeing) we would see slightly different results.

I think we would see a lot less (percentagewise) vandalism that persists 
for a really long time for precisely the reason you identified: most 
vandalism that lasts a long time, lasts a long time because it is on 
obscure pages that no one is visiting.  That doesn't mean it is not a 
problem, but it does change some thinking about what kinds of tools are 
needed to deal with that problem.

I'm not sure what else would change.





Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Jimmy Wales
Gregory Maxwell wrote:
 If you were using "is gay" as a measure of vandalism
 over time you might conclude that vandalism is decreasing when in
 reality cluebot is performing the same kind of analysis for its
 automatic vandalism suppression and the vandals have responded by
 vandalizing in forms that can't be automatically identified, such as
 by changing dates to incorrect values.

And if that's true, that's on net a bad thing.  Most "is gay" vandalism
(not all) is just stupid and embarrassing, and it will be obvious to the
reader as vandalism, and lots of people get how Wikipedia works and are
reasonably tolerant of seeing that sort of thing from time to time.

But people expect that we should get the dates right, and they are right 
to ask that of us.

I understand that you're just making up a hypothetical, not saying that 
this is what is actually happening.  I'm just agreeing with this line of 
thinking that says, in essence, when we think about measuring 
vandalism, which is already hard enough, we also have to think about how 
damaging different kinds of vandalism actually are.


Greg, I think your email sounded a little negative at the start, but not 
so much further down.  I think you would join me heartily in being super 
grateful for people doing this kind of analysis.  Yes, some of it will 
be primitive and will suffer from the many difficulties.  But 
data-driven decisionmaking is a great thing, particularly when we are 
cognizant of the limitations of the data we're using.

I just didn't want anyone to get the idea (and I'm sure I'm reading you 
right) that you were opposed to people doing research. :-)

--Jimbo



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Kohs
While the time and effort that went into Robert Rohde's analysis is
certainly extensive, the outcomes are based on so many flawed assumptions
about the nature of vandalism and vandalism reversion that one publicizes
the key finding of a 0.4% vandalism rate at one's peril.

http://en.wikipedia.org/w/index.php?title=John_McCain&diff=169808394&oldid=169720853
11 hours
Reverted with no tags.

http://en.wikipedia.org/w/index.php?title=Maria_Cantwell&diff=prev&oldid=160400298
46 days
Reverted with note: Undid revision 160400298 by 75.133.82.218
By the way, there was a two-minute vandalism incident in the interim, so
in many cases, just because an analyst finds a recent and short incident,
he or she may be completely missing a longer-term one.

http://en.wikipedia.org/w/index.php?title=Ted_Stevens&diff=prev&oldid=170850508
There goes your "rvv" theory.  In this case, "rvv" was a flag for even more
preposterous vandalism.

The notion that these are lightly-watched or lightly-edited articles is a
bit difficult to swallow, since they are the biographical articles about
three United States senators.  These articles were analyzed by an
independent team of volunteers, and we found that the 100 senatorial
articles were in deliberate disrepair about 6.8% of the time, which
differs vastly from Rohde's result.  Certainly, one could argue that
articles about political figures may be vandalized more often, but one might
also counter that argument with the assumption that more eyes ought to be
watching these articles and repairing them.  More detail here:

http://www.mywikibiz.com/Wikipedia_Vandalism_Study

Admittedly, there were some minor flaws with our study's methodology, too.
These are reviewed on the Discussion page.  But, as with Rohde's assessment,
if anything, we may have underrepresented the problem at 6.8%.

I remain unimpressed with Wikipedia's accuracy rate, and I am bewildered why
flagged revisions have not been implemented yet.

Greg


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Nathan
On Thu, Aug 20, 2009 at 12:59 PM, Gregory Kohs thekoh...@gmail.com wrote:

 While the time and effort that went into Robert Rohde's analysis is
 certainly extensive, the outcomes are based on so many flawed assumptions
 about the nature of vandalism and vandalism reversion that one publicizes
 the key finding of a 0.4% vandalism rate at one's peril.


 http://en.wikipedia.org/w/index.php?title=John_McCain&diff=169808394&oldid=169720853
 11 hours
 Reverted with no tags.


The best part about that little exchange is:
http://en.wikipedia.org/w/index.php?title=John_McCain&diff=next&oldid=169906715

wherein a revert was made returning the vandalism, followed by another when
the editor noticed his error.

I don't think Robert made any firm conclusions on the meaning of his data;
he notes all the caveats that others have since emphasized, and admits to
likely underestimating vandalism. I read the 0.4% as representing the
approximate proportion of articles containing vandalism in an English Wikipedia
snapshot; that is quite different than the amount of time specific articles
stay in a vandalized state. Given the difficulty of accurately analyzing
this sort of data, no firm conclusions can be drawn; but certainly its more
informative than a Wikipedia Review analysis of a relatively small group of
articles in a specific topic area.

Nathan


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Maxwell
On Thu, Aug 20, 2009 at 12:46 PM, Jimmy Wales jwa...@wikia-inc.com wrote:
[snip]
 Greg, I think your email sounded a little negative at the start, but not
 so much further down.  I think you would join me heartily in being super
 grateful for people doing this kind of analysis.  Yes, some of it will
 be primitive and will suffer from the many difficulties.  But
 data-driven decisionmaking is a great thing, particularly when we are
 cognizant of the limitations of the data we're using.

 I just didn't want anyone to get the idea (and I'm sure I'm reading you
 right) that you were opposed to people doing research. :-)


Absolutely— No one who has done this kind of analysis could fail to
appreciate the enormous amount of work that goes into even making a
couple of simple, seemingly off-the-cuff numbers out of the mountain
of data that is Wikipedia.

Making sure the numbers are accurate and meaningful while also clearly
explaining the process of generating them is in and of itself a large
amount of work, and my gratitude is extended to anyone who contributes
to those processes.

I've long been a loud proponent of data driven decision making. So I'm
absolutely not opposed to people doing research, but just as you said—
we need to be acutely aware of the limitations of the research.  Weak
data is clearly better than no data, but only when you are aware of
the strength of the data.  Or, in other words, knowing what you don't
know is often *the most critical* piece of information in any decision
making process.

In our eagerness to establish what we can and do know it can be easy
to forget how much we don't know. Some of the limitations which are
all too obvious to researchers are less than obvious to people who've
never personally done quantitative analysis on Wikipedia data, yet
many of these people are the decision makers that must do something
useful with the data. The casual language used when researchers write
for researchers can magnify misunderstandings.  It was merely my
intent to caution against the related risks.

I think the most impactful contributions available for researchers
today lie less in the direct research itself than in advancing the
art of researching Wikipedia.  But the two go hand in hand: we can't
advance the art if we don't do the research.  The latter kind of work
is less sexy and not prone to generating headlines, but it will last
and generate citations for a long time.  Measurements of X today will
soon be forgotten as they are replaced by later analysis of the
historical data using superior techniques.

That my tone was somewhat negative is only due to my extreme
disappointment that our own discussion of recent measurements has
been almost entirely devoid of critical analysis. Contributors patting
themselves on the back and saying "I told you so!" seem to be
outnumbering suggestions that the research might mean something else
entirely, though perhaps that is my own bias speaking.   To the extent
that I'm wrong about that I hope that my comments were merely
redundant; to the extent that I'm right I hope my points will invite
nuanced understanding of the research and encourage people to seek out
and expose potentially confounding variables and bad proxies so that
all our knowledge can be advanced.

If this stuff were easy it would all be done already. Wikipedia
research is interesting because it is both hard and potentially
meaningful. There is room and need for contributions from everyone.

Cheers!



[Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Erik Zachte
There is another way to detect 100% reverts. It won't catch manual reverts
that are not 100% accurate but most vandal patrollers will use undo, and the
like.

 

For every revision calculate md5 checksum of content. Then you can easily
look back say 100 revisions to see whether this checksum occurred earlier.
It is efficient and unambiguous.

 

This will work for any Wikipedia for which a full archive dump is available.
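
A minimal sketch of that check over one page's revision history (an
illustration of the idea; a real run would stream revisions out of the
dump rather than hold them in memory):

    import hashlib

    def find_identity_reverts(revisions, window=100):
        """revisions: texts of one page's revisions, oldest first.
        Yields (i, j) when revision i is byte-identical to an earlier
        revision j within the window, i.e. a 100% revert back to j."""
        hashes = [hashlib.md5(t.encode("utf-8")).hexdigest()
                  for t in revisions]
        for i in range(2, len(hashes)):
            # i-1 identical to i would be a null edit, so start at i-2
            for j in range(i - 2, max(0, i - window) - 1, -1):
                if hashes[j] == hashes[i]:
                    yield i, j
                    break

Revisions j+1 through i-1 then form one candidate episode of reverted
edits.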


 

Erik Zachte

 



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Kohs
Nathan said:

"...but certainly its (sic) more informative than a Wikipedia Review
analysis of a relatively small group of articles in a specific topic area."

And you are certainly entitled to a flawed opinion based on incorrect
assumptions, such as ours being a "Wikipedia Review analysis".  But, nice
try at a red herring argument.

Greg


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Andrew Gray
2009/8/20 Erik Zachte erikzac...@infodisiac.com:
 There is another way to detect 100% reverts. It won't catch manual reverts
 that are not 100% accurate but most vandal patrollers will use undo, and the
 like.

 For every revision calculate md5 checksum of content. Then you can easily
 look back say 100 revisions to see whether this checksum occurred earlier.
 It is efficient and unambiguous.

A slightly less effective method would be to use the page size in
bytes; this won't give the precise one-to-one matching, but as I
believe it's already calculated in the data it might well be quicker.

One other false positive here: edit warring where one or both sides is
using undo/rollback. You'll get the impression of a lot of vandalism
without there necessarily being any.

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Brian
On Thu, Aug 20, 2009 at 11:23 AM, Erik Zachte erikzac...@infodisiac.com wrote:

 There is another way to detect 100% reverts. It won't catch manual reverts
 that are not 100% accurate but most vandal patrollers will use undo, and the
 like.



 For every revision calculate md5 checksum of content. Then you can easily
 look back say 100 revisions to see whether this checksum occurred earlier.
 It is efficient and unambiguous.



 This will work for any Wikipedia for which a full archive dump is
 available.




 Erik Zachte


Luca's WikiTrust could easily reveal this info.


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Nathan
On Thu, Aug 20, 2009 at 1:30 PM, Gregory Kohs thekoh...@gmail.com wrote:

 Nathan said:

 "...but certainly its (sic) more informative than a Wikipedia Review
 analysis of a relatively small group of articles in a specific topic area."

 And you are certainly entitled to a flawed opinion based on incorrect
 assumptions, such as ours being a "Wikipedia Review analysis".  But, nice
 try at a red herring argument.

 Greg


Well, you can understand where I would get that idea - since the URL you
provided had "Wikipedia Review members" performing the research, until you
changed it a few minutes ago.

http://www.mywikibiz.com/index.php?title=Wikipedia_Vandalism_Study&diff=90806&oldid=89479

My point (which might still be incorrect, of course) was that an analysis
based on 30,000 randomly selected pages was more informative about the
English Wikipedia than 100 articles about serving United States Senators.

Nathan


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Kohs
Apologies to Nathan regarding the Wikipedia Review description.  The
analysis team was, indeed, recruited via Wikipedia Review; however, almost
all of the participants in the research have now departed or reduced their
participation in Wikipedia Review to such a degree that I don't personally
consider it to have been a Wikipedia Review effort at all.  I allowed my
personal opinions to interfere with my recollection of the facts, though,
and that's not kosher.  Again, I hope you'll accept my apology.

I still maintain, however, that any study of the accuracy or the
vandalized nature of Wikipedia content will be far more reliable and
meaningful if human assessment is the underlying mechanism of analysis,
rather than a bot or script that will simply tally up things.  I think
that Rohde's design was inherently flawed, and I'm happy that Greg Maxwell
and I both immediately recognized the danger of running off and reporting
"the good news", as Sue Gardner was apparently ready to do immediately.

As I said, I feel that Rohde proceeded with research based on several highly
questionable assumptions, while the 100 Senators research rather carefully
outlined a research plan that carried very few assumptions, other than that
you trust the analysts to intelligently recognize vandalism or not.  Nathan,
by praising Rohde's work and disparaging my own, you seem to be suggesting
that you would prefer to live inside a giant mountain comprised of sticks
and twigs, rather than in a small, pleasantly furnished 12' x 12' room.  I
just don't understand that line of thinking.  I'd rather have a small bit of
reliable data based on a stable premise than a giant pile of data
based on an unstable premise.

Greg


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawr...@gmail.com wrote:

 My point (which might still be incorrect, of course) was that an analysis
 based on 30,000 randomly selected pages was more informative about the
 English Wikipedia than 100 articles about serving United States Senators.


Any automated method of finding vandalism is doomed to failure.  I'd say its
informativeness was precisely zero.

Greg's analysis, on the other hand, was informative, but it was targeted at
a much different question than Robert's.

"if one chooses a random page from Wikipedia right now, what is the
probability of receiving a vandalized revision?"  The best way to answer that
question would be with a manually processed random sample taken from a
pre-chosen moment in time.  As few as 1000 revisions would probably be
sufficient, if I know anything about statistics, but I'll let someone with
more knowledge of statistics verify or refute that.  The results will depend
heavily on one's definition of vandalism, though.

On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales jwa...@wikia-inc.com wrote:

 Is there a possibility of re-running the numbers to include traffic
 weightings?


definitely should be done


 I would hypothesize from experience that if we adjust the random page
 selection to account for traffic (to get a better view of what people
 are actually seeing) we would see slightly different results.


I think we'd see drastically different results.


 I think we would see a lot less (percentagewise) vandalism that persists
 for a really long time for precisely the reason you identified: most
 vandalism that lasts a long time, lasts a long time because it is on
 obscure pages that no one is visiting.


Agreed.  On the other hand, I think we'd also see that pages with more
traffic are more likely to be vandalized.

Of course, this assumes a valid methodology.  Using "admin rollback, the
undo function, the revert bots, various editing tools, and commonly used
phrases like rv, rvv, etc." to find vandalism is heavily skewed toward
vandalism that doesn't last very long (or at least doesn't last very many
edits).  It's basically useless.


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Andrew Gray
2009/8/20 Gregory Maxwell gmaxw...@gmail.com:

 Going back to your simple study now:  The analysis of vandalism
 duration and its impact on readers makes an assumption about
 readership which we know to be invalid. You're assuming a uniform
 distribution of readership: That readers are just as likely to read
 any random article. But we know that the actual readership follows a
 power-law (long-tail) distribution. Because of the failure to consider
 traffic levels we can't draw conclusions on how much vandalism readers
 are actually exposed to.

We're also assuming a uniform distribution of vandalism, as it were.
There are a number of different types of vandalism: obscene defacement,
malicious alteration of factual content, meaningless test edits of a
character or two, schoolkids leaving messages for each other...

...and it all has a different impact on the reader.

This has two implications:

a) It seems safe to assume that replacing the entire article with
"john is gay" is going to get spotted and reverted faster, on average,
than an edit providing a plausible-sounding but entirely fictional
history for a small town in Kansas. So, any change in the pattern of
the *content* of vandalism is going to lead to changes in the duration
and thus overall frequency of it, even if the amount of vandal edits
is constant.

b) We can easily compare the difference in effect for vandalism
left on differently trafficked pages for various times - roughly
speaking, time * traffic = number of readers affected. If some
vandalism is worse than others, we could thus also calculate some kind
of intensity metric - one hundred people viewing enormous genital
piercing images on [[Kitten]] is probably worse than ten thousand
people viewing "asdfdfggfh" at the end of a paragraph in the same
article.

I'm not sure how we'd go ahead with the second one, but it's an
interesting thing to think about.
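
A toy version of that arithmetic (the severity weights and all numbers
are invented, purely to mirror the example above):

    def exposure(duration_hours, views_per_hour, severity):
        """Readers affected, weighted by how damaging the vandalism is."""
        return duration_hours * views_per_hour * severity

    # Shock images seen by ~100 readers vs. gibberish seen by ~10,000:
    shock = exposure(duration_hours=1, views_per_hour=100, severity=10.0)
    gibberish = exposure(duration_hours=10, views_per_hour=1000, severity=0.1)
    print(shock, gibberish)  # 1000.0 1000.0 -- comparable "intensity"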

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Robert Rohde
On Thu, Aug 20, 2009 at 2:10 PM, Anthony wikim...@inbox.org wrote:
 On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawr...@gmail.com wrote:

 My point (which might still be incorrect, of course) was that an analysis
 based on 30,000 randomly selected pages was more informative about the
 English Wikipedia than 100 articles about serving United States Senators.


 Any automated method of finding vandalism is doomed to failure.  I'd say its
 informativeness was precisely zero.

 Greg's analysis, on the other hand, was informative, but it was targeted at
 a much different question than Robert's.

 "if one chooses a random page from Wikipedia right now, what is the
 probability of receiving a vandalized revision?"  The best way to answer that
 question would be with a manually processed random sample taken from a
 pre-chosen moment in time.  As few as 1000 revisions would probably be
 sufficient, if I know anything about statistics, but I'll let someone with
 more knowledge of statistics verify or refute that.  The results will depend
 heavily on one's definition of vandalism, though.

Only in dreadfully obvious cases can you look at a revision by itself
and know it contains vandalism.  If the goal is really to characterize
whether any vandalism has persisted in an article from any time in the
past, then one really needs to look at the full edit history to see
what has been changed / removed over time.

Even at the level of randomly sampling 1000 revisions, doing a real
evaluation of the full history is likely to be impractical for any
manual process.

If however you restrict yourself to asking whether 1000 edits
contributed vandalism, then you have a relatively manageable task, and
one that is more closely analogous to the technical program I set up.
If it helps one can think of what I did as trying to characterize
reverts and detect the persistence of new vandalism rather than
vandalism in general.  And of course, only new vandalism could be
fixed by an immediate rollback / revert anyway.

Qualitatively I tend to think that vandalism that has persisted
through many intervening revisions is in a rather different category
than new vandalism.  Since people rarely look at or are aware of an
article's ancient past, such persistent vandalism is at that point
little different than any other error in an article.  It is something
to be fixed, but you won't usually be able to recognize it as a
malicious act.

 On Thu, Aug 20, 2009 at 12:38 PM, Jimmy Wales jwa...@wikia-inc.com wrote:

 Is there a possibility of re-running the numbers to include traffic
 weightings?


 definitely should be done

Does anyone have a nice comprehensive set of page traffic aggregated
at say a month level?  The raw data used by stats.grok.se, etc. is
binned hourly which opens one up to issues of short-term fluctuations,
but I'm not at all interested in downloading 35 GB of hourly files
just to construct my own long-term averages.
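
For what it's worth, aggregating the hourly files into monthly totals
is mechanical, if tedious; a sketch, assuming the usual "project title
count bytes" line format of the raw pagecount files:

    import gzip
    from collections import Counter
    from pathlib import Path

    def monthly_counts(hourly_dir, project="en"):
        """Sum per-title view counts across a directory of hourly
        pagecounts-*.gz files into one monthly Counter."""
        totals = Counter()
        for path in sorted(Path(hourly_dir).glob("pagecounts-*.gz")):
            with gzip.open(path, "rt", encoding="utf-8",
                           errors="replace") as f:
                for line in f:
                    parts = line.split()
                    if len(parts) == 4 and parts[0] == project:
                        totals[parts[1]] += int(parts[2])
        return totals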

 I would hypothesize from experience that if we adjust the random page
 selection to account for traffic (to get a better view of what people
 are actually seeing) we would see slightly different results.


 I think we'd see drastically different results.

If I had to make a prediction, I'd expect one might see numerically
higher rates of vandalism and shorter average durations, but otherwise
qualitatively similar results given the same methodology.  I agree
though that it would be worth doing the experiment.

 I think we would see a lot less (percentagewise) vandalism that persists
 for a really long time for precisely the reason you identified: most
 vandalism that lasts a long time, lasts a long time because it is on
 obscure pages that no one is visiting.

 Agreed.  On the other hand, I think we'd also see that pages with more
 traffic are more likely to be vandalized.

 Of course, this assumes a valid methodology.  Using "admin rollback, the
 undo function, the revert bots, various editing tools, and commonly used
 phrases like rv, rvv, etc." to find vandalism is heavily skewed toward
 vandalism that doesn't last very long (or at least doesn't last very many
 edits).  It's basically useless.

Yes, as I acknowledged above, "new" vandalism.  My personal interest
is also skewed in that direction.  If you don't like it and don't find
it useful, feel free to ignore me and/or do your own analysis.
Vandalism that has persisted through many revisions is a qualitatively
different critter than most new vandalism.  It's usually hard to
identify, even by a manual process, and is unlikely to be fixed except
through the normal editorial process of review, fact-checking, and
revision.  When vandalism is "new" people are at least paying
attention to it in particular, and all vandalism starts out that way.
Perhaps it would be more useful if you think of this work as a
characterization of revert statistics?

Anyway, I provided my data point and described what I did so others
could judge it for themselves.  Regardless of your opinion, it
addressed an issue of interest to me, and I would hope others also
find some useful insight in it.

Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Thomas Dalton
2009/8/20 Jimmy Wales jwa...@wikia-inc.com:
 Robert Rohde wrote:
 When one downloads a dump file, what percentage of the pages are
 actually in a vandalized state?

 This is equivalent to asking, if one chooses a random page from
 Wikipedia right now, what is the probability of receiving a vandalized
 revision?

 Is there a possibility of re-running the numbers to include traffic
 weightings?

I'd like to see that data too. I'm sure you are right, vandalism
doesn't last as long on popular pages, but it would be very
interesting to see how much quicker it is reverted and how popular a
page needs to be for that to apply (or if it is a gradual
improvement).



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Alex
Robert Rohde wrote:
 
 Does anyone have a nice comprehensive set of page traffic aggregated
 at say a month level?  The raw data used by stats.grok.se, etc. is
 binned hourly which opens one up to issues of short-term fluctuations,
 but I'm not at all interested in downloading 35 GB of hourly files
 just to construct my own long-term averages.
 

I don't have every article, but I have the data for July 09 for ~600,000
pages on enwiki (mostly articles). It also has the hit counts for
redirects aggregated with the article, not sure if that would be more or
less useful for you. Let me know if you want it; it's in a MySQL table on
the toolserver right now.

-- 
Alex (wikipedia:en:User:Mr.Z-man)



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 6:36 PM, Robert Rohde raro...@gmail.com wrote:

 On Thu, Aug 20, 2009 at 2:10 PM, Anthony wikim...@inbox.org wrote:
  "if one chooses a random page from Wikipedia right now, what is the
  probability of receiving a vandalized revision?"  The best way to answer
 that
  question would be with a manually processed random sample taken from a
  pre-chosen moment in time.  As few as 1000 revisions would probably be
  sufficient, if I know anything about statistics, but I'll let someone
 with
  more knowledge of statistics verify or refute that.  The results will
 depend
  heavily on one's definition of vandalism, though.

 Only in dreadfully obvious cases can you look at a revision by itself
 and know it contains vandalism.  If the goal is really to characterize
 whether any vandalism has persisted in an article from any time in the
 past, then one really needs to look at the full edit history to see
 what has been changed / removed over time.


I wouldn't suggest looking at the edit history at all, just the most recent
revision as of whatever moment in time is chosen.  If vandalism is found,
then and only then would one look through the edit history to find out when
it was added.


  Of course, this assumes a valid methodology.  Using "admin rollback, the
  undo function, the revert bots, various editing tools, and commonly used
  phrases like rv, rvv, etc." to find vandalism is heavily skewed toward
  vandalism that doesn't last very long (or at least doesn't last very many
  edits).  It's basically useless.

 Yes, as I acknowledged above, "new" vandalism.


New vandalism which has not yet been reverted wouldn't be included.


 My personal interest
 is also skewed in that direction.  If you don't like it and don't find
 it useful, feel free to ignore me and/or do your own analysis.


I do.  I also feel free to criticize your methods publicly, since you
decided to share them publicly.


 Anyway, I provided my data point and described what I did so others
 could judge it for themselves.  Regardless of your opinion, it
 addressed an issue of interest to me, and I would hope others also
 find some useful insight in it.


And I presented my criticism, which hopefully other will find some useful
insight in as well.


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Thomas Dalton
2009/8/20 Anthony wikim...@inbox.org:
 I wouldn't suggest looking at the edit history at all, just the most recent
 revision as of whatever moment in time is chosen.  If vandalism is found,
 then and only then would one look through the edit history to find out when
 it was added.

That only works if the article is very well referenced and you have
all the references and are willing to fact-check everything. Otherwise
you will miss subtle vandalism like changing the date of birth by a
year.



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote:

 2009/8/20 Anthony wikim...@inbox.org:
  I wouldn't suggest looking at the edit history at all, just the most
 recent
  revision as of whatever moment in time is chosen.  If vandalism is found,
  then and only then would one look through the edit history to find out
 when
  it was added.

 That only works if the article is very well referenced and you have
 all the references and are willing to fact-check everything. Otherwise
 you will miss subtle vandalism like changing the date of birth by a
 year.


No need for the article to be referenced at all, but yes, it would be time
consuming, or at least person-time consuming.  On the other hand, it'd
answer the question, in a way that an automated process never could
(assuming I've got my statistical analysis right, anyway:
http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence
level for 664 random samples out of 3 million, but I'm not sure what
"response distribution" means).
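
For what it's worth, the "response distribution" on such calculators is
the assumed proportion p; the worst case p = 0.5 maximizes the required
sample.  A sketch of the standard formula, which does reproduce the 664
figure:

    from math import ceil

    def sample_size(z, p, e, population=None):
        """n = z^2 * p * (1 - p) / e^2, with an optional
        finite-population correction.
        z: z-score for the confidence level (2.576 for 99%),
        p: assumed proportion (the "response distribution"),
        e: margin of error."""
        n = (z ** 2) * p * (1 - p) / e ** 2
        if population:
            n = n / (1 + (n - 1) / population)
        return ceil(n)

    # 99% confidence, worst-case p = 0.5, 5% margin, population 3,000,000:
    print(sample_size(2.576, 0.5, 0.05, population=3_000_000))  # -> 664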


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Robert Rohde
On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dal...@gmail.com wrote:
 2009/8/20 Anthony wikim...@inbox.org:
 I wouldn't suggest looking at the edit history at all, just the most recent
 revision as of whatever moment in time is chosen.  If vandalism is found,
 then and only then would one look through the edit history to find out when
 it was added.

 That only works if the article is very well referenced and you have
 all the references and are willing to fact-check everything. Otherwise
 you will miss subtle vandalism like changing the date of birth by a
 year.

It's not just facts.  There are many ways to degrade the quality of an
article (such as removing entire sections) that would be invisible if
one looks at only one revision.

Anthony seems to be talking about a question of article accuracy
(unless I am misreading him).  That is an overlapping issue with
addressing vandalism, but there are a significant number of ways to
commit vandalism that nonetheless have nothing to do with impairing
the resulting article's accuracy.

-Robert Rohde



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Thomas Dalton
2009/8/21 Anthony wikim...@inbox.org:
 On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dal...@gmail.comwrote:

 2009/8/20 Anthony wikim...@inbox.org:
  I wouldn't suggest looking at the edit history at all, just the most
 recent
  revision as of whatever moment in time is chosen.  If vandalism is found,
  then and only then would one look through the edit history to find out
 when
  it was added.

 That only works if the article is very well referenced and you have
 all the references and are willing to fact-check everything. Otherwise
 you will miss subtle vandalism like changing the date of birth by a
 year.


 No need for the article to be referenced at all, but yes, it would be time
 consuming, or at least person-time consuming.

You mean you could go and find references for the information
yourself? I suppose you could, but that is completely impractical.

On the other hand, it'd
 answer the question, in a way that an automated process never could do
 (assuming I've got my statistical analysis right, anyway:
 http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence
 level for 664 random samples out of 3 million, but I'm not sure what
 "response distribution" means).

The site looks like it is for surveys made up of yes/no questions; I
don't think it is going to apply to this.



Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 7:13 PM, Robert Rohde raro...@gmail.com wrote:

 On Thu, Aug 20, 2009 at 3:57 PM, Thomas Dalton thomas.dal...@gmail.com
 wrote:
  2009/8/20 Anthony wikim...@inbox.org:
  I wouldn't suggest looking at the edit history at all, just the most
 recent
  revision as of whatever moment in time is chosen.  If vandalism is
 found,
  then and only then would one look through the edit history to find out
 when
  it was added.
 
  That only works if the article is very well referenced and you have
  all the references and are willing to fact-check everything. Otherwise
  you will miss subtle vandalism like changing the date of birth by a
  year.

 It's not just facts.  There are many ways to degrade the qualify of an
 article (such as removing entire sections) that would be invisible if
 one looks at only one revision.


I guess that's true.  People could be removing facts, for instance, which
wouldn't be apparent by looking at one revision.  So such an analysis
would potentially understate actual vandalism.  But at least we'd know in
which direction the percentage is potentially wrong.  And anecdotally, I
don't think the understatement would be significant.

There's also the question of whether or not we want to count an article
which had a fact removed a few years ago and never re-added to be a
vandalized revision.

Anthony seems to be talking about a question of article accuracy
 (unless I am misreading him).


I'm attempting to best answer the question "if one chooses a random page
from Wikipedia right now, what is the probability of receiving a
vandalized revision", which I take to have nothing whatsoever to do with
the number of reverts.


 That is an overlapping issue with
 addressing vandalism, but there are a significant number of ways to
 commit vandalism that nonetheless have nothing to do with impairing
 the resulting article's accuracy.


Significant number?  I can only think of a handful.


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 7:20 PM, Thomas Dalton thomas.dal...@gmail.com wrote:

 2009/8/21 Anthony wikim...@inbox.org:
  On Thu, Aug 20, 2009 at 6:57 PM, Thomas Dalton thomas.dal...@gmail.com
 wrote:
 
  2009/8/20 Anthony wikim...@inbox.org:
   I wouldn't suggest looking at the edit history at all, just the most
  recent
   revision as of whatever moment in time is chosen.  If vandalism is
 found,
   then and only then would one look through the edit history to find out
  when
   it was added.
 
  That only works if the article is very well referenced and you have
  all the references and are willing to fact-check everything. Otherwise
  you will miss subtle vandalism like changing the date of birth by a
  year.
 
 
  No need for the article to be referenced at all, but yes, it would be
 time
  consuming, or at least person-time consuming.

 You mean you could go and find references for the information
 yourself? I suppose you could, but that is completely impractical.


My God.  If a few dozen people couldn't easily determine to a relatively
high degree of certainty what portion of a mere 0.03% of Wikipedia's
articles are *vandalized*, how useless is Wikipedia?

On the other hand, it'd
  answer the question, in a way that an automated process never could do
  (assuming I've got my statistical analysis right, anyway:
  http://www.raosoft.com/samplesize.html seems to suggest a 99% confidence
  level for 664 random samples out of 3 million, but I'm not sure what
  "response distribution" means).

 The site looks like it is for surveys made up of yes/no questions; I
 don't think it is going to apply to this.


"Is this article vandalized?" is a yes/no question...
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Thomas Dalton
2009/8/21 Anthony wikim...@inbox.org:
 My God.  If a few dozen people couldn't easily determine to a relatively
 high degree of certainty what portion of a mere 0.03% of Wikipedia's
 articles are *vandalized*, how useless is Wikipedia?

I never said they couldn't. I said they couldn't do it by just looking
at the most recent revision.

___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Thomas Dalton
2009/8/21 Anthony wikim...@inbox.org:
 "Is this article vandalized?" is a yes/no question...

True, but that isn't actually the question that this research tried to
answer. It tried to answer "How much time has this article spent in a
vandalised state?". If we are only interested in whether the most
recent revision is vandalised then that is a simpler problem but would
require a much larger sample to get the same quality of data.

___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Robert Rohde
On Thu, Aug 20, 2009 at 4:37 PM, Anthonywikim...@inbox.org wrote:
 On Thu, Aug 20, 2009 at 7:13 PM, Robert Rohde raro...@gmail.com wrote:

 On Thu, Aug 20, 2009 at 3:57 PM, Thomas Daltonthomas.dal...@gmail.com
 wrote:
  2009/8/20 Anthony wikim...@inbox.org:
  I wouldn't suggest looking at the edit history at all, just the most
 recent
  revision as of whatever moment in time is chosen.  If vandalism is
 found,
  then and only then would one look through the edit history to find out
 when
  it was added.
 
  That only works if the article is very well referenced and you have
  all the references and are willing to fact-check everything. Otherwise
  you will miss subtle vandalism like changing the date of birth by a
  year.

 It's not just facts.  There are many ways to degrade the quality of an
 article (such as removing entire sections) that would be invisible if
 one looks at only one revision.


 I guess that's true.  People could be removing facts, for instance, which
 wouldn't be apparent by looking at one revision.  So such an analysis
 would potentially understate actual vandalism.  But at least we'd know in
 which direction the percentage is potentially wrong.  And anecdotally, I
 don't think the understatement would be significant.

You seem to be identifying all errors with vandalism.  Sometimes
factual errors are simply unintentional mistakes.  I agree that
accuracy is important, but I think you are thinking about the question
somewhat differently than I am.

snip

 I'm attempting to best answer the question "if one chooses a random page
 from Wikipedia right now, what is the probability of receiving a
 vandalized revision", which I take to have nothing whatsoever to do with
 the number of reverts.

Let me describe the issue differently.  The practical issue I am
concerned with might be better expressed as the following:  For any
given article, what is the probability that the current revision is
not the best available revision (i.e. most accurate, most complete,
 etc.)?  Vandalism, in general, takes a page and makes it worse.  I am
interested in the problem of characterizing how often this happens
with an eye to being able to go back to that prior better version.
(This also explains why I am less interested in vandalism that
persists through many revisions.  Once that occurs, it makes less
sense to try and go back to the pre-vandalized revision.)

Your concern for establishing overall article accuracy is a good one,
but it is largely orthogonal to my interest in figuring out whether
the current revision is likely to be better or worse than the
revisions that came before it.

-Robert Rohde

___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 7:54 PM, Thomas Dalton thomas.dal...@gmail.comwrote:

 2009/8/21 Anthony wikim...@inbox.org:
  "Is this article vandalized?" is a yes/no question...

 True, but that isn't actually the question that this research tried to
 answer. It tried to answer "How much time has this article spent in a
 vandalised state?".


"When one downloads a dump file, what percentage of the pages are
actually in a vandalized state?

This is equivalent to asking, if one chooses a random page from Wikipedia
right now, what is the probability of receiving a vandalized revision?"

That's the question I was referring to.


 If we are only interested in whether the most
 recent revision is vandalised then that is a simpler problem but would
 require a much larger sample to get the same quality of data.


How much larger?  Do you know anything about this, or are you just
guessing?  The number of random samples needed for a high degree of
confidence tends to be much, much less than most people suspect.  That
much I know.

I found one problem with my use of http://www.raosoft.com/samplesize.html:
I was specifying a margin of error of 5%.  But that's an absolute margin
of error.  So if it were 0.2% vandalism, that'd be 0.2% plus or minus 5%.
Obviously unacceptable.

However, the response distribution would then be 0.2%.  That would still
require 7649 samples for 95% confidence plus or minus 0.1%.  If the
vandalism turned out to be more prevalent, though (and I suspect it
would), we could, for instance, be 95% confident plus or minus 0.5% if
the response distribution was 0.5% and we had 765 samples.
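
(For anyone who wants to check those numbers without the web form: the
usual normal-approximation formula with a finite-population correction,
which appears to be what the Raosoft calculator implements, reproduces
them.  A sketch in Python; the function and parameter names are mine,
not Raosoft's.)

    from math import ceil
    from statistics import NormalDist

    def sample_size(p, moe, confidence=0.95, population=3_000_000):
        # n0 = z^2 * p * (1 - p) / moe^2, then a finite-population
        # correction.  p is the assumed "response distribution" and
        # moe is the absolute margin of error, both as fractions.
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
        n0 = z ** 2 * p * (1 - p) / moe ** 2
        return ceil(n0 / (1 + (n0 - 1) / population))

    print(sample_size(0.5, 0.05, confidence=0.99))  # 664, as quoted earlier
    print(sample_size(0.002, 0.001))                # 7649
    print(sample_size(0.005, 0.005))                # 765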
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 7:58 PM, Robert Rohde raro...@gmail.com wrote:

 You seem to be identifying all errors with vandalism.


How so?


 Sometimes factual errors are simply unintentional mistakes.


Obviously we can't know the intent of the person for sure, but after a
mistake is found it's relatively simple to find where it was added and
decide whether or not we are going to call it vandalism.  This is an
inherent problem with answering the question.  If you can't determine it
manually, you sure as hell won't be able to determine it using automated
methods.


 Let me describe the issue differently.  The practical issue I am
 concerned with might be better expressed as the following:  For any
 given article, what is the probability that the current revision is
 not the best available revision (i.e. most accurate, most complete,
 etc.)?  Vandalism, in general, takes a page and makes it worse.  I am
 interested in the problem of characterizing how often this happens
 with an eye to being able to go back to that prior better version.
 (This also explains why I am less interested in vandalism that
 persists through many revisions.  Once that occurs, it makes less
 sense to try and go back to the pre-vandalized revision.)


*nod*.  Yes, we certainly have different things we're interested in
measuring.  If someone vandalizes an article, say to change the population
of a country from 3 million to 2.9 million, and then 20 other people improve
the article without fixing that fact, I'd still count that as vandalized.

On the other hand, are you sure you don't want to add an "indisputably"
before "not the best available revision"?  I mean, I'd say Wikipedia is
probably in the double-digit percentages, at least in terms of popular
articles, if you don't add "indisputably".
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Kohs
Riddle me this...

Is the edit below vandalism?

http://en.wikipedia.org/w/index.php?title=Arch_Coal&diff=255482597&oldid=255480884

Did the edit take a page and make it worse?  Or, did it make the page a
better available revision than the version immediately prior to it?

Methinks the Wikipedia community has a long way to go in learning to
differentiate between a better encyclopedia and a worse encyclopedia
before we take the step to try to define vandalism.  Then, after we've done
all that, there might be some remaining value in trying to quantify
vandalism, as we've defined it.

Until then, for God's sake, Sue Gardner, do not gleefully run off
publicizing that only 0.4% of Wikipedia's articles are vandalized.

Greg
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Mark Wagner
On Thu, Aug 20, 2009 at 14:10, Anthonywikim...@inbox.org wrote:
 On Thu, Aug 20, 2009 at 1:55 PM, Nathan nawr...@gmail.com wrote:

 My point (which might still be incorrect, of course) was that an analysis
 based on 30,000 randomly selected pages was more informative about the
 English Wikipedia than 100 articles about serving United States Senators.


 Any automated method of finding vandalism is doomed to failure.  I'd say its
 informativeness was precisely zero.

 Greg's analysis, on the other hand, was informative, but it was targeted at
 a much different question than Robert's.

 "if one chooses a random page from Wikipedia right now, what is the
 probability of receiving a vandalized revision?"  The best way to answer that
 question would be with a manually processed random sample taken from a
 pre-chosen moment in time.  As few as 1000 revisions would probably be
 sufficient, if I know anything about statistics, but I'll let someone with
 more knowledge of statistics verify or refute that.  The results will depend
 heavily on one's definition of vandalism, though.

I did this in an informal fashion in 2005 during my hundred article
surveys.  Of the 503 pages I looked at, only one was clearly
vandalized the first time I looked at it, so I'd say a thousand
samples is probably too small to get any sort of precision on the
vandalism rate.

-- 
Mark Wagner
[[User:Carnildo]]

___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 9:30 PM, Mark Wagner carni...@gmail.com wrote:

 On Thu, Aug 20, 2009 at 14:10, Anthonywikim...@inbox.org wrote:
  "if one chooses a random page from Wikipedia right now, what is the
  probability of receiving a vandalized revision?"  The best way to answer
 that
  question would be with a manually processed random sample taken from a
  pre-chosen moment in time.  As few as 1000 revisions would probably be
  sufficient, if I know anything about statistics, but I'll let someone
 with
  more knowledge of statistics verify or refute that.  The results will
 depend
  heavily on one's definition of vandalism, though.

 I did this in an informal fashion in 2005 during my hundred article
 surveys.  Of the 503 pages I looked at, only one was clearly
 vandalized the first time I looked at it, so I'd say a thousand
 samples is probably too small to get any sort of precision on the
 vandalism rate.


Why?  My understanding is that, if your methodology was correct, you can say
with 96% confidence that the percentage of vandalized articles is less than
0.6%.  That's useful.  With 1000 samples, if you found two instances of
vandalism, you'd have a 97% confidence that the percentage of vandalized
articles is less than 0.5%.
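
(The exact binomial tail is easy to check, for anyone who wants to redo
that arithmetic rather than trust the rounding.  A sketch, assuming the
one-clearly-vandalized-out-of-503 figure above; the exact calculation
comes out somewhat more conservative than the rounded percentages here.)

    from math import comb

    def binom_cdf(k, n, p):
        # P(X <= k) for X ~ Binomial(n, p)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    def upper_bound(k, n, confidence):
        # Smallest p with P(X <= k | n, p) <= 1 - confidence, found by
        # bisection: a one-sided Clopper-Pearson-style bound.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > 1 - confidence:
                lo = mid
            else:
                hi = mid
        return hi

    print(1 - binom_cdf(1, 503, 0.006))   # ~0.80: confidence that p < 0.6%
    print(1 - binom_cdf(2, 1000, 0.005))  # ~0.88: confidence that p < 0.5%
    print(upper_bound(1, 503, 0.95))      # ~0.0094: 95% upper bound on p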

I don't think it's that low, but if you publish the details of your hundred
article surveys, I might be persuaded that it is.

If we really do have that figure to that level of assurance, a more useful
statistic would be the percentage of pageviews that result in a vandalized
article.  That could be arrived at by weighting by pageviews while choosing
your random sample.

One flaw I found in my proposed methodology is that the moment in time
needs to be randomized, since certain times of the day/week/year might very
well experience higher vandalism than others.
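
(A sketch of how that sampling could work.  The article list, pageview
counts, and survey window below are all made-up placeholders, not a real
API; the point is just the pageview weighting and the per-draw random
timestamp.)

    import random
    from datetime import datetime, timedelta

    # Placeholder data: (article title, pageviews over some recent period).
    articles = [("Foo", 120_000), ("Bar", 35_000), ("Baz", 900)]

    def draw_sample(k, start, end):
        # Weight the draw by pageviews, so the estimate answers "what
        # fraction of page views hit a vandalized revision", and give
        # each draw its own random moment so that time-of-day/week/year
        # effects average out.
        titles = [t for t, _ in articles]
        weights = [v for _, v in articles]
        span = (end - start).total_seconds()
        return [
            (title, start + timedelta(seconds=random.uniform(0, span)))
            for title in random.choices(titles, weights=weights, k=k)
        ]

    # Each (title, timestamp) pair would then be checked by hand: fetch
    # the revision current at that timestamp and judge it.
    sample = draw_sample(1000, datetime(2009, 1, 1), datetime(2009, 7, 1))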
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Kohs
Phil Nash wrote:

Many editors undo and revert on the basis of felicity of language and
emphasis, and unless it becomes an issue, that is an epiphenomenon of the
encyclopedia that anyone can edit.  So I can't see how this is a good
example of anything in particular.

And, with point proven, I rest my case.

Greg
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Gregory Kohs
And here is where many of the flaws of the University of Minnesota study
were exposed:

http://chance.dartmouth.edu/chancewiki/index.php/Chance_News_31#The_Unbreakable_Wikipedia.3F

Their methodology of tracking the persistence of words was questionable, to
say the least.

And here was my favorite part:

*We exclude anonymous editors from some analyses, because IPs are not
stable: multiple edits by the same human might be recorded under different
IPs, and multiple humans can share an IP.*

So, in a study evaluating the damaged views within 34 trillion edits, they
excluded the 9 trillion edits by IP addresses?  If you're not laughing right
now, then you must be new to Wikipedia.

Greg
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l


Re: [Foundation-l] How much of Wikipedia is vandalized? 0.4% of Articles

2009-08-20 Thread Anthony
On Thu, Aug 20, 2009 at 11:02 PM, Gregory Kohs thekoh...@gmail.com wrote:

 And here was my favorite part:

 *We exclude anonymous editors from some analyses, because IPs are not
 stable: multiple edits by the same human might be recorded under different
 IPs, and multiple humans can share an IP.*


I have to say that this one was better:  "We believe it is reasonable to
assume that essentially all damage is repaired within 15 revisions."  Talk
about begging the question.
___
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l