Re: [Analytics] [Languages] [Wikimedia-l] Wikipedia article per speaker

Amir E. Aharoni Sun, 14 Jun 2015 08:05:51 -0700

Wonderful work, Miloš.

Some notes on edit count:
1. Some Wikipedias import all the versions of a translated article because
they believe that it's required for attribution (AFAIK it isn't). This, of
course, inflates the edit count in a completely artificial way, and sadly I
don't know how to filter this chaff.


2. Bot edits could probably be filtered out, but there are some very
different types of bots, and this should be taken into account when measuring
community success. Some bots just create articles (Waray, Swedish, Dutch).
Some fix interlanguage links (not any longer, but it was huge everywhere
before 2013). Some auto-fix spelling, and it's a sign of a healthy
community (Hebrew, Catalan, and some others). Some are smarter than
AbuseFilter at reverting vandalism, and that's also a good sign.

3. Some sysops delete revisions with vandalism, which could simply be
reverted. I don't know how prevalent it is. More generally, deleted
revisions could probably be counted in a useful way as part of this project.
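
To make note 2 concrete, here is a minimal sketch of counting human edits
separately from bot edits, with the bot edits grouped by type. The record
layout (`is_bot`, `bot_kind`) and the kind labels are assumptions made up
for this illustration; they are not fields of any real dump or API.

```python
from collections import Counter

# Hypothetical edit records; field names and kind labels are assumptions.
SAMPLE_EDITS = [
    {"user": "Alice", "is_bot": False, "bot_kind": None},
    {"user": "Bob", "is_bot": False, "bot_kind": None},
    {"user": "ArticleBot", "is_bot": True, "bot_kind": "article_creation"},
    {"user": "SpellBot", "is_bot": True, "bot_kind": "spelling_fix"},
    {"user": "SpellBot", "is_bot": True, "bot_kind": "spelling_fix"},
    {"user": "RevertBot", "is_bot": True, "bot_kind": "vandalism_revert"},
]


def summarize_edits(edits):
    """Return (human_edit_count, Counter of bot edits keyed by bot kind)."""
    human = 0
    by_kind = Counter()
    for edit in edits:
        if edit["is_bot"]:
            # Bots with no declared kind are grouped under "unknown".
            by_kind[edit["bot_kind"] or "unknown"] += 1
        else:
            human += 1
    return human, by_kind


human_count, bot_counts = summarize_edits(SAMPLE_EDITS)
print(human_count)
print(dict(bot_counts))
```

Keeping the per-type counts, rather than dropping all bot edits, leaves room
to treat spelling and anti-vandalism bots as a positive signal.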

On 14 June 2015 at 14:14, "Milos Rancic" <mill...@gmail.com> wrote:

> I started writing a longer email, but then realized that it's better
> to stick to the most important points, as everything is complex
> enough anyway. Thus, just the metrics and their applications, nothing
> else.
>
> While revisiting, a year ago, my idea from a few years back to open
> Wikipedia in 3000 more languages, I realized that we have a substantial
> problem. The most numerous communities have ~100 active (thus 5+
> edits/month) editors per million speakers. As my hypothesis was
> that we could have Wikipedias in languages spoken by more than 10,000
> people, that would mean that at best they would have 1 (one)
> active editor. Thus, something else has to be done... But before that,
> we have to gather data and get an idea of what that "something" is.
>
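
The arithmetic in the paragraph above can be checked with a small sketch.
The function name and its default rate are illustrative assumptions; the
only figures taken from the email are ~100 active editors per million
speakers and a language with 10,000 speakers.

```python
def expected_active_editors(speakers, editors_per_million=100):
    """Scale a per-million active-editor rate to a given speaker population."""
    return speakers * editors_per_million / 1_000_000


# A language with 10,000 speakers at ~100 active editors per million speakers.
print(expected_active_editors(10_000))  # 1.0
```

Even at double that rate, such a language would be expected to have only
about two active editors.
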
> My first idea -- somewhere between a desperate one and "we
> should try something" -- was to ask people from Wikimedia Estonia,
> Wikimedia Finland and Wikimedia UK to try to recruit as many new
> active users as possible on particular projects. The point is that
> Scottish Gaelic, Estonian and Finnish are among the top in active
> users per million speakers.
>
> A year later, Estonians are doing a very good job (others are good, as
> well). They are above 100 active users per million speakers and in
> a couple of years they could even reach a couple of hundred.
>
> But there is an obvious flaw in this kind of reasoning and I was
> aware of it from the beginning: it's about languages spoken in rich
> countries, while we'll be dealing with communities on the opposite
> end of wealth. However, at least it's possible to increase the relative
> number of active users in "ideal" situations, which means that ~100
> active users per million speakers is not any kind of realistic
> maximum.
>
> Thanks to the project Wiktionary meets Matica srpska, I am now getting
> more precise insights into Ethnologue data (don't ask me what the
> relation is; the explanation took a couple of paragraphs in the
> email I didn't send).
>
> So, a month ago or so I got the first data and the news was very
> good: more than 5000 languages won't die during the next 100 years.
> More than 2500 languages are in very good shape. That is, if we take
> for granted that Ethnologue's data are about languages.
>
> In the meantime, Sylvian mentioned on the Languages list that he is
> working on Kichwa Wikipedia. And he noted one important thing: if we
> are going to have Wikipedias in languages like Kichwa -- and that's
> likely the prototype for most of the languages which we will meet
> in the future -- we have to adapt to them, not impose unrealistic
> expectations on them. That's connected to the data, as I want to know
> what we could expect from them. (A note to self: literacy rate is a very
> important parameter, as well.)
>
> It is also important to be able to follow the development of a
> particular community numerically and give it know-how based on
> previous successful experiences.
>
> As we got more results from Ethnologue data, my ambitions grew. Of
> course I wanted to get the number of articles per speaker. I got an
> approximate correlation between Wikipedia editions and Ethnologue
> data. Yes, of course, I knew that there are Wikipedia editions with a
> lot of bot-generated articles. So, I've cut the data to languages with 5
> or more on the Ethnologue language vitality scale and with the condition
> that the language has to have native speakers, and I've got pretty sane
> results. Yes, Dutch and Swedish Wikipedias include a lot of
> bot-generated articles, but the number of articles in those languages
> is quite fine in comparison with the rest of the projects.
>
> There are a few arguments in favor of counting (even bot-generated) articles:
> * First, the most important flaw in analyzing such data is taking
> their synchrony, not the development. But synchrony is the starting
> point. By looking into development, we could monitor the number of new
> articles per month and we could easily conclude what's the normal
> state of the community and what's not.
> * Then, it doesn't take a lot of effort to create legitimate
> information on some topics by using bots. If the articles are
> legitimate, that gives us a clue about the capacity of a particular
> community to create articles and thus spread free knowledge.
> * For example, if organized properly, it's not hard to create sane
> articles based on English (or Spanish or whichever) Wikipedia
> templates about actors and movies. That means that English (or Spanish
> or whichever) Wikipedia raises the capacity of other Wikipedia editions,
> which is legitimate and quite relevant. It's relevant in the sense
> that we should particularly care about languages with a large number of
> L2 speakers and languages used as an international or regional lingua
> franca. Conversely, we could conclude which languages have the
> potential to create a lot of articles thanks to the fact that the
> speakers of that language are fluent in one of the big languages.
> That's also quite relevant for the "gross capacity" to share knowledge in
> their own language.
> * The number of possible articles will always rise. Even for
> bot-generated articles. (Take as an example newly discovered planets
> outside of our solar system. For monolinguals, it's relevant to have
> that kind of information in their native language.) Thus,
> possibilities will rise and it's important to monitor the capacities of
> the communities. Having a programmer raises capacity, obviously.
> Having a dexterous community member, capable of finding a programmer
> inside the movement willing to help create a bot, also counts.
>
> I've seen projects with a lot of edits and a disproportionately small
> number of articles. From my perspective, it's better to have more
> articles than to have a lot of rollbacks and a lot of talk. Although
> the community itself is our most important value, our main task is to
> create articles, not to argue. Besides, a lot of arguing could be a
> sign of bad community health.
>
> But there are many other possible indicators, which could work in
> most cases. For example, edit count. Looking at the edit counts of the
> first five projects by number of articles, we could easily conclude
> that the ranks are: (1) English, (2) German, (3) French, (4) Dutch, (5)
> Swedish, not (1) English, (2) Swedish, (3) Dutch, (4) German, (5)
> French. (By taking a look at the other Wikipedias, we could see that
> even Chinese in 15th place is stronger than the Swedish Wikipedia in
> 2nd.)
>
> Not counting English as the world's primary lingua franca, it's also
> interesting to see that edits per speaker are roughly 1.5 for German
> and French, but 0.6 for Russian. Danish is ~1.7, Polish is
> ~1.05, Serbian is ~1.2, but Japanese is ~0.4 and Swahili ~0.05. (I
> made approximations without a calculator, thus the error range is likely
> ±10% :) ) Thus, GDP/PPP per capita doesn't need to be that important
> a factor (in the sense "once you reach a particular GDP/PPP per capita,
> it's no longer an important factor"), while other things could be.
>
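
The two comparisons above (ranking by edit count rather than article count,
and edits per speaker) can be sketched as follows. The wiki names and all
figures in `WIKIS` are invented placeholders for illustration only; they are
not real statistics for any Wikipedia, and the helper names are assumptions.

```python
# Placeholder figures, invented for illustration; NOT real statistics.
WIKIS = {
    "wiki_a": {"articles": 2_000_000, "edits": 10_000_000, "speakers": 10_000_000},
    "wiki_b": {"articles": 500_000, "edits": 60_000_000, "speakers": 50_000_000},
    "wiki_c": {"articles": 100_000, "edits": 3_000_000, "speakers": 60_000_000},
}


def rank_by(wikis, key):
    """Return wiki names sorted by the given metric, largest first."""
    return sorted(wikis, key=lambda name: wikis[name][key], reverse=True)


def edits_per_speaker(stats):
    """Total edits divided by the number of speakers of the language."""
    return stats["edits"] / stats["speakers"]


print(rank_by(WIKIS, "articles"))  # ['wiki_a', 'wiki_b', 'wiki_c']
print(rank_by(WIKIS, "edits"))     # ['wiki_b', 'wiki_a', 'wiki_c']
for name, stats in WIKIS.items():
    print(name, round(edits_per_speaker(stats), 2))
```

With these made-up numbers, the wiki that leads on article count drops to
second place on edit count, which is the kind of reordering described above.
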
> It's also important to keep in mind that various data are likely
> exposing various issues. And every issue has to be analyzed from
> a socio-economic perspective (obviously, Japanese Wikipedia is not
> relatively weak for the same reason as the Russian or Swahili
> Wikipedias are).
>
> I will include as many parameters as possible in the future analysis.
> As I now have the number of speakers of a particular language per
> country, it is now possible to correlate economic development with
> a particular language.
>
> On Jun 13, 2015 09:38, "Federico Leva (Nemo)" <nemow...@gmail.com> wrote:
> >
> > Asaf Bartov, 13/06/2015 02:42:
> >>
> >> The (already existing) metric of active-editors-per-million-speakers is,
> >> it seems to me, a far more robust metric.  Erik Z.'s stats.wikimedia.org
> >> <http://stats.wikimedia.org> is offering that metric.
> >
> >
> > I personally agree on this in general, but Millosh is trying something
> > different in his current quest, i.e. content ingestion and content coverage
> > assessment, also for missing language subdomains. (By the way, I created
> > the category, please add stuff:
> > https://meta.wikimedia.org/wiki/Category:Content_coverage .)
> >
> > Mere article count tells us very little and he acknowledged it. As you
> > added analytics: maybe when https://phabricator.wikimedia.org/T44259 is
> > fixed we can also do fancy things like join various tables and count
> > (countable) articles above a minimum threshold of hits, or something like
> > that.
> >
> > Oh, and the total number of internal links in a wiki is also an
> > interesting metric in many cases: they're often a good indicator of how
> > curated a wiki globally is, while bot-created articles are often orphan.
> > (Locally there might be overlinking but that's rarely a wiki-wide issue.) I
> > don't remember how reliable the WikiStats numbers are, but they often give
> > a good clue already.
> >
> > Nemo
> >
> > _______________________________________________
> > Languages mailing list
> > langua...@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/languages
>

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics