Hi Rawia,

The metawiki page you link to counts everything defined as a "content page"
by that wiki. I believe the definition is that it has to be in the main
namespace (so an article; not a discussion page, image file, etc.) and that
it has to have at least one valid internal link (so it can't just be a wall
of unformatted text). This will also exclude drafts, as Pine notes. Stubs
*are* included in all definitions (which is a good thing, because our stub
tracking is abysmal).

If you use http://stats.wikimedia.org/EN/Sitemap.htm you can get slightly
outdated article counts (but with a long historic tail). This uses a
definition of "article count" which is a little more generous, and counts
all pages in the main namespace. It is probably a better one for your
purposes as it's less liable to change.
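
To make the difference between the two definitions concrete, here is a
minimal sketch in Python. The Page record, the function names and the sample
data are all hypothetical and only follow the definitions as I described them
above; this is not how either tool actually computes its figures.

```python
from dataclasses import dataclass

MAIN_NAMESPACE = 0  # MediaWiki numbers the main (article) namespace as 0


@dataclass
class Page:
    """Hypothetical page record, for illustration only."""
    title: str
    namespace: int
    internal_links: int


def is_content_page(page):
    # Stricter definition: main namespace AND at least one internal link.
    return page.namespace == MAIN_NAMESPACE and page.internal_links >= 1


def is_main_namespace_page(page):
    # More generous definition: any page in the main namespace.
    return page.namespace == MAIN_NAMESPACE


def count_articles(pages):
    """Return (strict_count, generous_count) for a list of pages."""
    strict = sum(1 for p in pages if is_content_page(p))
    generous = sum(1 for p in pages if is_main_namespace_page(p))
    return strict, generous


# Made-up sample: a linked stub, an unlinked wall of text, a talk page.
SAMPLE = [
    Page("Linked stub", 0, 3),
    Page("Wall of text", 0, 0),
    Page("Talk:Linked stub", 1, 5),
]
```

On the made-up sample, the stricter count skips the unlinked page while the
generous count includes it; the talk page is excluded by both.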

There are currently 51 projects above the 100k threshold according to
wikistats; this includes Simple English, Latin, Volapük and Esperanto,
which you may not want to count! Some very small languages with large
article counts may have a very high proportion of auto-generated articles -
there's been some research done on this but I can't immediately put my
finger on it. See, e.g., this discussion:
https://lists.wikimedia.org/pipermail/analytics/2015-January/003214.html
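
If you do decide to drop those four projects, the filtering step could look
something like the sketch below. The exclusion set uses the Wikipedia codes
for Simple English, Latin, Volapük and Esperanto; the function name and the
counts are placeholders I made up (the "xx" codes are not real projects), so
substitute the real figures from wikistats.

```python
THRESHOLD = 100_000

# Wikipedia codes for Simple English, Latin, Volapük and Esperanto.
EXCLUDED = {"simple", "la", "vo", "eo"}


def projects_above_threshold(counts, threshold=THRESHOLD, excluded=EXCLUDED):
    """Return sorted project codes at or above the threshold, minus exclusions."""
    return sorted(
        code
        for code, articles in counts.items()
        if articles >= threshold and code not in excluded
    )


# Placeholder numbers only; these are NOT real article counts.
PLACEHOLDER_COUNTS = {
    "xx1": 250_000,
    "xx2": 99_999,
    "eo": 200_000,
    "xx3": 100_000,
}
```

This treats "100K+" as "at least 100,000", so a project with exactly 100,000
articles is kept.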

As for language codes, I believe any two-letter code is a valid ISO 639-1
code, and (almost?) all three-letter codes are valid ISO 639-2. There are
about a dozen others which will need to be mapped by hand. Note that Norwegian
appears twice (nn, no).
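
A rough sketch of that mapping logic in Python, under some loud assumptions:
both tables are tiny illustrative samples rather than complete lists, the
function name is my own, and the three-letter pass-through assumes the code is
the same in ISO 639-3, which should be checked against the official tables.
The hand-maintained overrides are looked up first because some Wikipedia codes
look valid but mean something else (for example, the Alemannic Wikipedia uses
"als", which ISO 639-3 assigns to a different language).

```python
# Illustrative sample only; a real mapping needs the full ISO 639 tables.
ISO_639_1_TO_3 = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "nn": "nno",
    "no": "nor",
}

# Hand-maintained exceptions (assumed examples; verify each one before use).
MANUAL_OVERRIDES = {
    "simple": "eng",
    "zh-yue": "yue",
    "zh-min-nan": "nan",
    "als": "gsw",
}


def wiki_code_to_iso639_3(code):
    """Map a Wikipedia language code to an ISO 639-3 code, or None if unknown."""
    code = code.lower()
    if code in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[code]
    if len(code) == 2:
        # Two-letter codes are treated as ISO 639-1 and looked up in the table.
        return ISO_639_1_TO_3.get(code)
    if len(code) == 3 and code.isalpha():
        # Assumption: the three-letter code is shared with ISO 639-3.
        return code
    # Anything else needs a new entry in MANUAL_OVERRIDES.
    return None
```

Codes that come back as None are the ones to review and add to the override
table by hand.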

Andrew.


On 21 January 2015 at 08:47, Abdel Samad, Rawia <
[email protected]> wrote:

>  Hello,
>
>
>
> I work for a consulting firm called Strategy&. We have been engaged by
> Facebook on behalf of Internet.org to conduct a study on assessing the
> state of connectivity globally. One key area of focus is the availability
> of relevant online content. We are using the availability of encyclopedic
> knowledge in one’s primary language as a proxy for relevant content. We
> define this as 100K+ Wikipedia articles in one’s primary language. We have
> a few questions related to this analysis prior to publishing it:
>
> ·         We are currently using the article count by language based on
> Wikimedia’s foundation public link: Source:
> http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable
> source for article count – does it include stubs?
>
> ·         Is it possible to get historic data for article count? It would
> be great to monitor the evolution of the metric we have defined over time.
>
> ·         What are the biggest drivers you’ve seen for step change in the
> number of articles (e.g., number of active admins, machine translation,
> etc.)
>
> ·         We had to map Wikipedia language codes to ISO 639-3 language
> codes in Ethnologue (source we are using for primary language data). The
> two-letter language code for a Wikipedia language in the “List of Wikipedias”
> sometimes, but not always, matches the ISO 639-1 code. Is there an easy way
> to do the mapping?
>
>
>
> Many Thanks,
>
> Rawia
>
>
>
>
> *Formerly Booz & Company*
>
>
>
> *Rawia Abdel Samad*
>
> Direct: +9611985655 | Mobile: +97455153807
>
> Email: [email protected]
>
> www.strategyand.com
>
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


-- 
- Andrew Gray
  [email protected]