Hi Rawia,

The metawiki page you link to counts everything defined as a "content page" by that wiki. I believe the definition is that it has to be in the main namespace (so an article; not a discussion page, image file, etc.) and that it has to have at least one valid internal link (so it can't just be a wall of unformatted text). This will also exclude drafts, as Pine notes. Stubs *are* included in all definitions (which is a good thing, because our stub tracking is abysmal).
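For reference, the "content page" figure that the List of Wikipedias reflects can also be read per-wiki from the MediaWiki API's siteinfo statistics. A minimal sketch below; the endpoint and the "articles" field are the standard API, but the sample response is illustrative, not real data:

```python
def siteinfo_url(lang):
    """Build the siteinfo-statistics API URL for a wiki subdomain, e.g. 'en'."""
    return (f"https://{lang}.wikipedia.org/w/api.php"
            "?action=query&meta=siteinfo&siprop=statistics&format=json")

def article_count(siteinfo_json):
    """Extract the 'articles' figure (the content-page count) from a
    siteinfo response that has already been parsed from JSON."""
    return siteinfo_json["query"]["statistics"]["articles"]

# Illustrative response shape (real responses carry more fields):
sample = {"query": {"statistics": {"pages": 50000000, "articles": 6800000}}}
print(article_count(sample))  # → 6800000
```

Note that "articles" here is the stricter content-page definition, while "pages" counts everything, so the two can differ substantially.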
If you use http://stats.wikimedia.org/EN/Sitemap.htm you can get slightly outdated article counts (but with a long historic tail). This uses a definition of "article count" which is a little more generous, and counts all pages in the main namespace. It is probably a better one for your purposes, as it's less liable to change. There are currently 51 projects above the 100k threshold according to wikistats; this includes Simple English, Latin, Volapük and Esperanto, which you may not want to count! Some very small languages with large article counts may have a very high proportion of auto-generated articles - there's been some research done on this, but I can't immediately put my finger on it. See, e.g., this discussion: https://lists.wikimedia.org/pipermail/analytics/2015-January/003214.html

As for language codes, I believe any two-letter code is a valid ISO 639-1 code, and (almost?) all three-letter codes are valid ISO 639-2 codes. There are about a dozen others which will need to be mapped by hand. Note that Norwegian appears twice (nn, no).

Andrew.

On 21 January 2015 at 08:47, Abdel Samad, Rawia <[email protected]> wrote:

> Hello,
>
> I work for a consulting firm called Strategy&. We have been engaged by
> Facebook on behalf of Internet.org to conduct a study assessing the
> state of connectivity globally. One key area of focus is the availability
> of relevant online content. We are using the availability of encyclopedic
> knowledge in one's primary language as a proxy for relevant content. We
> define this as 100K+ Wikipedia articles in one's primary language. We have
> a few questions related to this analysis prior to publishing it:
>
> · We are currently using the article count by language based on
> the Wikimedia Foundation's public link:
> http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable
> source for article count - does it include stubs?
>
> · Is it possible to get historic data for article count?
> It would be great to monitor the evolution of the metric we have
> defined over time.
>
> · What are the biggest drivers you've seen for step changes in the
> number of articles (e.g., number of active admins, machine translation,
> etc.)?
>
> · We had to map Wikipedia language codes to the ISO 639-3 language
> codes in Ethnologue (the source we are using for primary language data).
> The two-letter code for a Wikipedia language in the "List of Wikipedias"
> sometimes, but not always, matches the ISO 639-1 code. Is there an easy
> way to do the mapping?
>
> Many Thanks,
>
> Rawia
>
> [image: Strategy& logo]
> *Formerly Booz & Company*
>
> *Rawia Abdel Samad*
> Direct: +9611985655 | Mobile: +97455153807
> Email: [email protected]
> www.strategyand.com

--
Andrew Gray
[email protected]
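The mapping rule Andrew describes (two-letter codes are ISO 639-1, three-letter codes are ISO 639-2/3, the rest are mapped by hand) can be sketched as follows. The override entries here are illustrative examples of hand-mapped codes, not a complete or authoritative table:

```python
# Hand-maintained exceptions for Wikipedia subdomain codes that are not
# themselves ISO codes. Entries below are examples only; the real list
# has roughly a dozen members and should be checked against ISO 639.
HAND_OVERRIDES = {
    "simple": "eng",       # Simple English has no ISO code of its own
    "zh-min-nan": "nan",   # Min Nan Chinese
    "be-tarask": "bel",    # alternative Belarusian orthography (assumed)
    "roa-rup": "rup",      # Aromanian
}

def wiki_code_to_iso(code):
    """Best-effort mapping from a Wikipedia subdomain code to an ISO 639 code."""
    if code in HAND_OVERRIDES:
        return HAND_OVERRIDES[code]
    if len(code) in (2, 3):
        # Assume two-letter codes are ISO 639-1 and three-letter codes
        # are ISO 639-2/3, per the rule of thumb above.
        return code
    return None  # flag for manual review

print(wiki_code_to_iso("fr"))      # → fr
print(wiki_code_to_iso("simple"))  # → eng
```

Anything that falls through to `None` (and the duplicate Norwegian codes nn/no) still needs a human decision.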
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
