Hi all,

I've got a question about the completeness of the clickstream dataset. I 
downloaded the dumps for 2018 from 
https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only). When I 
filter for the article pair "Climate change" and "Global warming" (either one 
being either prev or curr) for all of 2018, this is what I get: 

  prev           curr           type      n month  
  <chr>          <chr>          <chr> <dbl> <chr>  
1 Global_warming Climate_change link    755 2018-04
2 Global_warming Climate_change link    810 2018-05
3 Climate_change Global_warming link   3730 2018-05
4 Climate_change Global_warming link   3962 2018-09
5 Climate_change Global_warming link   5865 2018-11
6 Climate_change Global_warming link   5491 2018-12
7 Global_warming Climate_change link   2227 2018-12

The visit numbers seem plausible. But why is there no data for, e.g., January to 
March? And why is there data for both directions in May and December, but only 
for one direction in the other months? This seems implausible given the 
popularity of the articles.
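
A minimal Python sketch of the filter described above, not the code that
produced these tables. It assumes the monthly dumps are tab-separated files
with four columns (prev, curr, type, n) and no header row, saved locally
under names like clickstream-enwiki-2018-01.tsv.gz; the function names
pair_rows and scan_dumps are made up for illustration.

```python
import gzip
from pathlib import Path

PREFIX = "clickstream-enwiki-"   # assumed file naming, adjust if yours differs
SUFFIX = ".tsv.gz"


def pair_rows(lines, a, b, month):
    """Yield (prev, curr, type, n, month) for rows linking a and b, either direction."""
    wanted = {(a, b), (b, a)}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not have the four expected columns
        prev, curr, link_type, n = fields
        if (prev, curr) in wanted:
            yield prev, curr, link_type, int(n), month


def scan_dumps(directory, a, b):
    """Collect matching rows from every 2018 dump file found in `directory`."""
    rows = []
    for path in sorted(Path(directory).glob(PREFIX + "2018-*" + SUFFIX)):
        # The month label is taken from the file name, e.g. "2018-01".
        month = path.name[len(PREFIX):-len(SUFFIX)]
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            rows.extend(pair_rows(fh, a, b, month))
    return rows


if __name__ == "__main__":
    for row in scan_dumps(".", "Climate_change", "Global_warming"):
        print(*row, sep="\t")
```

Matching on the exact underscore-separated titles in both orders means a
month only shows up if the dump contains a row for that precise pair.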

Here's another example:

  prev          curr          type      n month  
  <chr>         <chr>         <chr> <dbl> <chr>  
1 Smog          Air_pollution link    140 2018-01
2 Air_pollution Smog          link     82 2018-02
3 Air_pollution Smog          link    295 2018-04
4 Air_pollution Smog          link    215 2018-05
5 Smog          Air_pollution link     85 2018-06
6 Air_pollution Smog          link    233 2018-07
7 Air_pollution Smog          link     45 2018-09
8 Smog          Air_pollution link     96 2018-10
9 Smog          Air_pollution link     90 2018-12

Am I missing something here?

Thanks in advance,
Simon
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
