Hi Simon,
Thanks for reaching out :)

I ran a similar analysis on our cluster, using the same source files as the
ones published on dumps.wikimedia.org and Spark to speed up the computation.
I got consistent results for both of the examples you gave, with every month
present in both directions:

Sum of count, by month and direction:

date     Climate_change --> Global_warming  Global_warming --> Climate_change  Total Result
2017-11  3904                               950                                4854
2017-12  3549                               780                                4329
2018-01  4508                               1011                               5519
2018-02  3548                               998                                4546
2018-03  3462                               745                                4207
2018-04  3726                               755                                4481
2018-05  3730                               810                                4540
2018-06  2971                               862                                3833
2018-07  3500                               1602                               5102
2018-08  4546                               1644                               6190
2018-09  3962                               1472                               5434
2018-10  6155                               3048                               9203
2018-11  5865                               2617                               8482
2018-12  5491                               2227                               7718
2019-01  5774                               2911                               8685
2019-02  6311                               2845                               9156
2019-03  6858                               2514                               9372
2019-04  6824                               2199                               9023


Sum of count, by month and direction:

date          Air_pollution --> Smog  Smog --> Air_pollution  Total Result
2017-11       82                      263                     345
2017-12       200                     184                     384
2018-01       65                      140                     205
2018-02       82                      98                      180
2018-03       418                     149                     567
2018-04       295                     137                     432
2018-05       215                     95                      310
2018-06       245                     85                      330
2018-07       233                     70                      303
2018-08       36                      62                      98
2018-09       45                      81                      126
2018-10       66                      96                      162
2018-11       128                     135                     263
2018-12       50                      90                      140
2019-01       68                      92                      160
2019-02       50                      68                      118
2019-03       49                      72                      121
2019-04       33                      51                      84
Total Result  2360                    1968                    4328

So the dumps themselves look complete for these pairs. Maybe the gaps come
from the way you process the data?
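
For comparison, here is a minimal Python sketch of the per-month filter (a
plain-Python illustration, not the Spark job used for the tables above). It
assumes each monthly dump is a gzipped, tab-separated file with the four
columns prev, curr, type and n and no header row; the function names are
hypothetical. Quote handling is switched off while parsing, because article
titles can contain quote characters and a quote-aware parser may merge or
drop rows around them. Whether that explains the missing months in your
output is only a guess.

```python
import csv
import gzip
from collections import defaultdict


def pair_counts(lines, a, b):
    """Sum n for a --> b and b --> a over one month's clickstream rows.

    `lines` is any iterable of tab-separated text lines
    (assumed layout: prev, curr, type, n).
    """
    totals = defaultdict(int)
    # QUOTE_NONE: treat quote characters in titles as ordinary text
    # instead of letting the parser interpret them as field quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip rows that do not match the assumed layout
        prev, curr, _type, n = row
        if (prev, curr) in ((a, b), (b, a)):
            totals[(prev, curr)] += int(n)
    return dict(totals)


def pair_counts_from_file(path, a, b):
    """Same as pair_counts, reading a gzipped monthly dump from `path`."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        return pair_counts(fh, a, b)
```

Running pair_counts_from_file once per monthly file returns a dict keyed by
(prev, curr), so a month that is missing one direction shows up as a missing
key and can be compared against the tables above.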
Best
Joseph




On Mon, May 13, 2019 at 3:38 PM Simon Munzert <[email protected]>
wrote:

> Hi all,
>
> I've got a question on the completeness of the clickstream dataset. I
> downloaded the dumps for 2018 from
> https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only).
> When I filter for the article pair "Climate change" and "Global warming"
> (either one being either prev or curr) for all of 2018, this is what I get:
>
>   prev           curr           type      n month
>   <chr>          <chr>          <chr> <dbl> <chr>
> 1 Global_warming Climate_change link    755 2018-04
> 2 Global_warming Climate_change link    810 2018-05
> 3 Climate_change Global_warming link   3730 2018-05
> 4 Climate_change Global_warming link   3962 2018-09
> 5 Climate_change Global_warming link   5865 2018-11
> 6 Climate_change Global_warming link   5491 2018-12
> 7 Global_warming Climate_change link   2227 2018-12
>
> The visit numbers seem plausible. But why is there no data on, e.g.,
> January to March? And why is there data for both directions in May and
> December, but not for the others? This seems implausible given the
> popularity of the articles.
>
> Here's another example:
>
>   prev          curr          type      n month
>   <chr>         <chr>         <chr> <dbl> <chr>
> 1 Smog          Air_pollution link    140 2018-01
> 2 Air_pollution Smog          link     82 2018-02
> 3 Air_pollution Smog          link    295 2018-04
> 4 Air_pollution Smog          link    215 2018-05
> 5 Smog          Air_pollution link     85 2018-06
> 6 Air_pollution Smog          link    233 2018-07
> 7 Air_pollution Smog          link     45 2018-09
> 8 Smog          Air_pollution link     96 2018-10
> 9 Smog          Air_pollution link     90 2018-12
>
> Am I missing something here?
>
> Thanks in advance,
> Simon
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


-- 
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
