Hi Simon,

Thanks for reaching out :)
I ran a similar analysis on our cluster, using the same original files as those on dumps.wikimedia.org and Spark to speed up the computation. I got coherent results for both of the examples you gave:

Sum - count
date     Climate_change --> Global_warming  Global_warming --> Climate_change  Total Result
2017-11  3904                               950                                4854
2017-12  3549                               780                                4329
2018-01  4508                               1011                               5519
2018-02  3548                               998                                4546
2018-03  3462                               745                                4207
2018-04  3726                               755                                4481
2018-05  3730                               810                                4540
2018-06  2971                               862                                3833
2018-07  3500                               1602                               5102
2018-08  4546                               1644                               6190
2018-09  3962                               1472                               5434
2018-10  6155                               3048                               9203
2018-11  5865                               2617                               8482
2018-12  5491                               2227                               7718
2019-01  5774                               2911                               8685
2019-02  6311                               2845                               9156
2019-03  6858                               2514                               9372
2019-04  6824                               2199                               9023

Sum - count
date          Air_pollution --> Smog  Smog --> Air_pollution  Total Result
2017-11       82                      263                     345
2017-12       200                     184                     384
2018-01       65                      140                     205
2018-02       82                      98                      180
2018-03       418                     149                     567
2018-04       295                     137                     432
2018-05       215                     95                      310
2018-06       245                     85                      330
2018-07       233                     70                      303
2018-08       36                      62                      98
2018-09       45                      81                      126
2018-10       66                      96                      162
2018-11       128                     135                     263
2018-12       50                      90                      140
2019-01       68                      92                      160
2019-02       50                      68                      118
2019-03       49                      72                      121
2019-04       33                      51                      84
Total Result  2360                    1968                    4328

Maybe there is an issue in the way you process the data?

Best
Joseph

On Mon, May 13, 2019 at 3:38 PM Simon Munzert <[email protected]> wrote:

> Hi all,
>
> I've got a question on the completeness of the clickstream dataset. I
> downloaded the dumps for 2018 from
> https://dumps.wikimedia.org/other/clickstream/ (English Wikipedia only).
> When I filter for the article pair "Climate change" and "Global warming"
> (either one being either prev or curr) for all of 2018, this is what I get:
>
>   prev           curr           type      n month
>   <chr>          <chr>          <chr> <dbl> <chr>
> 1 Global_warming Climate_change link    755 2018-04
> 2 Global_warming Climate_change link    810 2018-05
> 3 Climate_change Global_warming link   3730 2018-05
> 4 Climate_change Global_warming link   3962 2018-09
> 5 Climate_change Global_warming link   5865 2018-11
> 6 Climate_change Global_warming link   5491 2018-12
> 7 Global_warming Climate_change link   2227 2018-12
>
> The visit numbers seem plausible. But why is there no data on, e.g.,
> January to March? And why is there data for both directions in May and
> December, but not for the others? This seems implausible given the
> popularity of the articles.
>
> Here's another example:
>
>   prev          curr          type      n month
>   <chr>         <chr>         <chr> <dbl> <chr>
> 1 Smog          Air_pollution link    140 2018-01
> 2 Air_pollution Smog          link     82 2018-02
> 3 Air_pollution Smog          link    295 2018-04
> 4 Air_pollution Smog          link    215 2018-05
> 5 Smog          Air_pollution link     85 2018-06
> 6 Air_pollution Smog          link    233 2018-07
> 7 Air_pollution Smog          link     45 2018-09
> 8 Smog          Air_pollution link     96 2018-10
> 9 Smog          Air_pollution link     90 2018-12
>
> Am I missing something here?
>
> Thanks in advance,
> Simon
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

-- 
Joseph Allemandou (joal) (he / him)
Sr Data Engineer
Wikimedia Foundation
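
For anyone who wants to check their own processing against the figures above, here is a minimal sketch of summing the counts for one article pair. It is not the Spark job used for the tables in this thread. It assumes the four tab-separated columns shown in Simon's output (prev, curr, type, n); the function name pair_counts and the first sample row are made up for illustration. Quote handling is switched off on the assumption that titles can contain literal quote characters, which is a common way for TSV readers to merge or drop rows silently. Whether that is what happened in Simon's case is not established here.

```python
import csv
import io


def pair_counts(lines, a, b):
    """Sum n for rows a -> b and b -> a in clickstream-style TSV rows."""
    totals = {(a, b): 0, (b, a): 0}
    # QUOTE_NONE: treat quote characters in titles as ordinary text,
    # so an unbalanced quote cannot swallow the rows that follow it.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows rather than guessing their layout
        prev, curr, _link_type, n = row
        if (prev, curr) in totals:
            totals[(prev, curr)] += int(n)
    return totals


# Hypothetical sample: the first row is invented to show an unbalanced quote;
# the other two rows use the 2018-05 counts quoted in this thread.
sample = io.StringIO(
    'Some_"quoted_title\tOther_page\tlink\t12\n'
    "Climate_change\tGlobal_warming\tlink\t3730\n"
    "Global_warming\tClimate_change\tlink\t810\n"
)
result = pair_counts(sample, "Climate_change", "Global_warming")
print(result)
```

On a real monthly dump, the same function could be fed a file object opened in text mode, for example via gzip.open(path, "rt", encoding="utf-8", newline=""), with path pointing at a locally downloaded file.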
