Thank you for your email and thoughtful analysis. I just wanted to say I saw it but got buried with other work. I'll try to reply early next week.
On Thu, Mar 11, 2021 at 03:50 Ogier Maitre <[email protected]> wrote:
> Hello everybody,
>
> We are currently working on a Wikipedia visualisation tool (which is
> presented here: http://www.wikimaps.io/). We use several pageview
> statistics to generate time series for each page from 2008 to 2020 (we use
> pagecounts, pageviews and pageview_complete). This last format is great for
> our work compared to the previous formats, and we use it for our data from
> 2016 to 2020. (Thanks to the analytics team for that.)
>
> We aggregate redirects as one page, identified by the page_id (as is
> done in the pageview_complete files).
> But when we compare with the Wikimedia API, we see some small
> differences.
>
> I think this problem comes from the fact that the Wikimedia API (and
> pageviews.toolforge.org) uses page_title to get the time series, and I
> saw that pageview_complete files contain entries where the page_title is
> missing (replaced by a "-"). As we are using page_id to do the aggregation
> whenever possible, we aggregate these "-" entries, but
> pageviews.toolforge.org probably does not.
>
> For example, for the page Barack_Obama in French and the file
> `pageviews-20200112-user.bz2`, I get several relevant entries.
>
> fr.wikipedia - 167398 mobile-web 1 B1
> fr.wikipedia Barack 167398 mobile-web 1 X1
> fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
> fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
> fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
> fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
> fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
> fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
> fr.wikipedia Obama 167398 mobile-web 2 R1V1
> fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
> fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
> fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
>
> fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
>
> That is 12 entries that use the page_id, and one that does not.
>
> I have two questions about that result.
>
> What kind of query can cause these "-" entries?
> Why does the entry "Barack_Obama mobile-app" appear twice?
>
> Sorry for the long introduction and thank you for your time.
>
> Regards,
> Ogier
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
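For anyone following the thread, here is a rough Python sketch of the two aggregations being compared in the quoted message: summing daily totals by page_id versus summing by page_title. It is only an illustration under stated assumptions, not how the Wikimedia API or pageviews.toolforge.org actually computes its numbers. The function names (parse_row, decode_hourly, aggregate) are hypothetical. It assumes rows are whitespace-separated, that a row with five fields instead of six is missing its page_id, and that the hourly string uses letters A to X for hours 0 to 23.

```python
import re
from collections import defaultdict

# The thirteen rows quoted in the message, rejoined onto single lines.
SAMPLE_ROWS = """\
fr.wikipedia - 167398 mobile-web 1 B1
fr.wikipedia Barack 167398 mobile-web 1 X1
fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
fr.wikipedia Obama 167398 mobile-web 2 R1V1
fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
""".splitlines()


def parse_row(line):
    """Split one row into named fields; page_id is None when the field is absent."""
    parts = line.split()
    if len(parts) == 6:
        wiki, title, page_id, access, daily, hourly = parts
    elif len(parts) == 5:
        # Assumption: a five-field row is a row without a page_id.
        wiki, title, access, daily, hourly = parts
        page_id = None
    else:
        raise ValueError(f"unexpected number of fields: {line!r}")
    return {"wiki": wiki, "title": title, "page_id": page_id,
            "access": access, "daily": int(daily), "hourly": hourly}


def decode_hourly(hourly):
    """Decode e.g. 'Q1R2' into {16: 1, 17: 2} (assumes A..X map to hours 0..23)."""
    return {ord(letter) - ord("A"): int(count)
            for letter, count in re.findall(r"([A-X])(\d+)", hourly)}


def aggregate(lines):
    """Return (totals by page_id, totals by title, rows that have no page_id)."""
    by_id = defaultdict(int)
    by_title = defaultdict(int)
    no_id = []
    for line in lines:
        row = parse_row(line)
        by_title[(row["wiki"], row["title"])] += row["daily"]
        if row["page_id"] is None:
            no_id.append(row)
        else:
            by_id[(row["wiki"], row["page_id"])] += row["daily"]
    return by_id, by_title, no_id


if __name__ == "__main__":
    by_id, by_title, no_id = aggregate(SAMPLE_ROWS)
    print("by page_id:", dict(by_id))
    print("by title  :", by_title[("fr.wikipedia", "Barack_Obama")])
    print("rows without page_id:", len(no_id))
```

On the thirteen quoted rows, the page_id total for 167398 is 2516, while the total for the title Barack_Obama (including the row without a page_id) is 2519, which shows how the two keys can drift apart by a few views per day.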
