Thank you for your email and thoughtful analysis. I just wanted to say I
saw it but got buried with other work. I'll try to reply early next week.

On Thu, Mar 11, 2021 at 03:50 Ogier Maitre <[email protected]> wrote:

> Hello everybody,
>
> We are currently working on a Wikipedia visualisation tool (which is
> presented here: http://www.wikimaps.io/). We use several pageview
> statistics to generate time series for each page from 2008 to 2020 (we use
> pagecounts, pageviews and pageview_complete). This last format is great for
> our work compared to the previous formats, and we use it for our data from
> 2016 to 2020 (thanks to the analytics team for that).
>
> We aggregate redirects into one page, identified by the page_id (as is
> done in the pageview_complete files).
> But when we compare with the Wikimedia API, we see some small
> differences.
>
> I think this problem comes from the fact that the Wikimedia API (and
> pageviews.toolforge.org) use page_title to get the time series, and I
> saw that pageview_complete files contain entries where the page_title is
> missing (replaced by a "-"). As we use page_id to do the aggregation
> whenever possible, we include these "-" entries, but
> pageviews.toolforge.org probably does not.
>
> For example, for the French page Barack_Obama in the file
> `pageviews-20200112-user.bz2`, I get several relevant entries.
>
>
> fr.wikipedia - 167398 mobile-web 1 B1
> fr.wikipedia Barack 167398 mobile-web 1 X1
> fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
> fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
> fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
> fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
> fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
> fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
> fr.wikipedia Obama 167398 mobile-web 2 R1V1
> fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
> fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
> fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
>
> fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
>
>
> That is 12 entries that use the page_id, and one that does not.
>
> I have two questions about that result.
>
> What kind of query can cause these "-" entries?
> Why does the entry "Barack_Obama mobile-app" appear twice?
>
> Sorry for the long introduction and thank you for your time.
>
> Regards,
> Ogier
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
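
For anyone following the thread, the discrepancy described above can be reproduced from the quoted rows alone. The sketch below is only an illustration under stated assumptions: that fields are space-separated as wiki, title, page_id, access method, daily total and hourly string; that a five-field row is one with no page_id; and that the letters A to X in the last field stand for hours 0 to 23. The names `parse_row` and `decode_hourly` are made up for this example and are not part of any Wikimedia tooling.

```python
import re
from collections import defaultdict

# Rows quoted in the message above, with the wrapped hourly strings rejoined.
SAMPLE = """\
fr.wikipedia - 167398 mobile-web 1 B1
fr.wikipedia Barack 167398 mobile-web 1 X1
fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
fr.wikipedia Obama 167398 mobile-web 2 R1V1
fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
"""


def decode_hourly(encoded):
    """Map e.g. 'Q1R2' to {16: 1, 17: 2} (assumes A..X = hours 0..23)."""
    pairs = re.findall(r"([A-X])(\d+)", encoded)
    return {ord(letter) - ord("A"): int(count) for letter, count in pairs}


def parse_row(line):
    """Split one row; a five-field row is assumed to have no page_id."""
    fields = line.split()
    if len(fields) == 6:
        wiki, title, page_id, access, total, hourly = fields
    elif len(fields) == 5:
        wiki, title, access, total, hourly = fields
        page_id = None
    else:
        raise ValueError(f"unexpected row: {line!r}")
    return {
        "wiki": wiki,
        "title": title,
        "page_id": page_id,
        "access": access,
        "total": int(total),
        "hourly": decode_hourly(hourly),
    }


rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]

# Aggregate the daily totals two ways: by page_id and by page_title.
by_page_id = defaultdict(int)
by_title = defaultdict(int)
for row in rows:
    by_title[row["title"]] += row["total"]
    if row["page_id"] is not None:
        by_page_id[row["page_id"]] += row["total"]

print("by page_id:", dict(by_page_id))
print("by title Barack_Obama:", by_title["Barack_Obama"])
```

On these rows, grouping by page_id gives 2516 views for page 167398, while grouping by the title Barack_Obama gives 2519 (2490 from the rows that carry the id, plus 29 from the row without one). The two methods therefore disagree on the redirect rows, the "-" row and the row with no page_id.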
