Hi again Ogier,

> I don't exactly understand the part about the page_id being defined in the
> request. I thought the page_id was "resolved" based on the page_title being
> in the uri_query.

This is not how the page_id is set in our traffic datasets :)
We receive the page_id in an HTTP header, set by the UIs. We have historically
received the values for `desktop` and `mobile-web` pretty consistently, but
the fact that we receive them for `mobile-app` is new to me :)
I assume that getting the data consistently will then be a matter of
mobile-app updates.

I hope this helps :)
Cheers
Joseph

On Mon, Mar 15, 2021 at 4:29 PM Ogier Maitre wrote:

> Hello Joseph,
>
> Thank you for your detailed response.
> We suspected curid could be part of the equation here, but it is nice to
> have it confirmed (at least for part of the answer).
>
> > The entry appears two times because for one of them there is no page_id
> > defined in the request, therefore it is categorised as different from the
> > one having a page_id defined.
>
> I don't exactly understand the part about the page_id being defined in
> the request. I thought the page_id was "resolved" based on the page_title
> being in the uri_query.
> But this is more to satisfy my curiosity, as I'm currently bundling these
> entries with the ones having a page_id, thanks to the page.sql table. I was
> mainly asking this in the hope of seeing these kinds of entries disappear in
> the future, which could simplify my aggregation process.
>
> Thank you again for your answer.
> Regards,
> Ogier
>
> On 15 March 2021 at 14:10, Joseph Allemandou wrote:
>
> Hello Ogier,
> Thank you a lot for the wikimaps work, and your thorough analysis on the
> pageviews :)
>
> Here is what I found on your two questions, investigating one day of
> recent `user` pageview data (we keep detailed data for 90 days
> only and I needed those details for the analysis).
>
> > What kind of query can cause these "-" entries?
> Pages with a defined page_id and an undefined title ('-') represented
> 0.04%, a bit more than 227k hits.
> Among those, 152K requests had a `curid=NUMBER` in their uri_query
> (meaning they specified the page to view only by id, and we don't
> extract page_title from ids).
> More than 65K don't have any page title or page id specified in the URL,
> but have one specified in HTTP headers. This feels like either a bug or an
> unexpected user behavior.
> And more than 10k use a `diff=` uri pattern, providing a diff between
> revisions for a given page, but not providing the page in the URL.
> I also found, for `mobile-app` cases, that some page titles were
> incorrectly rejected as invalid for Chinese Wikipedia. This happens on a
> very small number of lines (fewer than 10 per day from my findings).
>
> > Why does the entry "Barack_Obama mobile-app" appear two times?
> The entry appears two times because for one of them there is no page_id
> defined in the request, therefore it is categorised as different from the
> one having a page_id defined. While it could be possible to bundle all rows
> with the same title under a page_id if one of the rows has the page_id
> defined, we could also have problems for hours where a rename occurs (two
> different page_ids for the same title). I'll bring the concern to the team,
> but given the relatively small number of views impacted by this case,
> chances are we will not prioritise it soon.
>
> Please let us know if you have other questions :)
> Best
> Joseph
>
> On Sun, Mar 14, 2021 at 1:53 AM Dan Andreescu wrote:
>
>> Thank you for your email and thoughtful analysis, I just wanted to say I
>> saw it but got buried with other work. I'll try and reply early next week.
>>
>> On Thu, Mar 11, 2021 at 03:50 Ogier Maitre wrote:
>>
>>> Hello everybody,
>>>
>>> We are currently working on a wikipedia visualisation tool (which is
>>> presented here: http://www.wikimaps.io/).
>>> We use several pageview statistics to generate time series for each page
>>> from 2008 to 2020 (we use pagecounts, pageviews and pageview_complete).
>>> This last format is great for our work compared to the previous formats,
>>> and we use it for our data from 2016 to 2020 (thanks to the analytics
>>> team for that).
>>>
>>> We aggregate redirections as one page, identified by the page_id (as it
>>> is done in the pageview_complete files).
>>> But when we compare with the wikimedia API, we have some small
>>> differences.
>>>
>>> I think this problem comes from the fact that the wikimedia API (and
>>> pageviews.toolforge.org) uses page_title to get the time series, and I
>>> saw that pageview_complete files contain entries where the page_title is
>>> missing (replaced by a "-"). As we are using page_id to do the aggregation
>>> whenever possible, we aggregate these "-" entries, but
>>> pageviews.toolforge.org probably does not.
>>>
>>> For example, for the page Barack_Obama in French and the file
>>> `pageviews-20200112-user.bz2`, I get several relevant entries.
>>>
>>> fr.wikipedia - 167398 mobile-web 1 B1
>>> fr.wikipedia Barack 167398 mobile-web 1 X1
>>> fr.wikipedia Barack_Hussein_Obama 167398 mobile-web 1 J1
>>> fr.wikipedia Barack_Obama 167398 desktop 748 A18B10C5D8E3F3G8H6I18J36K41L37M35N37O55P76Q65R57S48T29U56V42W23X32
>>> fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1
>>> fr.wikipedia Barack_Obama 167398 mobile-web 1732 A62B38C28D17E24F10G16H43I40J56K65L78M87N100O95P100Q93R127S84T128U124V184W84X49
>>> fr.wikipedia Natasha_Obama 167398 desktop 3 Q1R2
>>> fr.wikipedia Obama 167398 desktop 11 J2K1M1O1Q2R1S1U1W1
>>> fr.wikipedia Obama 167398 mobile-web 2 R1V1
>>> fr.wikipedia Obama_Barack 167398 desktop 3 N1P2
>>> fr.wikipedia Sacha_Obama 167398 desktop 3 J1O2
>>> fr.wikipedia Sacha_Obama 167398 mobile-web 1 C1
>>>
>>> fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1
>>>
>>> That is 12 entries that use the page_id, and one that does not.
>>>
>>> I have two questions about that result.
>>>
>>> What kind of query can cause these "-" entries?
>>> Why does the entry "Barack_Obama mobile-app" appear two times?
>>>
>>> Sorry for the long introduction and thank you for your time.
>>>
>>> Regards,
>>> Ogier
>>> _______________________________________________
>>> Analytics mailing list
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> Joseph Allemandou (joal) (he / him)
> Staff Data Engineer
> Wikimedia Foundation

--
Joseph Allemandou (joal) (he / him)
Staff Data Engineer
Wikimedia Foundation
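
As an aside to the thread above, here is a minimal Python sketch of the bundling Ogier describes: entries without a page_id are attached to a page_id through a title lookup before daily totals are summed. It rests on several assumptions that are not confirmed in the thread: that a missing page_id shows up as an absent (or non-numeric) field, as in the quoted entries; that the letters A to X in the last field stand for hours 0 to 23; and that `title_to_id` is a hypothetical dictionary standing in for a lookup built from the page.sql table. The function names are illustrative only.

```python
import re
from collections import defaultdict

# One letter (A..X) followed by a count, e.g. "T3".
HOURLY = re.compile(r"([A-X])(\d+)")


def decode_hourly(encoded):
    """Decode 'A1T3' into {0: 1, 19: 3}, assuming A..X map to hours 0..23."""
    return {ord(letter) - ord("A"): int(count)
            for letter, count in HOURLY.findall(encoded)}


def parse_line(line):
    """Split one entry into a dict; page_id is None when it is not given."""
    fields = line.split()
    if len(fields) == 6:
        wiki, title, raw_id, device, total, hourly = fields
        # Assumption: anything non-numeric in this position means "no id".
        page_id = int(raw_id) if raw_id.isdigit() else None
    elif len(fields) == 5:
        # Assumption: the page_id field is simply absent, as in the thread.
        wiki, title, device, total, hourly = fields
        page_id = None
    else:
        raise ValueError(f"unexpected field count: {line!r}")
    return {
        "wiki": wiki,
        "title": title,
        "page_id": page_id,
        "device": device,
        "total": int(total),
        "hourly": decode_hourly(hourly),
    }


def bundle(lines, title_to_id):
    """Sum daily totals per (wiki, page_id, device).

    Rows without a page_id are resolved through title_to_id, a hypothetical
    {(wiki, title): page_id} mapping; unresolved rows stay under page_id None.
    """
    totals = defaultdict(int)
    for line in lines:
        row = parse_line(line)
        page_id = row["page_id"]
        if page_id is None:
            page_id = title_to_id.get((row["wiki"], row["title"]))
        totals[(row["wiki"], page_id, row["device"])] += row["total"]
    return dict(totals)


sample = [
    "fr.wikipedia - 167398 mobile-web 1 B1",
    "fr.wikipedia Barack_Obama 167398 mobile-app 10 A1L1O1Q1T3U2V1",
    "fr.wikipedia Barack_Obama mobile-app 29 B1C1H4J1L1M2N3O3P1R3S5V1W2X1",
]
bundled = bundle(sample, {("fr.wikipedia", "Barack_Obama"): 167398})
```

With the three sample entries taken from the thread, the two `mobile-app` rows end up under the same key. Note that a static title lookup like this one ignores the rename case Joseph mentions (two different page_ids for the same title), so it is only safe when the mapping matches the day being processed.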
