Thanks Joseph. Am I correct in saying that the counts in pageviews are just the aggregated counts for decoded page titles from pagecounts-all-sites?
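
To make the question concrete, here is a minimal Python sketch of the aggregation being asked about. It is only an illustration, not a description of how the pageviews files are actually produced. It assumes each pagecounts row is whitespace-separated as `project title count bytes` with a percent-encoded title; the function name `aggregate_decoded` and the sample rows are hypothetical.

```python
from collections import Counter
from urllib.parse import unquote


def aggregate_decoded(lines):
    """Sum view counts per (project, percent-decoded title).

    Assumed row layout (not confirmed in this thread):
        project title count bytes
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip rows that do not match the assumed layout
        project, raw_title, count = parts[0], parts[1], parts[2]
        # unquote() decodes %XX escapes as UTF-8 by default, so differently
        # encoded spellings of the same title collapse into one key.
        totals[(project, unquote(raw_title))] += int(count)
    return totals


# Made-up sample rows: the first two are the same title, encoded two ways.
sample = [
    "en Caf%C3%A9 3 1000",
    "en Café 2 800",
    "en Main_Page 10 5000",
]
totals = aggregate_decoded(sample)
```

If pageviews works this way, the two `Café` rows above would be reported as a single entry whose count is the sum of both.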

Bo

On Tue, Mar 1, 2016 at 1:39 PM, Joseph Allemandou <[email protected]> wrote:
> Hi !
> pagecounts are regenerated but shouldn't be impacted by the encoding issue
> since page_title is not decoded :)
> Files I expect to have changed are the new version of pageviews:
> http://dumps.wikimedia.org/other/pageviews/2016/2016-02/
> Joseph
>
> On Tue, Mar 1, 2016 at 9:52 PM, Bo Han <[email protected]> wrote:
>>
>> Thanks!
>>
>> Diffing the newly-uploaded files for 20160223-160000 and
>> 20160223-170000 with the previously-uploaded ones shows that their
>> contents are the same. Were the original pagecounts files at
>> http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/
>> not corrupted? The backfill is referring to other data, I assume?
>>
>> Bo
>>
>> On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <[email protected]> wrote:
>> > https://phabricator.wikimedia.org/T128295
>> >
>> > On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Would you mind linking the bug fix here? I couldn't find it on
>> >> phabricator.
>> >>
>> >> Thanks,
>> >> Bo
>> >>
>> >> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
>> >> <[email protected]> wrote:
>> >> > Hey Oliver,
>> >> > It depends on what data you've used: if page_title or other 'encoding
>> >> > sensitive' data (I can't think of any other, but ...) is part of it,
>> >> > then yes, you should !
>> >> >
>> >> > On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Hey Joseph,
>> >> >>
>> >> >> Thanks for letting us know. So we should delete and backfill last
>> >> >> week's data, for our regularly scheduled scripts?
>> >> >>
>> >> >> On 1 March 2016 at 08:26, Joseph Allemandou
>> >> >> <[email protected]> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > TL,DR: Please don't use hive / spark / hadoop before next week.
>> >> >> >
>> >> >> > Last week the Analytics Team performed an upgrade to the Hadoop
>> >> >> > Cluster.
>> >> >> > It went reasonably well except for many of the hadoop processes
>> >> >> > were launched with a special option to NOT use utf-8 as default
>> >> >> > encoding.
>> >> >> > This issue caused trouble particularly in page title extraction
>> >> >> > and was detected last sunday (many kudos to the people having
>> >> >> > filled bugs on Analytics API about encoding :)
>> >> >> > We found the bug and fixed it yesterday, and backfill starts
>> >> >> > today, with the cluster recomputing every dataset starting
>> >> >> > 2016-02-23 onward.
>> >> >> > This means you shouldn't query last week data during this week,
>> >> >> > first because it is incorrect, and second because you'll curse
>> >> >> > the cluster for being too slow :)
>> >> >> >
>> >> >> > We are sorry for the inconvenience.
>> >> >> > Don't hesitate to contact us if you have any question
>> >> >> >
>> >> >> > --
>> >> >> > Joseph Allemandou
>> >> >> > Data Engineer @ Wikimedia Foundation
>> >> >> > IRC: joal
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Engineering mailing list
>> >> >> > [email protected]
>> >> >> > https://lists.wikimedia.org/mailman/listinfo/engineering
>> >> >>
>> >> >> --
>> >> >> Oliver Keyes
>> >> >> Count Logula
>> >> >> Wikimedia Foundation
>> >> >
>> >> > --
>> >> > Joseph Allemandou
>> >> > Data Engineer @ Wikimedia Foundation
>> >> > IRC: joal
>> >> >
>> >> > _______________________________________________
>> >> > Analytics mailing list
>> >> > [email protected]
>> >> > https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> Joseph Allemandou
> Data Engineer @ Wikimedia Foundation
> IRC: joal
