Thanks! Diffing the newly uploaded files for 20160223-160000 and 20160223-170000 against the previously uploaded ones shows that their contents are identical. Does that mean the original pagecounts files at http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/ were not corrupted? I assume the backfill refers to other data?
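
For anyone wanting to repeat the comparison, here is a minimal sketch of one way to do it (not necessarily how the diff above was run). It assumes the hourly files are gzip-compressed and compares the decompressed bytes rather than the archives themselves, since two gzip files can differ in their headers while holding the same data. The function names are hypothetical.

```python
import gzip
import hashlib


def content_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a gzip file's decompressed contents."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # Read in chunks so large hourly files are never held fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(old_path, new_path):
    """Return True if both gzip files decompress to identical bytes."""
    return content_digest(old_path) == content_digest(new_path)
```

Calling same_contents() on the old and new copy of one hourly file (saved under whatever local paths you choose) returns True when the re-upload left the data unchanged.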
Bo

On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <[email protected]> wrote:
> https://phabricator.wikimedia.org/T128295
>
> On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <[email protected]> wrote:
>>
>> Hi,
>>
>> Would you mind linking the bug fix here? I couldn't find it on
>> phabricator.
>>
>> Thanks,
>> Bo
>>
>> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
>> <[email protected]> wrote:
>> > Hey Oliver,
>> > It depends on what data you've used: if page_title or other 'encoding
>> > sensitive' data (I can't think of any other, but ...) is part of it,
>> > then yes, you should !
>> >
>> > On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <[email protected]>
>> > wrote:
>> >>
>> >> Hey Joseph,
>> >>
>> >> Thanks for letting us know. So we should delete and backfill last
>> >> week's data, for our regularly scheduled scripts?
>> >>
>> >> On 1 March 2016 at 08:26, Joseph Allemandou <[email protected]>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > TL,DR: Please don't use hive / spark / hadoop before next week.
>> >> >
>> >> > Last week the Analytics Team performed an upgrade to the Hadoop
>> >> > Cluster.
>> >> > It went reasonably well except for many of the hadoop processes were
>> >> > launched with a special option to NOT use utf-8 as default encoding.
>> >> > This issue caused trouble particularly in page title extraction and
>> >> > was detected last sunday (many kudos to the people having filled bugs
>> >> > on Analytics API about encoding :)
>> >> > We found the bug and fixed it yesterday, and backfill starts today,
>> >> > with the cluster recomputing every dataset starting 2016-02-23 onward.
>> >> > This means you shouldn't query last week data during this week, first
>> >> > because it is incorrect, and second because you'll curse the cluster
>> >> > for being too slow :)
>> >> >
>> >> > We are sorry for the inconvenience.
>> >> > Don't hesitate to contact us if you have any question
>> >> >
>> >> >
>> >> > --
>> >> > Joseph Allemandou
>> >> > Data Engineer @ Wikimedia Foundation
>> >> > IRC: joal
>> >> >
>> >> > _______________________________________________
>> >> > Engineering mailing list
>> >> > [email protected]
>> >> > https://lists.wikimedia.org/mailman/listinfo/engineering
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Oliver Keyes
>> >> Count Logula
>> >> Wikimedia Foundation
>> >
>> >
>> > --
>> > Joseph Allemandou
>> > Data Engineer @ Wikimedia Foundation
>> > IRC: joal
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
