Hi!

The pagecounts files are regenerated, but they shouldn't be affected by the encoding issue, since page_title is not decoded there :)
The files I expect to have changed are the new-version pageviews files:
http://dumps.wikimedia.org/other/pageviews/2016/2016-02/

Joseph
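To illustrate why an undecoded page_title would be insensitive to the default encoding, here is a minimal Python sketch. It assumes titles are stored percent-encoded as they appear in request URLs; the example title is made up, and latin-1 is only a stand-in for whatever non-UTF-8 default the processes actually ran with, which the thread does not specify.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical example title, percent-encoded as it might appear in a URL.
raw_title = "Caf%C3%A9"

# Left undecoded, the title is plain ASCII, so the charset used to read or
# write it makes no difference to the resulting bytes.
undecoded_is_stable = raw_title.encode("utf-8") == raw_title.encode("latin-1")

# Once the percent-escapes are decoded, the bytes have to be interpreted
# with some charset, and the choice changes the resulting string.
title_bytes = unquote_to_bytes(raw_title)
decoded_utf8 = title_bytes.decode("utf-8")     # the intended reading
decoded_wrong = title_bytes.decode("latin-1")  # stand-in for a non-UTF-8 default

print(undecoded_is_stable)
print(decoded_utf8 == decoded_wrong)
```

Only the decoded variant depends on the charset, which matches the expectation that datasets keeping the raw title are unchanged by the backfill.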
On Tue, Mar 1, 2016 at 9:52 PM, Bo Han <[email protected]> wrote:
> Thanks!
>
> Diffing the newly-uploaded files for 20160223-160000 and
> 20160223-170000 with the previously-uploaded ones shows that their
> contents are the same. Were the original pagecounts files at
> http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/
> not corrupted? The backfill is referring to other data, I assume?
>
> Bo
>
> On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <[email protected]> wrote:
> > https://phabricator.wikimedia.org/T128295
> >
> > On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> Would you mind linking the bug fix here? I couldn't find it on
> >> Phabricator.
> >>
> >> Thanks,
> >> Bo
> >>
> >> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
> >> <[email protected]> wrote:
> >> > Hey Oliver,
> >> > It depends on what data you've used: if page_title or other
> >> > 'encoding sensitive' data (I can't think of any other, but ...)
> >> > is part of it, then yes, you should!
> >> >
> >> > On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hey Joseph,
> >> >>
> >> >> Thanks for letting us know. So we should delete and backfill last
> >> >> week's data, for our regularly scheduled scripts?
> >> >>
> >> >> On 1 March 2016 at 08:26, Joseph Allemandou <[email protected]>
> >> >> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > TL;DR: Please don't use hive / spark / hadoop before next week.
> >> >> >
> >> >> > Last week the Analytics Team performed an upgrade to the Hadoop
> >> >> > Cluster.
> >> >> > It went reasonably well, except that many of the Hadoop processes
> >> >> > were launched with a special option to NOT use utf-8 as default
> >> >> > encoding.
> >> >> > This issue caused trouble particularly in page title extraction
> >> >> > and was detected last Sunday (many kudos to the people having
> >> >> > filed bugs on Analytics API about encoding :)
> >> >> > We found the bug and fixed it yesterday, and backfill starts
> >> >> > today, with the cluster recomputing every dataset starting
> >> >> > 2016-02-23 onward.
> >> >> > This means you shouldn't query last week's data during this week,
> >> >> > first because it is incorrect, and second because you'll curse
> >> >> > the cluster for being too slow :)
> >> >> >
> >> >> > We are sorry for the inconvenience.
> >> >> > Don't hesitate to contact us if you have any questions.
> >> >> >
> >> >> > --
> >> >> > Joseph Allemandou
> >> >> > Data Engineer @ Wikimedia Foundation
> >> >> > IRC: joal
> >> >> >
> >> >> > _______________________________________________
> >> >> > Engineering mailing list
> >> >> > [email protected]
> >> >> > https://lists.wikimedia.org/mailman/listinfo/engineering
> >> >>
> >> >> --
> >> >> Oliver Keyes
> >> >> Count Logula
> >> >> Wikimedia Foundation
> >> >
> >> > --
> >> > Joseph Allemandou
> >> > Data Engineer @ Wikimedia Foundation
> >> > IRC: joal
> >> >
> >> > _______________________________________________
> >> > Analytics mailing list
> >> > [email protected]
> >> > https://lists.wikimedia.org/mailman/listinfo/analytics

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
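For anyone repeating the comparison Bo describes, here is a rough Python sketch, assuming the hourly dump files are gzip-compressed. It compares the decompressed contents, not the archives themselves, because a gzip header can carry a timestamp and file name, so two archives with identical contents may still differ byte for byte. The function names and the paths in the usage note are hypothetical.

```python
import gzip
import hashlib


def content_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a gzip file's decompressed content."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as handle:
        # Read in chunks so large hourly files are never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(old_path, new_path):
    """True when both gzip files decompress to identical bytes."""
    return content_digest(old_path) == content_digest(new_path)
```

A call such as same_content("old/pagecounts-20160223-160000.gz", "new/pagecounts-20160223-160000.gz") would then tell you whether a re-uploaded file actually changed.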
