Hi!
The pagecounts files have been regenerated, but they shouldn't be affected
by the encoding issue, since page_title is not decoded in them :)
The files I expect to have changed are the new version of pageviews:
http://dumps.wikimedia.org/other/pageviews/2016/2016-02/
Joseph
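
To illustrate why an undecoded page_title is immune to a wrong default encoding, here is a minimal Python sketch. It is not the cluster's actual code: the title "Caf%C3%A9", the helper decode_title, and the use of latin-1 as a stand-in for the non-UTF-8 default are all assumptions made for the example.

```python
from urllib.parse import unquote_to_bytes


def decode_title(raw_title: str, charset: str) -> str:
    """Percent-decode a title, then interpret the bytes using `charset`.

    Hypothetical helper: it mimics a job whose default charset may or
    may not be UTF-8.
    """
    return unquote_to_bytes(raw_title).decode(charset, errors="replace")


# Hypothetical percent-encoded title; it contains only ASCII characters.
raw = "Caf%C3%A9"

# Left undecoded, the string round-trips identically under either charset,
# which is why files that keep the raw title are unaffected.
undecoded_is_stable = raw.encode("ascii").decode("latin-1") == raw

# Once decoded, the result depends on the charset in effect.
good = decode_title(raw, "utf-8")    # the intended title
bad = decode_title(raw, "latin-1")   # mojibake: each UTF-8 byte read as one char
```

Under these assumptions, only datasets that decode the title (such as the new pageviews files) would differ after the backfill.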

On Tue, Mar 1, 2016 at 9:52 PM, Bo Han <[email protected]> wrote:

> Thanks!
>
> Diffing the newly-uploaded files for 20160223-160000 and
> 20160223-170000 with the previously-uploaded ones shows that their
> contents are the same. Were the original pagecounts files at
> http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/
> not corrupted? The backfill is referring to other data, I assume?
>
> Bo
>
> On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <[email protected]> wrote:
> > https://phabricator.wikimedia.org/T128295
> >
> > On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> Would you mind linking the bug fix here? I couldn't find it on
> >> phabricator.
> >>
> >> Thanks,
> >> Bo
> >>
> >> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
> >> <[email protected]> wrote:
> >> > Hey Oliver,
> >> > It depends on what data you've used: if page_title or other 'encoding
> >> > sensitive' data (I can't think of any other, but ...) is part of it,
> >> > then yes, you should!
> >> >
> >> > On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <[email protected]>
> >> > wrote:
> >> >>
> >> >> Hey Joseph,
> >> >>
> >> >> Thanks for letting us know. So we should delete and backfill last
> >> >> week's data, for our regularly scheduled scripts?
> >> >>
> >> >> On 1 March 2016 at 08:26, Joseph Allemandou <[email protected]>
> >> >> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > TL;DR: Please don't use Hive / Spark / Hadoop before next week.
> >> >> >
> >> >> > Last week the Analytics Team upgraded the Hadoop cluster.
> >> >> > It went reasonably well, except that many of the Hadoop processes
> >> >> > were launched with a special option to NOT use UTF-8 as the default
> >> >> > encoding.
> >> >> > This issue caused trouble particularly in page title extraction and
> >> >> > was detected last Sunday (many kudos to the people who filed bugs
> >> >> > on the Analytics API about encoding :)
> >> >> > We found the bug and fixed it yesterday, and the backfill starts
> >> >> > today, with the cluster recomputing every dataset from 2016-02-23
> >> >> > onward.
> >> >> > This means you shouldn't query last week's data during this week,
> >> >> > first because it is incorrect, and second because you'll curse the
> >> >> > cluster for being too slow :)
> >> >> >
> >> >> > We are sorry for the inconvenience.
> >> >> > Don't hesitate to contact us if you have any questions.
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Joseph Allemandou
> >> >> > Data Engineer @ Wikimedia Foundation
> >> >> > IRC: joal
> >> >> >
> >> >> > _______________________________________________
> >> >> > Engineering mailing list
> >> >> > [email protected]
> >> >> > https://lists.wikimedia.org/mailman/listinfo/engineering
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Oliver Keyes
> >> >> Count Logula
> >> >> Wikimedia Foundation
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Joseph Allemandou
> >> > Data Engineer @ Wikimedia Foundation
> >> > IRC: joal
> >> >
> >> > _______________________________________________
> >> > Analytics mailing list
> >> > [email protected]
> >> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >> >
> >>
> >
> >
> >
> >
>
>



-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
