Thanks Joseph. Am I correct in saying that the counts in pageviews are
just the aggregated counts for decoded page titles from
pagecounts-all-sites?

Bo

On Tue, Mar 1, 2016 at 1:39 PM, Joseph Allemandou
<[email protected]> wrote:
> Hi!
> The pagecounts files were regenerated, but they shouldn't be affected by
> the encoding issue, since page_title is not decoded there :)
> The files I expect to have changed are the new versions of pageviews:
> http://dumps.wikimedia.org/other/pageviews/2016/2016-02/
> Joseph
>
> On Tue, Mar 1, 2016 at 9:52 PM, Bo Han <[email protected]> wrote:
>>
>> Thanks!
>>
>> Diffing the newly-uploaded files for 20160223-160000 and
>> 20160223-170000 with the previously-uploaded ones shows that their
>> contents are the same. Were the original pagecounts files at
>> http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/
>> not corrupted? The backfill is referring to other data, I assume?
>>
>> Bo
>>
>> On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto <[email protected]> wrote:
>> > https://phabricator.wikimedia.org/T128295
>> >
>> > On Tue, Mar 1, 2016 at 2:15 PM, Bo Han <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Would you mind linking the bug fix here? I couldn't find it on
>> >> phabricator.
>> >>
>> >> Thanks,
>> >> Bo
>> >>
>> >> On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou
>> >> <[email protected]> wrote:
>> >> > Hey Oliver,
>> >> > It depends on what data you've used: if page_title or other
>> >> > 'encoding-sensitive' data (I can't think of any other, but ...) is
>> >> > part of it, then yes, you should!
>> >> >
>> >> > On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes <[email protected]> wrote:
>> >> >>
>> >> >> Hey Joseph,
>> >> >>
>> >> >> Thanks for letting us know. So we should delete and backfill last
>> >> >> week's data, for our regularly scheduled scripts?
>> >> >>
>> >> >> On 1 March 2016 at 08:26, Joseph Allemandou <[email protected]> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > TL;DR: Please don't use Hive / Spark / Hadoop before next week.
>> >> >> >
>> >> >> > Last week the Analytics Team performed an upgrade to the Hadoop
>> >> >> > cluster. It went reasonably well, except that many of the Hadoop
>> >> >> > processes were launched with a special option to NOT use UTF-8 as
>> >> >> > the default encoding.
>> >> >> > This issue caused trouble particularly in page title extraction and
>> >> >> > was detected last Sunday (many kudos to the people who filed bugs
>> >> >> > on the Analytics API about encoding :)
>> >> >> > We found the bug and fixed it yesterday, and the backfill starts
>> >> >> > today, with the cluster recomputing every dataset from 2016-02-23
>> >> >> > onward.
>> >> >> > This means you shouldn't query last week's data during this week,
>> >> >> > first because it is incorrect, and second because you'll curse the
>> >> >> > cluster for being too slow :)
>> >> >> >
>> >> >> > We are sorry for the inconvenience.
>> >> >> > Don't hesitate to contact us if you have any questions.
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Joseph Allemandou
>> >> >> > Data Engineer @ Wikimedia Foundation
>> >> >> > IRC: joal
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Engineering mailing list
>> >> >> > [email protected]
>> >> >> > https://lists.wikimedia.org/mailman/listinfo/engineering
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Oliver Keyes
>> >> >> Count Logula
>> >> >> Wikimedia Foundation
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Joseph Allemandou
>> >> > Data Engineer @ Wikimedia Foundation
>> >> > IRC: joal
>> >> >
>> >> > _______________________________________________
>> >> > Analytics mailing list
>> >> > [email protected]
>> >> > https://lists.wikimedia.org/mailman/listinfo/analytics
>> >> >
>> >>
>> >
>> >
>> >
>> >
>>
>
>
>
>
> --
> Joseph Allemandou
> Data Engineer @ Wikimedia Foundation
> IRC: joal
>
>
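
As a side note on the encoding issue described in the original announcement
(Hadoop processes launched without UTF-8 as the default encoding, which broke
page title extraction), here is a minimal Python sketch of how such a mismatch
can garble a title. It is only an illustration: the decode_title helper, the
sample title, and the choice of Latin-1 and ASCII as the "wrong" encodings are
assumptions, not details from this thread or the cluster's actual code.

```python
# Hypothetical illustration of decoding UTF-8 bytes with the wrong default
# encoding. Nothing here is taken from the Analytics cluster's code.

# A sample page title containing a non-ASCII character, as UTF-8 bytes.
RAW_TITLE = "Café_au_lait".encode("utf-8")


def decode_title(raw: bytes, encoding: str) -> str:
    """Decode raw title bytes, replacing bytes the codec cannot handle."""
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # Correct: the bytes are UTF-8, so decoding as UTF-8 restores the title.
    print(decode_title(RAW_TITLE, "utf-8"))
    # Wrong (assumed) defaults: the non-ASCII character is garbled or lost.
    print(decode_title(RAW_TITLE, "latin-1"))
    print(decode_title(RAW_TITLE, "ascii"))
```

Titles made only of ASCII characters decode identically under all three
encodings, so a problem of this kind would only show up for titles containing
non-ASCII characters.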
