@Ori: This needs to be discussed with the team. My 2 cents:

   - Detection: possible to implement as part of one of the Oozie jobs. We
   would count the uri_path values that map to more than one page_title
   (a high count means something is wrong).
   - Prevention: 2 things possible
      - Try to understand WHY this thing happened (very difficult I think,
      possibly related to weird state after upgrade) and ensure we don't fall
      into that state again
      - Force the JVM file.encoding for every Java process of the cluster
      (probably easier, but it is hard to be sure no process is forgotten)
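
To make the detection idea concrete, here is a minimal sketch in plain
Python. It is not the actual Oozie/Hive job: the function names, the sample
rows and the (uri_path, page_title) row shape are all hypothetical, and the
mojibake helper only assumes that the broken titles looked like UTF-8 bytes
decoded as Latin-1, which is a guess about the failure mode.

```python
from collections import defaultdict


def mojibake(title):
    # Assumption: simulate a process that decodes UTF-8 bytes as Latin-1,
    # which is one way a wrong default encoding can corrupt a title.
    return title.encode("utf-8").decode("latin-1")


def paths_with_conflicting_titles(rows):
    """Return {uri_path: set of page_titles} for paths seen with more
    than one distinct page_title. `rows` yields (uri_path, page_title)."""
    titles_by_path = defaultdict(set)
    for uri_path, page_title in rows:
        titles_by_path[uri_path].add(page_title)
    return {p: t for p, t in titles_by_path.items() if len(t) > 1}


def conflict_ratio(rows):
    """Share of distinct uri_paths that have conflicting page_titles."""
    rows = list(rows)
    all_paths = {uri_path for uri_path, _ in rows}
    if not all_paths:
        return 0.0
    return len(paths_with_conflicting_titles(rows)) / len(all_paths)


# Hypothetical sample: one path extracted correctly and once with the
# wrong encoding, plus one pure-ASCII path that is unaffected.
sample_rows = [
    ("/wiki/Caf%C3%A9", "Café"),
    ("/wiki/Caf%C3%A9", mojibake("Café")),
    ("/wiki/Paris", "Paris"),
]

if __name__ == "__main__":
    print(conflict_ratio(sample_rows))
    print(paths_with_conflicting_titles(sample_rows))
```

A job could alert when the ratio goes above some threshold; what threshold
is reasonable would have to come from looking at known-good data, since
redirects or renames may legitimately produce a few conflicts.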

I'd love to have your thoughts / ideas and discuss them with the team.
Thanks

On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <[email protected]> wrote:

> So: what is the plan for making sure this doesn't happen the next time
> around? :)
>
> On Tue, Mar 1, 2016 at 5:26 AM, Joseph Allemandou <
> [email protected]> wrote:
>
>> Hi,
>>
>> *TL;DR: Please don't use hive / spark / hadoop before next week.*
>>
>> Last week the Analytics Team performed an upgrade to the Hadoop Cluster.
>> It went reasonably well, except that many of the Hadoop processes were
>> launched with a special option to NOT use UTF-8 as the default encoding.
>> This issue caused trouble particularly in page title extraction and was
>> detected last Sunday (many kudos to the people who filed bugs on the
>> Analytics API about encoding :)
>> We found the bug and fixed it yesterday, and backfill starts today, with
>> the cluster recomputing every dataset starting 2016-02-23 onward.
>> This means you shouldn't query last week's data during this week, first
>> because it is incorrect, and second because you'll curse the cluster for
>> being too slow :)
>>
>> We are sorry for the inconvenience.
>> Don't hesitate to contact us if you have any questions.
>>
>>
>> --
>> *Joseph Allemandou*
>> Data Engineer @ Wikimedia Foundation
>> IRC: joal
>>
>> _______________________________________________
>> Engineering mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/engineering
>>
>>
>


-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics