After meeting with the team:
 - The encoding issue was caused by the locale being set incorrectly on some
machines (but we don't know why)
 - We will find a way to enforce file.encoding, first looking for a
Java-global way and, if that is not feasible, a process-local way.
 - We will NOT spend computing resources on a job trying to detect this
issue (too costly given the low probability of recurrence, particularly if
we force file.encoding)
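For reference, a minimal sketch of what enforcing file.encoding cluster-wide
might look like. The exact files and property names depend on our Hadoop and
Spark versions; this is illustrative, not the agreed implementation:

```shell
# hadoop-env.sh: force UTF-8 for JVMs started via the hadoop scripts
export HADOOP_OPTS="$HADOOP_OPTS -Dfile.encoding=UTF-8"

# mapred-site.xml equivalents for map/reduce task JVMs:
#   mapreduce.map.java.opts    -> append -Dfile.encoding=UTF-8
#   mapreduce.reduce.java.opts -> append -Dfile.encoding=UTF-8

# spark-defaults.conf: cover Spark driver and executor JVMs
#   spark.driver.extraJavaOptions    -Dfile.encoding=UTF-8
#   spark.executor.extraJavaOptions  -Dfile.encoding=UTF-8
```

Each setting only covers the JVMs launched through that particular path, which
is why a truly Java-global solution is preferable if we can find one.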
Cheers
Joseph

On Wed, Mar 2, 2016 at 11:24 AM, Joseph Allemandou <
[email protected]> wrote:

> @Ori: Needs to be discussed with the team - My 2 cents
>
>    - Detection: possible to implement as part of one of the Oozie jobs.
>    We would compute the number of pages having different page_titles for
>    the same uri_path (if high, not good).
>    - Prevention: two things are possible
>       - Try to understand WHY this happened (very difficult I think,
>       possibly related to a weird state after the upgrade) and ensure we
>       don't fall into that state again
>       - Force JVM file.encoding for every Java process on the cluster
>       (probably easier, though it is not easy to be sure nothing is
>       forgotten)
>
> I'd love to have your thoughts / ideas and discuss them with the team.
> Thanks
>
> On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <[email protected]> wrote:
>
>> So: what is the plan for making sure this doesn't happen next time
>> around? :)
>>
>> On Tue, Mar 1, 2016 at 5:26 AM, Joseph Allemandou <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> *TL,DR: Please don't use hive / spark / hadoop before next week.*
>>>
>>> Last week the Analytics Team performed an upgrade of the Hadoop cluster.
>>> It went reasonably well, except that many of the Hadoop processes were
>>> launched with an option that made them NOT use UTF-8 as the default
>>> encoding.
>>> This issue caused trouble, particularly in page title extraction, and
>>> was detected last Sunday (many kudos to the people who filed bugs about
>>> encoding on the Analytics API :)
>>> We found the bug and fixed it yesterday, and backfilling starts today,
>>> with the cluster recomputing every dataset from 2016-02-23 onward.
>>> This means you shouldn't query last week's data during this week: first
>>> because it is incorrect, and second because you'll curse the cluster for
>>> being too slow :)
>>>
>>> We are sorry for the inconvenience.
>>> Don't hesitate to contact us if you have any questions.
>>>
>>>
>>> --
>>> *Joseph Allemandou*
>>> Data Engineer @ Wikimedia Foundation
>>> IRC: joal
>>>
>>> _______________________________________________
>>> Engineering mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/engineering
>>>
>>>
>>
>
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
>



-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
