After meeting with the team:
- The encoding issue was due to the locale being wrongly set on some machines (we don't yet know why).
- We will find a way to enforce file.encoding, first looking for a JVM-global way and, if that is not feasible, a process-local way.
- We will NOT spend computing resources on a job trying to detect this issue (too costly for its occurrence probability, particularly if we force file.encoding).

Cheers
Joseph
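As a minimal sketch of the "enforce file.encoding" idea: the `file.encoding` system property and the `JAVA_TOOL_OPTIONS` environment variable are standard JVM mechanisms, but this guard class itself is hypothetical, not something from the thread. It only reports whether the running JVM defaults to UTF-8:

```java
import java.nio.charset.Charset;

// Hypothetical startup guard: report whether this JVM's default charset
// is UTF-8, so a misconfigured locale is noticed early instead of
// silently producing mojibake in downstream jobs.
public class EncodingGuard {
    public static void main(String[] args) {
        String encodingProp = System.getProperty("file.encoding");
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("file.encoding  = " + encodingProp);
        System.out.println("defaultCharset = " + defaultCharset);
        if ("UTF-8".equalsIgnoreCase(defaultCharset.name())) {
            System.out.println("OK: default charset is UTF-8");
        } else {
            // One JVM-global fix is setting JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8
            // in the environment; a process-local fix is passing
            // -Dfile.encoding=UTF-8 on each java command line.
            System.err.println("WARNING: default charset is " + defaultCharset
                    + ", not UTF-8");
        }
    }
}
```

Note that `file.encoding` must be set at JVM startup; changing the property at runtime does not change the already-initialized default charset.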
On Wed, Mar 2, 2016 at 11:24 AM, Joseph Allemandou <[email protected]> wrote:

> @Ori: Needs to be discussed with the team - my 2 cents:
>
> - Detection: possible to implement as part of one of the Oozie jobs.
>   We would compute the number of pages having a different page_title for
>   the same uri_path (if high, not good).
> - Prevention: 2 things possible:
>   - Try to understand WHY this happened (very difficult I think,
>     possibly related to a weird state after the upgrade) and ensure we
>     don't fall into that state again.
>   - Force JVM file.encoding for every Java process on the cluster
>     (probably easier, but not really easy not to forget anything).
>
> I'd love to have your thoughts / ideas and discuss them with the team.
> Thanks
>
> On Wed, Mar 2, 2016 at 10:53 AM, Ori Livneh <[email protected]> wrote:
>
>> So: what is the plan for making sure this doesn't happen the next
>> time around? :)
>>
>> On Tue, Mar 1, 2016 at 5:26 AM, Joseph Allemandou <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> *TL;DR: Please don't use Hive / Spark / Hadoop before next week.*
>>>
>>> Last week the Analytics Team performed an upgrade of the Hadoop cluster.
>>> It went reasonably well, except that many of the Hadoop processes were
>>> launched with a special option to NOT use UTF-8 as the default encoding.
>>> This issue caused trouble particularly in page-title extraction and was
>>> detected last Sunday (many kudos to the people who filed bugs on the
>>> Analytics API about encoding :)
>>> We found the bug and fixed it yesterday, and backfilling starts today,
>>> with the cluster recomputing every dataset from 2016-02-23 onward.
>>> This means you shouldn't query last week's data during this week, first
>>> because it is incorrect, and second because you'll curse the cluster for
>>> being too slow :)
>>>
>>> We are sorry for the inconvenience.
>>> Don't hesitate to contact us if you have any questions.
>>>
>>> --
>>> *Joseph Allemandou*
>>> Data Engineer @ Wikimedia Foundation
>>> IRC: joal
>>>
>>> _______________________________________________
>>> Engineering mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/engineering

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
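For reference, the detection heuristic discussed up-thread (counting uri_paths that end up with more than one distinct page_title) could be sketched roughly as below. The field names and toy data are illustrative only; this is not the actual Oozie job, which the team decided not to build:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the proposed check: group extracted page titles
// by uri_path and count how many paths map to more than one title.
// A high count would suggest an encoding problem in title extraction.
public class TitleConsistencyCheck {

    /** Each row is {uri_path, page_title}; returns the number of paths
     *  associated with more than one distinct title. */
    static long inconsistentPaths(List<String[]> rows) {
        Map<String, Set<String>> titlesByPath = new HashMap<>();
        for (String[] row : rows) {
            String uriPath = row[0];
            String pageTitle = row[1];
            titlesByPath.computeIfAbsent(uriPath, k -> new HashSet<>())
                        .add(pageTitle);
        }
        return titlesByPath.values().stream()
                           .filter(titles -> titles.size() > 1)
                           .count();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"/wiki/Z%C3%BCrich", "Zürich"},
            new String[]{"/wiki/Z%C3%BCrich", "ZÃ¼rich"}, // mojibake from a non-UTF-8 decode
            new String[]{"/wiki/Main_Page", "Main Page"});
        System.out.println("paths with inconsistent titles: "
                + inconsistentPaths(rows)); // → 1
    }
}
```

In production this aggregation would run over the full pageview dataset (e.g. as a Hive query), but the grouping logic is the same.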
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
