Hi all, I’m continuing this thread just because it involves a little more Hadoop downtime.
Ops will be replacing a switch tomorrow. This was previously announced in reference to eventlog1001, but this switch replacement will affect Hadoop as well. During this migration, the ResourceManager will not be reachable, which means that running jobs are likely to die. This switch replacement is scheduled to take about 15 minutes, starting at 13:00 UTC tomorrow. Joseph and I will monitor the status of things, and restart any jobs as necessary.

Today I worked on getting a High Availability ResourceManager in place (we seem to need to restart that thing much more often these days), but I won’t be able to have this installed tomorrow. I foresee a few more cluster restarts in the next week or so, in order to apply this and some other changes. These restarts won’t result in long-term cluster downtime like Monday’s upgrade did, but they might disrupt some jobs. I will announce any restarts here.

Thanks!
-Ao

> On May 4, 2015, at 17:42, Andrew Otto <[email protected]> wrote:
>
> Phew, ok, things did go wrong! We ran into a couple of bugs recently
> introduced in Yarn and in Hive, and it took us a while to find workarounds.
> Jobs are again flowing through the cluster. However, jobs have been lagging
> behind since they haven’t been able to run all day. They should eventually
> catch up. For now, the cluster is back open for business, but I’d appreciate
> it if no one ran any heavy jobs until tomorrow.
>
> Also, it is still possible we may run into other issues we haven’t yet seen,
> so I can’t guarantee that I won’t have to restart things again.
>
> Anyway, aside from those hiccups, CDH 5.4.0 is now installed, and Hive 1.1 and
> Spark 1.3.0 are now available, weeeeee!
>
> -Ao
>
>
>> On May 4, 2015, at 11:05, Andrew Otto <[email protected]> wrote:
>>
>> Hi all, as a reminder, I will be doing this upgrade today. Within the next
>> hour I will turn off the Hadoop cluster. Please do not attempt to use it
>> until I notify you again.
>>
>> Thanks!
>> -AO
>>
>>
>>> On Apr 29, 2015, at 14:57, Robert West <[email protected]> wrote:
>>>
>>> All good!
>>>
>>> On Wed, Apr 29, 2015 at 11:35 AM, Aaron Halfaker
>>> <[email protected]> wrote:
>>>> + the right research list (Andrew, remove wmfresearch@ from your contact
>>>> list :P )
>>>>
>>>> All looks good to me. Thanks. :)
>>>>
>>>> On Wed, Apr 29, 2015 at 1:11 PM, Leila Zia <[email protected]> wrote:
>>>>>
>>>>> FYI
>>>>>
>>>>> Ashwin, Bob, Ellery, I don't anticipate this having negative impact on our
>>>>> workflow. If you see possible issues, please communicate with Andrew
>>>>> (cc-ing me), or let me know and I communicate. Thanks!
>>>>>
>>>>>
>>>>> ---------- Forwarded message ----------
>>>>> From: Andrew Otto <[email protected]>
>>>>> Date: Wed, Apr 29, 2015 at 11:05 AM
>>>>> Subject: [wmfresearch] Hadoop Cluster Downtime
>>>>> To: Operations Engineers <[email protected]>,
>>>>> "A mailing list for the Analytics Team at WMF and everybody who has an
>>>>> interest in Wikipedia and analytics." <[email protected]>,
>>>>> "[email protected] Research" <[email protected]>
>>>>>
>>>>>
>>>>> Hi all!
>>>>>
>>>>> CDH 5.4 is out[1] and we’d like to upgrade. We are doing this now, rather
>>>>> than later, because there is an important Parquet/Hive related bug that
>>>>> has been fixed in this version[2]. This upgrade will include Spark 1.3,
>>>>> which should at least make one researcher happy.
>>>>>
>>>>> To do this upgrade, I need to schedule some downtime for Hadoop. I’d like
>>>>> to do this on Monday May 4th. I expect the upgrade to take me no more
>>>>> than an hour or two, but just to be safe I’d like to schedule the
>>>>> downtime for the whole day.
>>>>>
>>>>> If anyone has critical things that they absolutely have to run on Monday,
>>>>> let me know now and I will find another day.
>>>>>
>>>>> Thanks!
>>>>> -Ao
>>>>>
>>>>> [1] http://blog.cloudera.com/blog/2015/04/cloudera-enterprise-5-4-is-released/
>>>>> [2] https://issues.apache.org/jira/browse/HIVE-9482
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> wmfresearch mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/wmfresearch
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Research-Internal mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/research-internal
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Research-Internal mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/research-internal
>>>>
>>>
>>>
>>> --
>>> Up for a little language game? -- http://www.unfun.me
>>>
>>> _______________________________________________
>>> Ops mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/ops
>>
>

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
