Hans,

Thanks. This does look like the right answer. We thought it might be a previous map-reduce process interfering with itself, but Arvind carefully checked the logs and it failed on the initial start-up. So something else grabbing an ephemeral port is the likely culprit. Not much else is running on these VMs, but enough to cause conflicts, I guess.
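For reference, a rough sketch of how one might confirm that kind of collision on one of these VMs (assuming Linux and Python 3; only port 50030 comes from this thread, the other ports are common defaults listed purely for illustration):

    #!/usr/bin/env python3
    # Rough sketch, assumes Linux and Python 3. Only port 50030 (the
    # "Port in use: 0.0.0.0:50030" error below) comes from this thread;
    # the other ports are common defaults listed for illustration.
    import socket

    HADOOP_PORTS = [50030, 50070, 60010]

    def ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
        """Return the kernel's ephemeral (local) port range as (low, high)."""
        with open(path) as f:
            low, high = f.read().split()
        return int(low), int(high)

    def port_free(port):
        """Try to bind the port; EADDRINUSE means something already owns it."""
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("0.0.0.0", port))
            return True
        except OSError:
            return False
        finally:
            s.close()

    if __name__ == "__main__":
        low, high = ephemeral_range()
        print("ephemeral port range: %d-%d" % (low, high))
        for port in HADOOP_PORTS:
            print("port %d: inside ephemeral range=%s, currently free=%s"
                  % (port, low <= port <= high, port_free(port)))

Both of the remedies Hans suggests amount to making the "inside ephemeral range" flag come out false, either by moving the service ports out of that range or by changing net.ipv4.ip_local_port_range so it no longer covers them.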
--Steve

> -----Original Message-----
> From: Hans Zeller [mailto:[email protected]]
> Sent: Tuesday, May 17, 2016 12:21 PM
> To: dev <[email protected]>
> Subject: Re: Trafodion release2.0 Daily Test Result - 14 - Still Failing
>
> One option would be to configure the Hadoop/HBase ports
> <http://trafodion.apache.org/port-assignment.html> to use the non-ephemeral
> range, another to change the ephemeral range
> <http://unix.stackexchange.com/questions/249275/bind-failure-address-in-use-unable-to-use-a-tcp-port-for-both-source-and-desti>
> so that it doesn't conflict with the Hadoop ports. Is it worth the trouble,
> or do you just want to recognize the conflict quickly and take the
> problematic node out of the pool?
>
> Hans
>
> On Tue, May 17, 2016 at 11:55 AM, Steve Varnau <[email protected]> wrote:
>
> > Arvind and I are picking through the logs. It looks like this particular
> > VM started up in such a way that one of the map-reduce services had a
> > port conflict, and hence Cloudera Manager reported failure every time the
> > installer tried to re-start the cluster.
> >
> > java.net.BindException: Port in use: 0.0.0.0:50030
> >
> > So it is a test environment problem -- the cluster already had an issue
> > before the Trafodion installer ran.
> >
> > Not quite sure of a good way to get an automated fix for the environment.
> > Maybe I could code a better health check and take the node offline before
> > it affects multiple test jobs. It is not frequent, but when it occurs,
> > several jobs can be impacted.
> >
> > --Steve
> >
> >
> > > -----Original Message-----
> > > From: Steve Varnau [mailto:[email protected]]
> > > Sent: Tuesday, May 17, 2016 10:25 AM
> > > To: '[email protected]' <[email protected]>
> > > Subject: RE: Trafodion release2.0 Daily Test Result - 14 - Still Failing
> > >
> > > Yes, it is interesting that there was one bad node that always reported
> > > failure on re-start.
> > > HBase looked good to me, so it might be a different service CMgr is
> > > complaining about.
> > > I'll spin up that VM so we can examine the logs that were not archived.
> > >
> > > --Steve
> > >
> > >
> > > > -----Original Message-----
> > > > From: Narain Arvind [mailto:[email protected]]
> > > > Sent: Tuesday, May 17, 2016 10:22 AM
> > > > To: [email protected]
> > > > Subject: RE: Trafodion release2.0 Daily Test Result - 14 - Still Failing
> > > >
> > > > Hi Steve,
> > > >
> > > > All the non-udr failures seem to be related to the restart of the HBase
> > > > environment on i-0c5597d1. Is it possible to access this system and
> > > > look at the logs?
> > > >
> > > > "resultMessage" : "Command 'Start' failed for cluster 'trafcluster'",
> > > > "children" : {
> > > >   "items" : [ {
> > > >     "id" : 151,
> > > >     "name" : "Start",
> > > >     "startTime" : "2016-05-17T06:27:19.295Z",
> > > >     "endTime" : "2016-05-17T06:28:05.105Z",
> > > >     "active" : false,
> > > >     "success" : false,
> > > >     "resultMessage" : "At least one service failed to start."
> > > >
> > > > Thanks
> > > > Arvind
> > > >
> > > > -----Original Message-----
> > > > From: [email protected] [mailto:[email protected]]
> > > > Sent: Tuesday, May 17, 2016 1:28 AM
> > > > To: [email protected]
> > > > Subject: Trafodion release2.0 Daily Test Result - 14 - Still Failing
> > > >
> > > > Daily Automated Testing release2.0
> > > >
> > > > Jenkins Job: https://jenkins.esgyn.com/job/Check-Daily-release2.0/14/
> > > > Archived Logs: http://traf-testlogs.esgyn.com/Daily-release2.0/14
> > > > Bld Downloads: http://traf-builds.esgyn.com
> > > >
> > > > Changes since previous daily build:
> > > > No changes
> > > >
> > > > Test Job Results:
> > > >
> > > > FAILURE core-regress-charsets-cdh (4 min 27 sec)
> > > > FAILURE core-regress-compGeneral-cdh (9 min 44 sec)
> > > > FAILURE core-regress-seabase-cdh (4 min 44 sec)
> > > > FAILURE core-regress-udr-cdh (29 min)
> > > > FAILURE core-regress-udr-hdp (41 min)
> > > > FAILURE phoenix_part1_T4-cdh (5 min 48 sec)
> > > > FAILURE phoenix_part2_T2-cdh (4 min 39 sec)
> > > > SUCCESS build-release2.0-debug (25 min)
> > > > SUCCESS build-release2.0-release (29 min)
> > > > SUCCESS core-regress-charsets-hdp (48 min)
> > > > SUCCESS core-regress-compGeneral-hdp (46 min)
> > > > SUCCESS core-regress-core-cdh (49 min)
> > > > SUCCESS core-regress-core-hdp (59 min)
> > > > SUCCESS core-regress-executor-cdh (58 min)
> > > > SUCCESS core-regress-executor-hdp (1 hr 14 min)
> > > > SUCCESS core-regress-fullstack2-cdh (13 min)
> > > > SUCCESS core-regress-fullstack2-hdp (22 min)
> > > > SUCCESS core-regress-hive-cdh (34 min)
> > > > SUCCESS core-regress-hive-hdp (43 min)
> > > > SUCCESS core-regress-privs1-cdh (37 min)
> > > > SUCCESS core-regress-privs1-hdp (56 min)
> > > > SUCCESS core-regress-privs2-cdh (42 min)
> > > > SUCCESS core-regress-privs2-hdp (44 min)
> > > > SUCCESS core-regress-qat-cdh (21 min)
> > > > SUCCESS core-regress-qat-hdp (21 min)
> > > > SUCCESS core-regress-seabase-hdp (1 hr 20 min)
> > > > SUCCESS jdbc_test-cdh (24 min)
> > > > SUCCESS jdbc_test-hdp (41 min)
> > > > SUCCESS phoenix_part1_T2-cdh (1 hr 0 min)
> > > > SUCCESS phoenix_part1_T2-hdp (1 hr 30 min)
> > > > SUCCESS phoenix_part1_T4-hdp (1 hr 6 min)
> > > > SUCCESS phoenix_part2_T2-hdp (1 hr 17 min)
> > > > SUCCESS phoenix_part2_T4-cdh (44 min)
> > > > SUCCESS phoenix_part2_T4-hdp (1 hr 0 min)
> > > > SUCCESS pyodbc_test-cdh (16 min)
> > > > SUCCESS pyodbc_test-hdp (15 min)
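On the health-check idea mentioned above, a rough sketch of what such a pre-flight check could look like (assuming Python 3; the service/port map apart from 50030 and the exit-code convention are hypothetical, not an existing Trafodion or Jenkins script):

    #!/usr/bin/env python3
    # Rough sketch of a pre-flight node health check, as floated in the
    # thread above. Assumptions: Python 3; the service/port map (apart from
    # 50030) and the exit-code convention are illustrative only.
    import socket
    import sys

    REQUIRED_SERVICES = {
        "mapreduce-jobtracker-ui": 50030,  # the port from the BindException above
        "hdfs-namenode-ui": 50070,         # illustrative defaults
        "hbase-master-ui": 60010,
    }

    def listening(port, host="127.0.0.1", timeout=2.0):
        """Return True if something accepts TCP connections on host:port."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def main():
        down = [name for name, port in sorted(REQUIRED_SERVICES.items())
                if not listening(port)]
        if down:
            print("unhealthy node, not listening: %s" % ", ".join(down))
            return 1
        print("node looks healthy")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Run before the installer kicks off, a non-zero exit could be used to pull the VM out of the pool before it impacts several test jobs at once, which matches the failure pattern described above.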
