[
https://issues.apache.org/jira/browse/AMBARI-18262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Hurley updated AMBARI-18262:
-------------------------------------
Status: Patch Available (was: Open)
> When Enabling NameNode HA Via the UI Wizard, the Second NN Fails to Start
> -------------------------------------------------------------------------
>
> Key: AMBARI-18262
> URL: https://issues.apache.org/jira/browse/AMBARI-18262
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.4.0
> Reporter: Jonathan Hurley
> Assignee: Jonathan Hurley
> Priority: Blocker
> Fix For: trunk
>
> Attachments: AMBARI-18262.patch
>
>
> Caused by: AMBARI-18240
> In the Enable NameNode HA wizard, the failure happens at the "Start Additional
> NameNode" step.
> The first NameNode starts...
> {code}
> "href" :
> "https://172.22.115.113:8443/api/v1/clusters/cl1/requests/46/tasks/368",
> "Tasks" : {
> "attempt_cnt" : 1,
> "cluster_name" : "cl1",
> "command" : "START",
> "command_detail" : "NAMENODE START",
> "end_time" : 1472080011602,
> "error_log" : "/var/lib/ambari-agent/data/errors-368.txt",
> "exit_code" : 0,
> "host_name" : "nat-sp12-rnqs-amb-views-ha-6-5.openstacklocal",
> "id" : 368,
> "output_log" : "/var/lib/ambari-agent/data/output-368.txt",
> "request_id" : 46,
> "role" : "NAMENODE",
> "stage_id" : 0,
> "start_time" : 1472079963470,
> "status" : "COMPLETED",
> "stderr" : "2016-08-24 23:06:11,102 - Getting jmx metrics from NN failed.
> URL:
> http://nat-sp12-rnqs-amb-views-ha-6-5.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
> (most recent call last):\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
> line 42, in get_value_from_jmx\n return
> data_dict[\"beans\"][0][property]\nIndexError: list index out of
> range\n2016-08-24 23:06:14,332 - Getting jmx metrics from NN failed. URL:
> http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
> (most recent call last):\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
> line 38, in get_value_from_jmx\n _, data, _ = get_user_call_output(cmd,
> user=run_user, quiet=False)\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
> line 61, in get_user_call_output\n raise Fail(err_msg)\nFail: Execution
> of 'curl --negotiate -u : -s
> 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
> 1>/tmp/tmprdewEy 2>/tmp/tmpAmLket' returned 7. \n\n2016-08-24 23:06:22,280 -
> Getting jmx metrics from NN failed. URL:
> http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
> (most recent call last):\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
> line 38, in get_value_from_jmx\n _, data, _ = get_user_call_output(cmd,
> user=run_user, quiet=False)\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
> line 61, in get_user_call_output\n raise Fail(err_msg)\nFail: Execution
> of 'curl --negotiate -u : -s
> 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
> 1>/tmp/tmpHKH50b 2>/tmp/tmp6yyuWH' returned 7. \n\n2016-08-24 23:06:30,637 -
> Getting jmx metrics from NN failed. URL:
> http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
> (most recent call last):\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
> line 38, in get_value_from_jmx\n _, data, _ = get_user_call_output(cmd,
> user=run_user, quiet=False)\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
> line 61, in get_user_call_output\n raise Fail(err_msg)\nFail: Execution
> of 'curl --negotiate -u : -s
> 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
> 1>/tmp/tmpCXMjfH 2>/tmp/tmpq103ei' returned 7. \n\n2016-08-24 23:06:39,495 -
> Getting jmx metrics from NN failed. URL:
> http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
> (most recent call last):\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
> line 38, in get_value_from_jmx\n _, data, _ = get_user_call_output(cmd,
> user=run_user, quiet=False)\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
> line 61, in get_user_call_output\n raise Fail(err_msg)\nFail: Execution
> of 'curl --negotiate -u : -s
> 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
> 1>/tmp/tmpvdE9iJ 2>/tmp/tmpy9eAby' returned 7. \n\n2016-08-24 23:06:47,584 -
> Getting jmx metrics from NN failed. URL:
> http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
> (most recent call last):\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
> line 38, in get_value_from_jmx\n _, data, _ = get_user_call_output(cmd,
> user=run_user, quiet=False)\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py\",
> line 61, in get_user_call_output\n raise Fail(err_msg)\nFail: Execution
> of 'curl --negotiate -u : -s
> 'http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
> 1>/tmp/tmp0Jx91E 2>/tmp/tmp6qu0gW' returned 7.",
> {code}
> The second does not:
> {code}
> {
> "href" :
> "https://172.22.115.113:8443/api/v1/clusters/cl1/requests/47/tasks/369",
> "Tasks" : {
> "attempt_cnt" : 1,
> "cluster_name" : "cl1",
> "command" : "START",
> "command_detail" : "NAMENODE START",
> "end_time" : 1472080160611,
> "error_log" : "/var/lib/ambari-agent/data/errors-369.txt",
> "exit_code" : 1,
> "host_name" : "nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal",
> "id" : 369,
> "output_log" : "/var/lib/ambari-agent/data/output-369.txt",
> "request_id" : 47,
> "role" : "NAMENODE",
> "stage_id" : 0,
> "start_time" : 1472080026015,
> "status" : "FAILED",
> "stderr" : "2016-08-24 23:07:13,642 - Getting jmx metrics from NN failed.
> URL:
> http://nat-sp12-rnqs-amb-views-ha-6-1.openstacklocal:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem\nTraceback
> (most recent call last):\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py\",
> line 42, in get_value_from_jmx\n return
> data_dict[\"beans\"][0][property]\nIndexError: list index out of
> range\nTraceback (most recent call last):\n File
> \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py\",
> line 420, in <module>\n NameNode().execute()\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py\",
> line 280, in execute\n method(env)\n File
> \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py\",
> line 101, in start\n upgrade_suspended=params.upgrade_suspended,
> env=env)\n File
> \"/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py\", line
> 89, in thunk\n return fn(*args, **kwargs)\n File
> \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py\",
> line 184, in namenode\n if is_this_namenode_active() is False:\n File
> \"/usr/lib/python2.6/site-packages/resource_management/libraries/functions/decorator.py\",
> line 55, in wrapper\n return function(*args, **kwargs)\n File
> \"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py\",
> line 549, in is_this_namenode_active\n raise Fail(format(\"The NameNode
> {namenode_id} is not listed as Active or Standby,
> waiting...\"))\nresource_management.core.exceptions.Fail: The NameNode nn2 is
> not listed as Active or Standby, waiting...",
> {code}
> When the UI enables NN HA, it first starts NN1 and then NN2. At this stage both
> NNs are in 'standby' mode. The active node is elected only later (I believe
> when ZKFC is installed and started), so I think the second NN start should not
> fail when no active NameNode is found:
> 1st NN start:
> {code:title=nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal}
> 2016-08-24 23:08:20,037 - NameNode HA states: active_namenodes = [],
> standby_namenodes = [(u'nn1',
> 'nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070')], unknown_namenodes =
> [(u'nn2', 'nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070')]
> 2016-08-24 23:08:20,037 - No active NameNode was found after 5 retries. Will
> return current NameNode HA states
> 2016-08-24 23:08:20,037 - Skipping Safemode check due to the following
> conditions: HA: True, isActive: False, upgradeType: None
> 2016-08-24 23:08:20,037 - Skipping creation of HDFS directories since this is
> either not the Active NameNode or we did not wait for Safemode to finish.
> Command completed successfully!
> {code}
> 2nd NN start:
> {code:title=nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal}
> 2016-08-24 23:10:51,011 - NameNode HA states: active_namenodes = [],
> standby_namenodes = [(u'nn1',
> 'nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070'), (u'nn2',
> 'nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070')], unknown_namenodes =
> []
> 2016-08-24 23:10:51,012 - No active NameNode was found after 5 retries. Will
> return current NameNode HA states
> Command failed after 1 tries
> {code}
> Since the 2nd NN start failed, the wizard does not continue with installing
> ZKFC and the rest of the steps.
>
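The behavior proposed in the description can be sketched roughly as follows. This is only an illustration and not the contents of AMBARI-18262.patch: the function name `namenode_start_ok` is hypothetical, and the input lists simply mirror the (id, address) pairs shown in the "NameNode HA states" log lines above.

```python
def namenode_start_ok(namenode_id, active_namenodes, standby_namenodes):
    """Hypothetical check: treat a NameNode start as successful when the
    NameNode is reported as either Active or Standby, even if no Active
    NameNode exists yet (assumed here to be elected later, per the report).

    Each list holds (namenode_id, address) pairs, as in the log output.
    """
    known_ids = set(nn_id for nn_id, _ in active_namenodes)
    known_ids.update(nn_id for nn_id, _ in standby_namenodes)
    return namenode_id in known_ids


# States copied from the second NN start log: both NameNodes are standby.
standby = [
    ("nn1", "nat-sp12-rnqs-amb-views-ha-7-5.openstacklocal:50070"),
    ("nn2", "nat-sp12-rnqs-amb-views-ha-7-3.openstacklocal:50070"),
]
second_start_ok = namenode_start_ok("nn2", [], standby)

# States from the first NN start log: nn2 is still unknown at that point.
nn2_ok_before_start = namenode_start_ok("nn2", [], standby[:1])
```

Under this sketch, a NameNode that is still listed as unknown is the only case that counts as a failed start; an empty list of active NameNodes on its own does not.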
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)