[ https://issues.apache.org/jira/browse/AMBARI-25604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Onischuk updated AMBARI-25604:
-------------------------------------
    Description:
Since AMBARI-23660, blueprint deploy does not rely on the topology cache, so the correct topology is sent with the command. However, the topology from the topology event can be wrong, as per AMBARI-23660.

The problem occurs when the agent still tries to process the broken topology from the event. The agent needs to handle this failure with a warning. Currently it just fails the whole command.

{code:java}ERROR 2020-12-10 06:30:09,350 CustomServiceOrchestrator.py:459 - Caught an exception while executing custom service command: <type 'exceptions.KeyError'>: 10; 10
Traceback (most recent call last):
  File "/usr/lib/ambari-agent/lib/ambari_agent/CustomServiceOrchestrator.py", line 324, in runCommand
    command = self.generate_command(command_header)
  File "/usr/lib/ambari-agent/lib/ambari_agent/CustomServiceOrchestrator.py", line 507, in generate_command
    command_dict = self.configuration_builder.get_configuration(cluster_id, service_name, component_name, required_config_timestamp)
  File "/usr/lib/ambari-agent/lib/ambari_agent/ConfigurationBuilder.py", line 43, in get_configuration
    'clusterHostInfo': self.topology_cache.get_cluster_host_info(cluster_id),
  File "/usr/lib/ambari-agent/lib/ambari_agent/Utils.py", line 230, in newFunction
    return f(*args, **kw)
  File "/usr/lib/ambari-agent/lib/ambari_agent/ClusterTopologyCache.py", line 112, in get_cluster_host_info
    hostnames = [self.hosts_to_id[cluster_id][host_id].hostName for host_id in component_dict.hostIds]
KeyError: 10{code}

  was:
Since AMBARI-23660, blueprint deploy does not rely on the topology cache, so the correct topology is sent with the command. However, the topology from the topology event can be wrong, as per AMBARI-23660.

The problem occurs when the agent still tries to process the broken topology from the event. The agent needs to handle this failure with a warning.
{code:java}ERROR 2020-12-10 06:30:09,350 CustomServiceOrchestrator.py:459 - Caught an exception while executing custom service command: <type 'exceptions.KeyError'>: 10; 10
Traceback (most recent call last):
  File "/usr/lib/ambari-agent/lib/ambari_agent/CustomServiceOrchestrator.py", line 324, in runCommand
    command = self.generate_command(command_header)
  File "/usr/lib/ambari-agent/lib/ambari_agent/CustomServiceOrchestrator.py", line 507, in generate_command
    command_dict = self.configuration_builder.get_configuration(cluster_id, service_name, component_name, required_config_timestamp)
  File "/usr/lib/ambari-agent/lib/ambari_agent/ConfigurationBuilder.py", line 43, in get_configuration
    'clusterHostInfo': self.topology_cache.get_cluster_host_info(cluster_id),
  File "/usr/lib/ambari-agent/lib/ambari_agent/Utils.py", line 230, in newFunction
    return f(*args, **kw)
  File "/usr/lib/ambari-agent/lib/ambari_agent/ClusterTopologyCache.py", line 112, in get_cluster_host_info
    hostnames = [self.hosts_to_id[cluster_id][host_id].hostName for host_id in component_dict.hostIds]
KeyError: 10{code}

> During blueprint deploy tasks sometimes fail due to KeyError on large clusters
> ------------------------------------------------------------------------------
>
>                 Key: AMBARI-25604
>                 URL: https://issues.apache.org/jira/browse/AMBARI-25604
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Andrew Onischuk
>            Assignee: Andrew Onischuk
>            Priority: Major
>             Fix For: 2.7.6
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since AMBARI-23660, blueprint deploy does not rely on the topology cache, so
> the correct topology is sent with the command. However, the topology from the
> topology event can be wrong, as per AMBARI-23660.
> The problem occurs when the agent still tries to process the broken topology
> from the event. The agent needs to handle this failure with a warning.
> Currently it just fails the whole command.
> {code:java}ERROR 2020-12-10 06:30:09,350 CustomServiceOrchestrator.py:459 - Caught an exception while executing custom service command: <type 'exceptions.KeyError'>: 10; 10
> Traceback (most recent call last):
>   File "/usr/lib/ambari-agent/lib/ambari_agent/CustomServiceOrchestrator.py", line 324, in runCommand
>     command = self.generate_command(command_header)
>   File "/usr/lib/ambari-agent/lib/ambari_agent/CustomServiceOrchestrator.py", line 507, in generate_command
>     command_dict = self.configuration_builder.get_configuration(cluster_id, service_name, component_name, required_config_timestamp)
>   File "/usr/lib/ambari-agent/lib/ambari_agent/ConfigurationBuilder.py", line 43, in get_configuration
>     'clusterHostInfo': self.topology_cache.get_cluster_host_info(cluster_id),
>   File "/usr/lib/ambari-agent/lib/ambari_agent/Utils.py", line 230, in newFunction
>     return f(*args, **kw)
>   File "/usr/lib/ambari-agent/lib/ambari_agent/ClusterTopologyCache.py", line 112, in get_cluster_host_info
>     hostnames = [self.hosts_to_id[cluster_id][host_id].hostName for host_id in component_dict.hostIds]
> KeyError: 10{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
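
The handling described in the issue (warn on a broken topology instead of failing the whole command) could look roughly like the sketch below. This is not the actual Ambari patch: `resolve_hostnames`, `Host`, `Component` and `hosts_by_id` are hypothetical stand-ins for the agent's cached topology structures, and only the `hostName` and `hostIds` attribute names are taken from the traceback above. The idea is to look up each host id defensively and log the ids that are missing from the cache.

```python
import logging
from collections import namedtuple

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for the agent's cached topology objects.
Host = namedtuple("Host", ["hostName"])
Component = namedtuple("Component", ["componentName", "hostIds"])


def resolve_hostnames(hosts_by_id, component):
    """Return host names for a component, skipping host ids missing from the cache."""
    hostnames = []
    missing = []
    for host_id in component.hostIds:
        host = hosts_by_id.get(host_id)  # .get() avoids the KeyError from the traceback
        if host is None:
            missing.append(host_id)
        else:
            hostnames.append(host.hostName)
    if missing:
        # Warn instead of raising, so the command itself is not failed.
        logger.warning(
            "Topology for component %s references unknown host ids %s; skipping them",
            component.componentName, missing)
    return hostnames


# Example: host id 10 is referenced by the component but absent from the cache.
hosts_by_id = {1: Host("host-1.example.com"), 2: Host("host-2.example.com")}
component = Component("DATANODE", [1, 2, 10])
resolved = resolve_hostnames(hosts_by_id, component)
```

Skipping unknown ids is only reasonable here because, per the description, the correct topology is sent with the command during blueprint deploy, so the cached value is not the one the command relies on.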