Andrew Onischuk created AMBARI-24201:
----------------------------------------

             Summary: Command reschedule does not work causing blueprint 
deployments to timeout  
                 Key: AMBARI-24201
                 URL: https://issues.apache.org/jira/browse/AMBARI-24201
             Project: Ambari
          Issue Type: Bug
            Reporter: Andrew Onischuk
            Assignee: Andrew Onischuk
             Fix For: 2.7.0
         Attachments: AMBARI-24201.patch, AMBARI-24201.patch

During a stage timeout or a delivery failure in a blueprint install, the
server usually reschedules the running command by sending a cancel command
along with the repeated execution command.

The bug is that the agent cancels the very command that is supposed to be
newly scheduled.
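
The failure mode can be illustrated with a minimal sketch. This is not the
Ambari agent code: ToyActionQueue, handle_reschedule and the
"executionCommands" key are hypothetical names, and the assumption that a
cancel is remembered by taskId is only one plausible reading of the log
below. The "cancelCommands" and "target_task_id" keys are taken from the log.

```python
from collections import deque


class ToyActionQueue:
    """Hypothetical stand-in for an agent-side command queue (not Ambari code)."""

    def __init__(self):
        self.queue = deque()
        # Assumption: a cancel is remembered by taskId, even if no process is found.
        self.canceled_task_ids = set()

    def cancel(self, task_id):
        # Drop any queued command with this taskId and remember the cancel.
        self.queue = deque(c for c in self.queue if c["taskId"] != task_id)
        self.canceled_task_ids.add(task_id)

    def put(self, command):
        self.queue.append(command)

    def process_all(self):
        results = []
        while self.queue:
            command = self.queue.popleft()
            if command["taskId"] in self.canceled_task_ids:
                # The rescheduled command reuses the taskId, so it looks canceled.
                results.append((command["taskId"], "CANCELED"))
            else:
                results.append((command["taskId"], "EXECUTED"))
        return results


def handle_reschedule(action_queue, message):
    # The cancel and the repeated execution command arrive in one message.
    for cancel in message["cancelCommands"]:
        action_queue.cancel(cancel["target_task_id"])
    for command in message["executionCommands"]:  # hypothetical key name
        action_queue.put(command)


aq = ToyActionQueue()
message = {
    "cancelCommands": [{"commandType": "CANCEL_COMMAND",
                        "target_task_id": 145,
                        "reason": "Stage timeout"}],
    "executionCommands": [{"taskId": 145, "role": "ZOOKEEPER_CLIENT"}],
}
handle_reschedule(aq, message)
outcome = aq.process_all()
print(outcome)  # [(145, 'CANCELED')]
```

In this sketch the rescheduled command is never executed because it shares
its taskId with the cancel that preceded it.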

    
    
    2018-06-27 01:34:58,105  WARN [agent-message-retry-0] MessageEmitter:255 - Reschedule execution command emitting, retry: 1, messageId: 19

    ..., u'cancelCommands': [{u'commandType': u'CANCEL_COMMAND', u'target_task_id': 145, u'reason': u'Stage timeout'}]}}, u'requiredConfigTimestamp': 1530060845474}
    INFO 2018-06-27 01:34:58,121 ActionQueue.py:115 - Canceling command with taskId = 145
    INFO 2018-06-27 01:34:58,121 ActionQueue.py:134 - Canceling EXECUTION_COMMAND for service ZOOKEEPER and role ZOOKEEPER_CLIENT with taskId 145
    WARNING 2018-06-27 01:34:58,121 CustomServiceOrchestrator.py:129 - Unable to find process associated with taskId = 145
    INFO 2018-06-27 01:34:58,122 ActionQueue.py:103 - Adding EXECUTION_COMMAND for role ZOOKEEPER_CLIENT for service ZOOKEEPER of cluster_id 2 to the queue.
    INFO 2018-06-27 01:34:58,122 security.py:135 - Event to server at /reports/responses (correlation_id=870): {'status': 'OK', 'messageId': '19'}
    INFO 2018-06-27 01:34:58,142 __init__.py:57 - Event from server at /user/ (correlation_id=870): {u'status': u'OK'}
    INFO 2018-06-27 01:34:59,293 ActionQueue.py:238 - Executing command with id = 10-0, taskId = 145 for role = ZOOKEEPER_CLIENT of cluster_id 2.
    INFO 2018-06-27 01:34:59,294 security.py:135 - Event to server at /reports/commands_status (correlation_id=871): {'clusters': {u'2': [{'status': 'IN_PROGRESS', 'taskId': 145, 'tmpout': '/var/lib/ambari-agent/data/output-145.txt', 'roleCommand': u'INSTALL', 'structuredOut': '/var/lib/ambari-agent/data/structured-out-145.json', 'clusterId': u'2', 'serviceName': u'ZOOKEEPER', 'role': u'ZOOKEEPER_CLIENT', 'actionId': u'10-0', 'tmperr': '/var/lib/ambari-agent/data/errors-145.txt'}]}}
    INFO 2018-06-27 01:34:59,295 ActionQueue.py:279 - Command execution metadata - taskId = 145, retry enabled = True, max retry duration (sec) = 1200, log_output = True
    INFO 2018-06-27 01:34:59,296 ActionQueue.py:285 - Command with taskId = 145 canceled
    ERROR 2018-06-27 01:34:59,296 ActionQueue.py:221 - Exception while processing EXECUTION_COMMAND command
    Traceback (most recent call last):
      File "/usr/lib/ambari-agent/lib/ambari_agent/ActionQueue.py", line 214, in process_command
        self.execute_command(command)
      File "/usr/lib/ambari-agent/lib/ambari_agent/ActionQueue.py", line 354, in execute_command
        commandresult['stdout'] += '\n\nCommand completed successfully!\n' if status == self.COMPLETED_STATUS else '\n\nCommand failed after ' + str(numAttempts) + ' tries\n'
    UnboundLocalError: local variable 'commandresult' referenced before assignment
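
The UnboundLocalError at the end of the trace is the kind of error Python
raises when a retry loop exits before its result variable is ever assigned.
The sketch below reproduces that pattern in isolation. The function names
and the loop structure are assumptions made for illustration and do not
claim to match execute_command in ActionQueue.py. The guarded variant is
one possible way to avoid the crash, not necessarily what the attached
patch does.

```python
def run_with_retries(attempts, is_canceled):
    """Hypothetical retry loop that crashes if canceled before the first run."""
    num_attempts = 0
    while num_attempts < attempts:
        num_attempts += 1
        if is_canceled():
            break  # leaves the loop before command_result is ever assigned
        command_result = {"stdout": "attempt output", "exitcode": 0}
        break
    # Raises UnboundLocalError when the loop was left via the cancel branch.
    command_result["stdout"] += "\n\nCommand completed successfully!\n"
    return command_result


def run_with_retries_guarded(attempts, is_canceled):
    """Same loop, but the result is initialized and the cancel case is handled."""
    command_result = None
    num_attempts = 0
    while num_attempts < attempts:
        num_attempts += 1
        if is_canceled():
            break
        command_result = {"stdout": "attempt output", "exitcode": 0}
        break
    if command_result is None:
        # Nothing ran, so report the cancel instead of touching a missing result.
        return {"stdout": "", "exitcode": None, "status": "CANCELED"}
    command_result["stdout"] += "\n\nCommand completed successfully!\n"
    command_result["status"] = "COMPLETED"
    return command_result


try:
    run_with_retries(3, lambda: True)
    crashed = False
except UnboundLocalError:
    crashed = True

guarded = run_with_retries_guarded(3, lambda: True)
print(crashed, guarded["status"])
```

Guarding the result only hides the symptom. The rescheduled command would
still be reported as canceled unless the cancel is prevented from applying
to the newly scheduled command.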
    





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
