[ https://issues.apache.org/jira/browse/MESOS-10088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017843#comment-17017843 ]
Charles commented on MESOS-10088:
---------------------------------

Something interesting is this message:

{noformat}
W0116 09:43:29.840143 128 slave.cpp:2135] Agent ignoring running task 'gus:random_error_eventually_succeeds:job-2:05cfcdc4-11e3-cea2-51ab-003c5030a74d' because it was intended for old agent 9886c82f-1c92-41d8-bc47-bfd614e669af-S4
{noformat}

If we look at the code where the message is generated (https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L2200) we see:

{noformat}
foreach (const TaskInfo& _task, tasks) {
  if (_task.slave_id() != info.id()) {
    LOG(WARNING) << "Agent " << info.id() << " ignoring running "
                 << taskOrTaskGroup(_task, taskGroup) << " because "
                 << "it was intended for old agent " << _task.slave_id();
    return;
  }
}
{noformat}

So it seems that {noformat}info.id(){noformat} is empty: the warning streams info.id() between "Agent " and " ignoring", and in the logged message there is nothing printed there.
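To double-check that hypothesis outside the agent, the comparison can be reproduced with the public protobuf types. The following is only a sketch, not part of the report, and assumes it is built and linked against a Mesos checkout so that <mesos/mesos.hpp> and <mesos/type_utils.hpp> are on the include path. It illustrates that a SlaveInfo whose id was never populated (presumably the state the agent is in after it drops the registration message for lack of a leading master, see the "Ignoring registration message ... it is not the expected master: None" line in the log below) can never compare equal to the SlaveID the master attached to the task, so the task is dropped without any status update being sent:

{noformat}
// Sketch only (not from the report): assumes a Mesos build providing the
// public headers below and the usual comparison operators for the protobuf
// types. It demonstrates the protobuf default-value behaviour behind the
// hypothesis: an unset SlaveInfo.id yields an empty SlaveID.

#include <iostream>

#include <mesos/mesos.hpp>       // mesos::SlaveInfo, mesos::SlaveID, mesos::TaskInfo
#include <mesos/type_utils.hpp>  // comparison operators for SlaveID etc.

int main()
{
  // An agent that never processed its SlaveRegisteredMessage still has a
  // default-constructed (empty) id.
  mesos::SlaveInfo info;

  // The task carries the id the master assigned during (re-)registration.
  mesos::TaskInfo task;
  task.mutable_slave_id()->set_value("9886c82f-1c92-41d8-bc47-bfd614e669af-S4");

  // Same check as in Slave::run() (src/slave/slave.cpp): the unset id can
  // never match, so the task would be dropped with no status update.
  if (task.slave_id() != info.id()) {
    std::cout << "agent id: '" << info.id().value() << "' (empty), "
              << "task slave id: '" << task.slave_id().value() << "' "
              << "=> task is ignored" << std::endl;
  }

  return 0;
}
{noformat}

If that is indeed what happens here, the agent never answers for the task, which would explain why the master keeps reporting it as TASK_STAGING in the reconciliations quoted at the end of the report.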
> task stuck in TASK_STAGING following agent suspended and resumed
> -----------------------------------------------------------------
>
>                 Key: MESOS-10088
>                 URL: https://issues.apache.org/jira/browse/MESOS-10088
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Charles
>            Priority: Major
>
> Hi,
>
> I've been testing Mesos in the context of a framework I'm developing.
> As part of the integration tests, I send SIGSTOP/SIGCONT randomly to processes, to simulate, for example, processes getting blocked because of memory pressure.
> While doing this, I encountered a situation where a task would remain in TASK_STAGING forever.
>
> Here's the log when the slave is resumed with SIGCONT:
>
> {noformat}
> 2020-01-16 09:43:29,797:45(0x7fbfd4ad9700):ZOO_ERROR@handle_socket_error_msg@1782: Socket [127.0.0.1:40125] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
> 2020-01-16 09:43:29,797:45(0x7fbfd4ad9700):ZOO_WARN@zookeeper_interest@1597: Exceeded deadline by 23153ms
> 2020-01-16 09:43:29,797:45(0x7fbfd4ad9700):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:33437]
> I0116 09:43:29.798094 124 group.cpp:452] Lost connection to ZooKeeper, attempting to reconnect ...
> I0116 09:43:29.798627 83 master.cpp:7083] Received register agent message from slave(1)@192.168.65.76:51553 (lhpcb008.london.quadraturecapital.com)
> I0116 09:43:29.798707 83 master.cpp:7262] Registering agent at slave(1)@192.168.65.76:51553 (lhpcb008.london.quadraturecapital.com) with id 9886c82f-1c92-41d8-bc47-bfd614e669af-S4
> I0116 09:43:29.798784 83 registrar.cpp:487] Applied 1 operations in 40710ns; attempting to update the registry
> I0116 09:43:29.798921 83 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 11
> I0116 09:43:29.799089 83 replica.cpp:541] Replica received write request for position 11 from __req_res__(54)@127.0.0.1:46799
> I0116 09:43:29.799271 158 replica.cpp:541] Replica received write request for position 11 from __req_res__(56)@127.0.0.1:46799
> 2020-01-16 09:43:29,799:45(0x7fbfd4ad9700):ZOO_ERROR@handle_socket_error_msg@1800: Socket [127.0.0.1:33437] zk retcode=-112, errno=116(Stale file handle): sessionId=0x16fadbc14f50003 has expired.
> I0116 09:43:29.799561 97 replica.cpp:541] Replica received write request for position 11 from __req_res__(55)@127.0.0.1:46799
> I0116 09:43:29.799762 129 group.cpp:511] ZooKeeper session expired
> I0116 09:43:29.799824 127 detector.cpp:152] Detected a new leader: None
> I0116 09:43:29.799873 127 slave.cpp:1344] Lost leading master
> I0116 09:43:29.799882 127 slave.cpp:1416] Detecting new master
> I0116 09:43:29.799893 127 task_status_update_manager.cpp:181] Pausing sending task status updates
> I0116 09:43:29.799898 127 status_update_manager_process.hpp:379] Pausing operation status update manager
> I0116 09:43:29.800115 80 replica.cpp:695] Replica received learned notice for position 11 from log-network(1)@127.0.0.1:46799
> I0116 09:43:29.800271 81 registrar.cpp:544] Successfully updated the registry in 1.467136ms
> I0116 09:43:29.800348 81 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 12
> I0116 09:43:29.800462 81 replica.cpp:541] Replica received write request for position 12 from __req_res__(57)@127.0.0.1:46799
> I0116 09:43:29.800529 82 hierarchical.cpp:955] Added agent 9886c82f-1c92-41d8-bc47-bfd614e669af-S4 (lhpcb008.london.quadraturecapital.com) with cpus:4; mem:30844; disk:4096; ports:[31000-32000] (offered or allocated: {})
> I0116 09:43:29.800549 84 master.cpp:7355] Registered agent 9886c82f-1c92-41d8-bc47-bfd614e669af-S4 at slave(1)@192.168.65.76:51553 (lhpcb008.london.quadraturecapital.com) with cpus:4; mem:30844; disk:4096; ports:[31000-32000]
> 2020-01-16 09:43:29,800:45(0x7fbfd72de700):ZOO_INFO@zookeeper_close@2579: Freeing zookeeper resources for sessionId=0x16fadbc14f50003
> I0116 09:43:29.800671 81 master.cpp:10409] Sending offers [ 9886c82f-1c92-41d8-bc47-bfd614e669af-O67 ] to framework 9886c82f-1c92-41d8-bc47-bfd614e669af-0000 (gus) at scheduler-54f808ba-f696-49b8-99c6-3d821ee8ef8f@192.168.65.76:37061
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@757: Client environment:host.name=lhpcb008
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@764: Client environment:os.name=Linux
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@765: Client environment:os.arch=4.15.0-72-generic
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@766: Client environment:os.version=#81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@774: Client environment:user.name=charles
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@782: Client environment:user.home=/home/charles
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@log_env@794: Client environment:user.dir=/scratch/charles/bazel_base/519e0914f77a57b1c958838821782354/sandbox/linux-sandbox/11837/execroot/__main__/bazel-out/k8-opt/bin/cpp/tests/integration_test_2.runfiles/__main__
> 2020-01-16 09:43:29,800:45(0x7fbfd62dc700):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:44283,127.0.0.1:33437,127.0.0.1:40125 sessionTimeout=10000 watcher=0x7fbfe5632340 sessionId=0 sessionPasswd=<null> context=0x7fbfcc002f18 flags=0
> W0116 09:43:29.801195 126 slave.cpp:1559] Ignoring registration message from master@127.0.0.1:46799 because it is not the expected master: None
> {noformat}
>
> Here's the task being received by the master:
>
> {noformat}
> I0116 09:43:29.839668 83 master.cpp:4287] Adding task gus:random_error_eventually_succeeds:job-2:05cfcdc4-11e3-cea2-51ab-003c5030a74d with resources mem(allocated: *):1024; cpus(allocated: *):1 of framework 9886c82f-1c92-41d8-bc47-bfd614e669af-0000 (gus) at scheduler-54f808ba-f696-49b8-99c6-3d821ee8ef8f@192.168.65.76:37061 on agent 9886c82f-1c92-41d8-bc47-bfd614e669af-S4 at slave(1)@192.168.65.76:51553 (lhpcb008.london.quadraturecapital.com)
> {noformat}
>
> It's forwarded to the slave:
>
> {noformat}
> I0116 09:43:29.839741 83 master.cpp:5714] Launching task gus:random_error_eventually_succeeds:job-2:05cfcdc4-11e3-cea2-51ab-003c5030a74d of framework 9886c82f-1c92-41d8-bc47-bfd614e669af-0000 (gus) at scheduler-54f808ba-f696-49b8-99c6-3d821ee8ef8f@192.168.65.76:37061 with resources [{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":1.0},"type":"SCALAR"}] on agent 9886c82f-1c92-41d8-bc47-bfd614e669af-S4 at slave(1)@192.168.65.76:51553 (lhpcb008.london.quadraturecapital.com) on new executor
> I0116 09:43:29.840129 128 slave.cpp:2130] Got assigned task 'gus:random_error_eventually_succeeds:job-2:05cfcdc4-11e3-cea2-51ab-003c5030a74d' for framework 9886c82f-1c92-41d8-bc47-bfd614e669af-0000
> {noformat}
>
> But decides to ignore it:
>
> {noformat}
> W0116 09:43:29.840143 128 slave.cpp:2135] Agent ignoring running task 'gus:random_error_eventually_succeeds:job-2:05cfcdc4-11e3-cea2-51ab-003c5030a74d' because it was intended for old agent 9886c82f-1c92-41d8-bc47-bfd614e669af-S4
> {noformat}
>
> After that, all reconciliations returned the task as TASK_STAGING:
>
> {noformat}
> mesos_task_id {
>   value: "gus:random_error_eventually_succeeds:job-2:05cfcdc4-11e3-cea2-51ab-003c5030a74d"
> }
> state: TASK_STAGING
> source: "SOURCE_MASTER"
> reason: "REASON_RECONCILIATION"
> message: "Reconciliation: Latest task state"
> {noformat}
>
> Unfortunately this test run has the logs from the various processes (Mesos masters and agents) intermingled so I can't easily extract them, although if necessary I can probably try and reproduce it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)