[ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7872:
---------------------------------------
    Comment: was deleted

(was: The problem is likely in the HTTP adapter. The [Java side of the 
adapter|https://github.com/mesosphere/mesos-http-adapter/blob/master/src/main/java/com/mesosphere/mesos/http-adapter/MesosToSchedulerDriverAdapter.java]
 sends a {{SUBSCRIBE}} request that never completes, due to an error. That 
error is transferred to the [C++ side of the 
adapter|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L550],
 but is not forwarded to the Java side, because {{SUBSCRIBED}} [has not 
succeeded|https://github.com/apache/mesos/blob/364abfc1bed8543b984ebd3712047b5ed8a109d2/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L699]
 yet! Deadlock.

A possible fix would be to allow {{ERROR}} events to go through even if the 
scheduler has not subscribed yet.)
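
The gating described in the deleted comment can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: {{EventGate}}, {{EventType}} and the {{forwardErrorsBeforeSubscribe}} flag are hypothetical names invented for illustration and do not appear in the Mesos or adapter code. It assumes that events are dropped until a {{SUBSCRIBED}} event has been seen, and shows how letting {{ERROR}} through would surface the failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the event gating described in the comment above.
// None of these names come from the Mesos or mesos-http-adapter code bases.
final class EventGate {
    enum EventType { SUBSCRIBED, OFFERS, ERROR }

    // When true, ERROR events are delivered even before subscription
    // (the fix suggested in the comment).
    private final boolean forwardErrorsBeforeSubscribe;
    private boolean subscribed = false;
    private final List<EventType> delivered = new ArrayList<>();

    EventGate(boolean forwardErrorsBeforeSubscribe) {
        this.forwardErrorsBeforeSubscribe = forwardErrorsBeforeSubscribe;
    }

    // Models the native side handing an event to the Java side.
    void receive(EventType type) {
        if (type == EventType.SUBSCRIBED) {
            subscribed = true;
        }
        boolean allowed = subscribed
                || (forwardErrorsBeforeSubscribe && type == EventType.ERROR);
        if (allowed) {
            delivered.add(type);
        }
        // Otherwise the event is dropped. If it was the ERROR explaining why
        // subscription failed, the scheduler waits forever for a SUBSCRIBED
        // event that never arrives.
    }

    // Events that actually reached the (modelled) scheduler.
    List<EventType> delivered() {
        return delivered;
    }
}
```

With {{new EventGate(false)}}, an {{ERROR}} received before {{SUBSCRIBED}} is never delivered, which matches the hang reported in this issue. With {{new EventGate(true)}}, the same {{ERROR}} is delivered and the scheduler has a chance to react.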

> Scheduler hang when registration fails (due to bad role)
> --------------------------------------------------------
>
>                 Key: MESOS-7872
>                 URL: https://issues.apache.org/jira/browse/MESOS-7872
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 1.4.0
>            Reporter: Till Toenshoff
>              Labels: framework, reliability, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Protos.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
