[ 
https://issues.apache.org/jira/browse/MESOS-7872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121387#comment-16121387
 ] 

Alexander Rukletsov commented on MESOS-7872:
--------------------------------------------

I've tried to reproduce this issue using a slightly modified 
{{no-executor-framework}}. Here is the output I get:
{noformat}
alex@alexr: ~/Projects/mesos/build/default $ ./src/no-executor-framework 
--master=127.0.0.1:5050
I0810 11:55:46.766144 1993596928 sched.cpp:232] Version: 1.4.0
I0810 11:55:46.766348 1993596928 sched.cpp:2090] Awaiting latch
I0810 11:55:46.771299 3211264 sched.cpp:336] New master detected at 
master@127.0.0.1:5050
I0810 11:55:46.774588 3211264 sched.cpp:352] No credentials provided. 
Attempting to register without authentication
I0810 11:55:46.792697 2674688 sched.cpp:1187] Got error ''FrameworkInfo.role' 
is not a valid role: Role '/test/rt' cannot start with a slash'
I0810 11:55:46.792721 2674688 sched.cpp:2055] Asked to abort the driver
E0810 11:55:46.792738 2674688 no_executor_framework.cpp:216] 
'FrameworkInfo.role' is not a valid role: Role '/test/rt' cannot start with a 
slash
I0810 11:55:46.792752 2674688 sched.cpp:1233] Aborting framework 
E0810 11:55:46.792788 4820992 process.cpp:2584] Failed to shutdown socket with 
fd 9, address 192.168.1.113:56500: Socket is not connected
I0810 11:55:46.792866 1993596928 sched.cpp:2092] Latch is triggered
I0810 11:55:46.792881 1993596928 sched.cpp:2021] Asked to stop the driver
{noformat}
If I remove 
[{{driver->stop}}|https://github.com/apache/mesos/blob/2cea83653afcf6d7470242379809645bfe009016/src/examples/no_executor_framework.cpp#L398],
 the scheduler exits anyway:
{noformat}
alex@alexr: ~/Projects/mesos/build/default $ ./src/no-executor-framework 
--master=127.0.0.1:5050
I0810 12:00:46.115882 1993596928 sched.cpp:232] Version: 1.4.0
I0810 12:00:46.116058 1993596928 sched.cpp:2090] Awaiting latch
I0810 12:00:46.118584 2674688 sched.cpp:336] New master detected at 
master@127.0.0.1:5050
I0810 12:00:46.118834 2674688 sched.cpp:352] No credentials provided. 
Attempting to register without authentication
I0810 12:00:46.120816 4284416 sched.cpp:1187] Got error ''FrameworkInfo.role' 
is not a valid role: Role '/test/role' cannot start with a slash'
I0810 12:00:46.120842 4284416 sched.cpp:2055] Asked to abort the driver
E0810 12:00:46.120847 4820992 process.cpp:2584] Failed to shutdown socket with 
fd 9, address 192.168.1.113:57081: Socket is not connected
E0810 12:00:46.120869 4284416 no_executor_framework.cpp:216] 
'FrameworkInfo.role' is not a valid role: Role '/test/role' cannot start with a 
slash
I0810 12:00:46.120895 4284416 sched.cpp:1233] Aborting framework 
I0810 12:00:46.121004 1993596928 sched.cpp:2092] Latch is triggered
{noformat}
Can you share the code of your scheduler, especially the part where you create 
and wait for the driver?
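
To make clear which part I mean, here is a toy model of the create-and-wait pattern. It is not Mesos code: {{ToyDriver}}, {{rejectRegistration}} and {{simulateRejectedRegistration}} are names made up for this sketch, and the latch is just a condition variable. It only assumes what the logs above suggest, namely that the error callback runs, the driver is aborted, and whoever waits on the driver is then released.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Simplified stand-in for a driver status; NOT the Mesos enum.
enum class Status { NOT_STARTED, RUNNING, ABORTED, STOPPED };

// Hypothetical toy driver: models only the "wait on a latch" behaviour.
class ToyDriver {
public:
  explicit ToyDriver(std::function<void(const std::string&)> onError)
    : onError_(std::move(onError)) {}

  // Marks the driver as running (a real driver would also try to register).
  Status start() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (status_ == Status::NOT_STARTED) {
      status_ = Status::RUNNING;
    }
    return status_;
  }

  // Blocks until the driver leaves the RUNNING state.
  Status join() {
    std::unique_lock<std::mutex> lock(mutex_);
    latch_.wait(lock, [this] { return status_ != Status::RUNNING; });
    return status_;
  }

  Status run() {
    start();
    return join();
  }

  // Models a rejected registration: report the error, then abort.
  void rejectRegistration(const std::string& message) {
    onError_(message);
    finish(Status::ABORTED);
  }

  void stop() { finish(Status::STOPPED); }

private:
  // Leaves RUNNING at most once and wakes up everyone blocked in join().
  void finish(Status next) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (status_ != Status::RUNNING) {
        return;
      }
      status_ = next;
    }
    latch_.notify_all();
  }

  std::function<void(const std::string&)> onError_;
  std::mutex mutex_;
  std::condition_variable latch_;
  Status status_ = Status::NOT_STARTED;
};

// Starts the toy driver, lets a second thread play the rejecting master,
// and returns the status seen by the waiting thread.
Status simulateRejectedRegistration(std::string* errorSeen) {
  ToyDriver driver(
      [errorSeen](const std::string& message) { *errorSeen = message; });
  driver.start();

  std::thread master([&driver] {
    driver.rejectRegistration("Role '/test/role' cannot start with a slash");
  });

  Status result = driver.join();  // Released by the abort, not by stop().
  master.join();

  driver.stop();  // No-op in this model: the driver already left RUNNING.
  return result;
}
```

In this model the waiting thread returns with {{Status::ABORTED}} whether or not {{stop()}} is called afterwards. If your scheduler waits on something other than the driver itself (its own latch, a thread pool, a loop that never looks at the driver status), that would be the first place I'd look.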

> Scheduler hang when registration fails (due to bad role)
> --------------------------------------------------------
>
>                 Key: MESOS-7872
>                 URL: https://issues.apache.org/jira/browse/MESOS-7872
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Till Toenshoff
>              Labels: framework, scheduler
>
> I'm finding that if framework registration fails, the mesos driver client 
> will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' 
> is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - SchedulerDriver.run() should have exited with a failed Proto.Status of some 
> form
> - Scheduler.error() should have been invoked when the "Got error" occurred
>
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework 
> info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the 
> failed registration. From all appearances it looks like the Scheduler 
> implementation isn't invoked at all
>
> I'd guess that because this failure happens before framework registration, 
> there's some error handling that isn't fully initialized at this point.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
