[
https://issues.apache.org/jira/browse/AURORA-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192533#comment-14192533
]
Bhuvan Arumugam commented on AURORA-898:
----------------------------------------
This issue is caused by the fix for AURORA-84, which changed the TaskConfig
thrift message to include a JobKey. The change is not backward compatible on
either the client or the scheduler. This causes 2 issues:
When using an old client against a new scheduler, job creation fails with the
following error:
{code}
INFO] Response from scheduler: INVALID_REQUEST (message: Job key
JobKey(role:null, environment:null, name:null) is invalid.)
{code}
The scheduler log indicates that role, environment and name in the JobKey
struct inside TaskConfig are null, although the top-level JobKey is populated:
{code}
I1031 21:27:27.073 THREAD362
org.apache.aurora.scheduler.thrift.aop.LoggingInterceptor.invoke:
createJob(JobConfiguration(key:JobKey(role:myrole, environment:prod,
name:myjob), owner:Identity(role:myrole, user:bhuvan), cronSchedule:null,
cronCollisionPolicy:KILL_EXISTING, taskConfig:TaskConfig(job:JobKey(role:null,
environment:null, name:null), owner:Identity(role:myrole, user:bhuvan),
environment:prod, jobName:myjob, isService:false, numCpus:1.0, ramMb:128,
diskMb:1024, priority:0, maxTaskFailures:1, production:false, constraints:[],
requestedPorts:[], taskLinks:{}, executorConfig:ExecutorConfig(name:BLANKED,
data:BLANKED), metadata:[]), instanceCount:1), null,
SessionKey(mechanism:UNAUTHENTICATED, data:50 D0 14 4C 71 0D 4C 80 80 4C 40))
{code}
When using a new client against an old scheduler, the job is created but stays
in ASSIGNED state before going to LOST state after 5 mins.
The fix is to upgrade both client and server at the same time. Is there a
better fix/workaround for this bug? Maybe fix the scheduler to be compatible
with older clients, i.e. don't default {{TaskConfig.JobKey}} to null if it's
not specified?
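To illustrate the compatibility idea, here is a minimal sketch of backfilling the missing key from the legacy fields that an old client still sends (role, environment and job name are all present in the log above). It is written in Python for brevity, not as scheduler code; the {{JobKey}} and {{TaskConfig}} classes are simplified stand-ins for the thrift structs, and {{backfill_job_key}}, {{owner_role}} and {{job_name}} are hypothetical names, not existing Aurora APIs.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class JobKey:
    """Simplified stand-in for the thrift JobKey struct."""
    role: Optional[str] = None
    environment: Optional[str] = None
    name: Optional[str] = None

    def is_set(self) -> bool:
        # A key is usable only when all three parts are non-empty.
        return all((self.role, self.environment, self.name))


@dataclass(frozen=True)
class TaskConfig:
    """Simplified stand-in for the thrift TaskConfig struct."""
    # Legacy fields that an old client still populates.
    owner_role: Optional[str] = None
    environment: Optional[str] = None
    job_name: Optional[str] = None
    # Field added by AURORA-84; an old client leaves it unset.
    job: Optional[JobKey] = None


def backfill_job_key(task: TaskConfig) -> TaskConfig:
    """Hypothetical helper: derive TaskConfig.job from the legacy fields
    when the client did not send a complete JobKey."""
    if task.job is not None and task.job.is_set():
        return task  # New client: nothing to do.
    derived = JobKey(task.owner_role, task.environment, task.job_name)
    return replace(task, job=derived)


# An old-client request, mirroring the values in the scheduler log above.
old_request = TaskConfig(owner_role="myrole", environment="prod",
                         job_name="myjob", job=JobKey())
print(backfill_job_key(old_request).job)
```

If the legacy fields are also missing, the derived key is still unset, so the existing validation would keep rejecting the request with INVALID_REQUEST.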
> unable to kill a job that is in ASSIGNED state
> ----------------------------------------------
>
> Key: AURORA-898
> URL: https://issues.apache.org/jira/browse/AURORA-898
> Project: Aurora
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 0.5.0
> Reporter: Bhuvan Arumugam
>
> We are unable to kill a job that's in ASSIGNED state. It's always
> reproducible, even with a hello world job.
> The {{aurora killall}} command gives up after 5 mins with this message:
> {code}
> .
> .
> DEBUG "POST /api HTTP/1.1" 200 None
> DEBUG] "POST /api HTTP/1.1" 200 None
> DEBUG] handle_response(): returning <Response [200]>
> DEBUG] Response from scheduler: OK (message: None)
> FATAL] Tasks were not killed in time.
> {code}