[
https://issues.apache.org/jira/browse/SLIDER-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934575#comment-14934575
]
Youjie Chen commented on SLIDER-939:
------------------------------------
Update on this:
We did more testing with the Slider HBase app, using the same scenario as
before: 6 nodes, 1 for the HBase master and 5 for HBase regionservers (memory
set to 51% of each node to ensure one regionserver per node). We performed the
following steps as the yarn user and as the hbase user respectively:
1) flex the HBase regionservers up from 5 to 6; one outstanding request stays
pending and some nodes get reserved for the container request, as expected.
2) flex back down to 5 regionservers.
The two users then show different behavior:
1) if we deploy and run the HBase instance as the yarn user
The outstanding request is cancelled and the reserved nodes are unreserved on
the YARN side; the YARN log shows:
(AppSchedulingInfo.java:updateResourceRequests(148)) - update:
application=application_1443174197731_0003 request={Priority: 1073741826,
Capability: <memory:51200, vCores:16>, # Containers: 0, Location: *, Relax
Locality: true}
2) if we deploy and run the HBase instance as the hbase user
The reserved nodes are not unreserved, and the YARN log shows:
scheduler.AppSchedulingInfo
(AppSchedulingInfo.java:updateResourceRequests(148)) - update:
application=application_1443170550563_0003 request={Priority: 1073741826,
Capability: <memory:51200, vCores:16>, # Containers: 5, Location: *, Relax
Locality: true}
Comparing the two YARN log lines shows the difference: in the first case (yarn
user), the resource update drops the container count to 0, so the reserved
nodes are unreserved and new requests can be accepted. In the second case
(hbase user), the container count stays at 5, so the nodes remain reserved and
block container requests from other jobs.
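To make the difference concrete, here is a toy model of the bookkeeping the two log lines suggest. This is illustrative Python only, not actual YARN code; the class and method names are hypothetical stand-ins for the RM-side state that `AppSchedulingInfo.updateResourceRequests` maintains.

```python
# Toy model of the RM-side state described above (illustrative only;
# not actual YARN code -- class and method names are hypothetical).
class ToyAppSchedulingInfo:
    def __init__(self):
        self.num_containers = 0   # outstanding requests at this priority
        self.reserved_nodes = set()

    def update_resource_request(self, num_containers):
        # When the AM's update carries "# Containers: 0", the reservations
        # are released; a non-zero count keeps them in place.
        self.num_containers = num_containers
        if self.num_containers == 0:
            self.reserved_nodes.clear()  # unreserve, unblocking other jobs

    def reserve(self, node):
        if self.num_containers > 0:
            self.reserved_nodes.add(node)

# Flex up: one unsatisfiable request reserves capacity on a node.
info = ToyAppSchedulingInfo()
info.update_resource_request(1)
info.reserve("test.example.com:45454")

# The yarn-user case: flex down sends an update with 0 containers,
# so the reservation is released.
info.update_resource_request(0)
assert info.reserved_nodes == set()

# The hbase-user case: the update still carries a non-zero count
# (5 in the log above), so the node stays reserved and blocks other jobs.
buggy = ToyAppSchedulingInfo()
buggy.update_resource_request(5)
buggy.reserve("test.example.com:45454")
buggy.update_resource_request(5)  # count never reaches 0
assert buggy.reserved_nodes == {"test.example.com:45454"}
```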
It looks like this issue is related to the user that deploys and runs the
instance. Any ideas on this? Thanks!
> flex down does not cancel the outstanding request
> -------------------------------------------------
>
> Key: SLIDER-939
> URL: https://issues.apache.org/jira/browse/SLIDER-939
> Project: Slider
> Issue Type: Bug
> Components: core
> Affects Versions: Slider 0.80
> Environment: Hadoop 2.7.1
> Slider 0.80.0
> Reporter: Youjie Chen
> Assignee: Steve Loughran
> Labels: patch
> Fix For: Slider 0.81
>
>
> I run a Slider app on a 6-node cluster. To ensure there is only one
> component (worker) instance on each node, I set yarn.memory to 51% of the
> total memory.
> Then I flexed up to 7 workers; one worker request stays outstanding and will
> never be satisfied, which is expected.
> Then I flexed down back to 6 workers, and container requests for any other
> job were blocked even though there was plenty of memory/cores available for
> them. The RM log continuously shows:
> capacity.CapacityScheduler
> (CapacityScheduler.java:allocateContainersToNode(1240)) - Skipping scheduling
> since node test.example.com:45454 is reserved by application
> appattempt_1442384698868_0008_000001
> It seems the outstanding request is not actually cancelled in the container
> request queue but keeps being retried.
> After I flexed down to 5 workers, the other blocked jobs could run.
> This is related to JIRA https://issues.apache.org/jira/browse/SLIDER-490
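The scheduler log quoted above ("Skipping scheduling since node ... is reserved") explains why unrelated jobs stall. A minimal sketch of that skip logic, as a toy Python model (not actual CapacityScheduler code; the function name is illustrative):

```python
# Toy sketch of why other jobs block while a stale reservation exists
# (illustrative only; not actual CapacityScheduler code).
def allocate_containers_to_node(node, reservations):
    # Mimics the behavior in CapacityScheduler.allocateContainersToNode:
    # a node holding a reservation is skipped entirely, so no other
    # application can place containers on it.
    if node in reservations:
        return "Skipping scheduling since node %s is reserved" % node
    return "node %s available for allocation" % node

# While the uncancelled request keeps the node reserved, scheduling skips it.
reservations = {"test.example.com:45454": "appattempt_1442384698868_0008_000001"}
msg = allocate_containers_to_node("test.example.com:45454", reservations)
assert msg.startswith("Skipping scheduling")

# Once the stale request is cancelled and the reservation released,
# the node becomes schedulable again for other jobs.
reservations.clear()
msg = allocate_containers_to_node("test.example.com:45454", reservations)
assert msg.endswith("available for allocation")
```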
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)