[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059332#comment-15059332 ] sandflee commented on YARN-1197: It seems a change that increases memory and decreases CPU cores is not supported.
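
One way to see why a request that raises memory while lowering CPU cores is awkward: under a componentwise "fits in" comparison it is neither a pure increase nor a pure decrease. The sketch below is only an illustration under that assumption. `Res`, `ChangeKind`, and `ResourceChange` are hypothetical names, not Hadoop classes, and the comment above does not confirm that YARN classifies requests this way.

```java
// Hypothetical stand-in for a container resource (memory in MB plus virtual cores).
final class Res {
    final long memoryMb;
    final int vcores;

    Res(long memoryMb, int vcores) {
        this.memoryMb = memoryMb;
        this.vcores = vcores;
    }
}

// Hypothetical classification of a resource change request.
enum ChangeKind { NONE, INCREASE, DECREASE, MIXED }

final class ResourceChange {
    // True when every dimension of 'smaller' is <= the same dimension of 'bigger'.
    static boolean fitsIn(Res smaller, Res bigger) {
        return smaller.memoryMb <= bigger.memoryMb && smaller.vcores <= bigger.vcores;
    }

    // Classifies the move from 'current' to 'target' using two fits-in checks.
    static ChangeKind classify(Res current, Res target) {
        boolean targetFitsInCurrent = fitsIn(target, current);
        boolean currentFitsInTarget = fitsIn(current, target);
        if (targetFitsInCurrent && currentFitsInTarget) {
            return ChangeKind.NONE;      // identical in every dimension
        }
        if (currentFitsInTarget) {
            return ChangeKind.INCREASE;  // no dimension shrinks
        }
        if (targetFitsInCurrent) {
            return ChangeKind.DECREASE;  // no dimension grows
        }
        return ChangeKind.MIXED;         // one dimension grows while another shrinks
    }
}
```

With these definitions, going from 2048 MB / 4 cores to 4096 MB / 2 cores is classified as `MIXED`, which neither an increase-only nor a decrease-only path would accept.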

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059362#comment-15059362 ] sandflee commented on YARN-1197: Got it, thanks, [~leftnoteasy]! > Support changing resourc

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059777#comment-15059777 ] sandflee commented on YARN-4138: 1, use Resources.fitsin(targetResource, lastConfirmedReso

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061346#comment-15061346 ] sandflee commented on YARN-1197: user application(long running) are running on our yarn pla

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061348#comment-15061348 ] sandflee commented on YARN-1197: user application(long running) are running on our yarn pla

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061349#comment-15061349 ] sandflee commented on YARN-1197: user application(long running) are running on our yarn pla

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061347#comment-15061347 ] sandflee commented on YARN-1197: user application(long running) are running on our yarn pla

[jira] [Commented] (YARN-1197) Support changing resources of an allocated container

2015-12-16 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15061577#comment-15061577 ] sandflee commented on YARN-1197: seems complicated for AM to do this, especially we added

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-17 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063576#comment-15063576 ] sandflee commented on YARN-4138: {quote} We should not update lastConfirmedResource in this

[jira] [Created] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-21 Thread sandflee (JIRA)
sandflee created YARN-4495: -- Summary: add a way to tell AM container increase/decrease request is invalid Key: YARN-4495 URL: https://issues.apache.org/jira/browse/YARN-4495 Project: Hadoop YARN Is

[jira] [Updated] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-21 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4495: --- Description: now RM may pass InvalidResourceRequestException to AM or just ignore the change request, the forme

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-24 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071403#comment-15071403 ] sandflee commented on YARN-4138: Hi, [~mding], sorry for the late reply, 1, If AM send tok

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071475#comment-15071475 ] sandflee commented on YARN-4138: + decreaseRequest = new SchedContainerChangeRequest(

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071492#comment-15071492 ] sandflee commented on YARN-4138: there seems a deadlock, in allocate and rollback logic we

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072314#comment-15072314 ] sandflee commented on YARN-4138: Hi, [~mding], I'll open a new jira to track this, not to d

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072387#comment-15072387 ] sandflee commented on YARN-4495: RM will pass InvalidResourceRequestException to AM in belo

[jira] [Created] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-27 Thread sandflee (JIRA)
sandflee created YARN-4519: -- Summary: potential deadlock of CapacityScheduler between decrease container and assign containers Key: YARN-4519 URL: https://issues.apache.org/jira/browse/YARN-4519 Project: Had

[jira] [Updated] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4519: --- Description: In CapacityScheduler.allocate() , first get FiCaSchedulerApp sync lock, and may be get CapacitySc
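
The description above is truncated, so the exact lock sequence on each path is not shown here. As a general illustration of how lock-order deadlocks between two code paths are usually avoided, the sketch below has both paths acquire two monitors in the same fixed order. Every name in it (`LockOrderDemo`, `schedulerLock`, `appLock`, `allocatePath`, `assignPath`) is hypothetical, and it is not the patch attached to this issue.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: two monitors stand in for a scheduler-level and an app-level lock.
final class LockOrderDemo {
    private final Object schedulerLock = new Object();
    private final Object appLock = new Object();
    private final AtomicInteger completed = new AtomicInteger();

    // Path 1: takes schedulerLock first, then appLock.
    void allocatePath() {
        synchronized (schedulerLock) {
            synchronized (appLock) {
                completed.incrementAndGet();
            }
        }
    }

    // Path 2: uses the same order, so neither thread can hold one lock
    // while waiting for a lock the other thread already holds.
    void assignPath() {
        synchronized (schedulerLock) {
            synchronized (appLock) {
                completed.incrementAndGet();
            }
        }
    }

    // Runs both paths concurrently and returns how many critical sections completed.
    static int runBoth(int iterations) {
        LockOrderDemo demo = new LockOrderDemo();
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                demo.allocatePath();
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                demo.assignPath();
            }
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return demo.completed.get();
    }
}
```

If one path instead took `appLock` first and `schedulerLock` second, the two threads could each hold one monitor and wait forever for the other, which is the general shape of the problem the issue title describes.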

[jira] [Created] (YARN-4520) FinishAppEvent is leaked in leveldb if no app's container running on this node

2015-12-27 Thread sandflee (JIRA)
sandflee created YARN-4520: -- Summary: FinishAppEvent is leaked in leveldb if no app's container running on this node Key: YARN-4520 URL: https://issues.apache.org/jira/browse/YARN-4520 Project: Hadoop YARN

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2015-12-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072430#comment-15072430 ] sandflee commented on YARN-4138: when release containers , we didn't hold SchedulerApp's lo

[jira] [Updated] (YARN-4520) FinishAppEvent is leaked in leveldb if no app's container running on this node

2015-12-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4520: --- Attachment: YARN-4520.01.patch > FinishAppEvent is leaked in leveldb if no app's container running on this node

[jira] [Updated] (YARN-4520) FinishAppEvent is leaked in leveldb if no app's container running on this node

2015-12-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4520: --- Attachment: YARN-4520.02.patch fix checkstyle errors > FinishAppEvent is leaked in leveldb if no app's contain

[jira] [Updated] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4495: --- Attachment: YARN-4495.01.patch just protocol change, add FailedResourceChange to AllocateResponse, represents

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073347#comment-15073347 ] sandflee commented on YARN-4495: Hi [~jianhe] [~wangda] [~mding] , do you think the change

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073400#comment-15073400 ] sandflee commented on YARN-4495: Thanks [~mding], [~wangda], yes this could simple the cod

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073472#comment-15073472 ] sandflee commented on YARN-4495: [~mding] [~wangda] one problem, seems hadoop rpc could onl

[jira] [Commented] (YARN-3328) There's no way to rebuild containers Managed by NMClientAsync If AM restart

2015-12-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073587#comment-15073587 ] sandflee commented on YARN-3328: we fix this problem by removing container state machine in

[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073622#comment-15073622 ] sandflee commented on YARN-4519: sorry, I don't understand 1, why should we put compute de

[jira] [Commented] (YARN-4519) potential deadlock of CapacityScheduler between decrease container and assign containers

2015-12-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074038#comment-15074038 ] sandflee commented on YARN-4519: got it, thanks [~mding]! > potential deadlock of Capacit

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074049#comment-15074049 ] sandflee commented on YARN-4495: the main problem is we couldn't pass containerId to inva

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074052#comment-15074052 ] sandflee commented on YARN-4495: It would be better to pass the reason the resource change request failed. > a

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074473#comment-15074473 ] sandflee commented on YARN-4495: to [~mding], 1, we have a StateMachine in AM to track e

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-30 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075043#comment-15075043 ] sandflee commented on YARN-4495: Hi, [~leftnoteasy], I do a simple test throwing a exceptio

[jira] [Created] (YARN-4528) decreaseConainer Message maybe lost if NM restart

2015-12-30 Thread sandflee (JIRA)
sandflee created YARN-4528: -- Summary: decreaseConainer Message maybe lost if NM restart Key: YARN-4528 URL: https://issues.apache.org/jira/browse/YARN-4528 Project: Hadoop YARN Issue Type: Bug

[jira] [Updated] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2015-12-30 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4528: --- Summary: decreaseContainer Message maybe lost if NM restart (was: decreaseConainer Message maybe lost if NM re

[jira] [Commented] (YARN-4495) add a way to tell AM container increase/decrease request is invalid

2015-12-30 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075570#comment-15075570 ] sandflee commented on YARN-4495: thanks [~wangda], hoping more suggestions > add a way to

[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2015-12-30 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075772#comment-15075772 ] sandflee commented on YARN-4528: since in most cases container size is not changed, so I pr

[jira] [Updated] (YARN-4520) FinishAppEvent is leaked in leveldb if no app's container running on this node

2016-01-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4520: --- Description: once we restart nodemanager we see many logs like : 2015-12-28 11:59:18,725 WARN org.apache.hadoo

[jira] [Updated] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4528: --- Attachment: YARN-4528.01.patch 1, pending container decrease msg util next heartbeat. 2, nodemanager#allocate d

[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081322#comment-15081322 ] sandflee commented on YARN-4528: HI, [~mding], container decrease msg is passed like conta

[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081342#comment-15081342 ] sandflee commented on YARN-4528: [~jianhe] reviewing the code of how containers complete ms

[jira] [Commented] (YARN-4528) decreaseContainer Message maybe lost if NM restart

2016-01-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081969#comment-15081969 ] sandflee commented on YARN-4528: thanks [~mding], yes this could happen, but rarely. should

[jira] [Updated] (YARN-4581) thread leak makes RM crash while RM is recovering

2016-01-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4581: --- Description: we enable ApplicationHistoryWriter, and find thousands of Errors: {quote} 2016-01-08 03:13:03,44

[jira] [Created] (YARN-4581) thread leak makes RM crash while RM is recovering

2016-01-11 Thread sandflee (JIRA)
sandflee created YARN-4581: -- Summary: thread leak makes RM crash while RM is recovering Key: YARN-4581 URL: https://issues.apache.org/jira/browse/YARN-4581 Project: Hadoop YARN Issue Type: Bug

[jira] [Updated] (YARN-4581) thread leak makes RM crash while RM is recovering

2016-01-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4581: --- Attachment: YARN-4581.01.patch A simple fix for the thread leak problem. > thread leak makes RM crash while RM is recov

[jira] [Commented] (YARN-4581) thread leak makes RM crash while RM is recovering

2016-01-12 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095369#comment-15095369 ] sandflee commented on YARN-4581: thanks [~Naganarasimha] [~djp], our cluster is based on 2.

[jira] [Commented] (YARN-4581) AHS writer thread leak makes RM crash while RM is recovering

2016-01-15 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102942#comment-15102942 ] sandflee commented on YARN-4581: thanks Junping, Naga, Vinod! > AHS writer thread leak mak

[jira] [Created] (YARN-4646) AMRMClient crashed when RM transition from active to standby

2016-01-26 Thread sandflee (JIRA)
sandflee created YARN-4646: -- Summary: AMRMClient crashed when RM transition from active to standby Key: YARN-4646 URL: https://issues.apache.org/jira/browse/YARN-4646 Project: Hadoop YARN Issue Typ

[jira] [Commented] (YARN-4646) AMRMClient crashed when RM transition from active to standby

2016-01-26 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118815#comment-15118815 ] sandflee commented on YARN-4646: I propose not passing Interrupted exception to client whil

[jira] [Commented] (YARN-4646) AMRMClient crashed when RM transition from active to standby

2016-01-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118939#comment-15118939 ] sandflee commented on YARN-4646: Thanks [~zxu], they're the same issue, but patch in MAPRED

[jira] [Commented] (YARN-4646) AMRMClient crashed when RM transition from active to standby

2016-01-27 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120626#comment-15120626 ] sandflee commented on YARN-4646: MR AM catches most remote exceptions and retry, I don't kn

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133627#comment-15133627 ] sandflee commented on YARN-4138: Hi, [~mding], there may some cases not user/app error, 1,

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133758#comment-15133758 ] sandflee commented on YARN-4138: to simple the race condition process, could we reject the

[jira] [Created] (YARN-4672) container resource increased msg may lost if nm restart

2016-02-04 Thread sandflee (JIRA)
sandflee created YARN-4672: -- Summary: container resource increased msg may lost if nm restart Key: YARN-4672 URL: https://issues.apache.org/jira/browse/YARN-4672 Project: Hadoop YARN Issue Type: Bug

[jira] [Commented] (YARN-4672) container resource increased msg may lost if nm restart

2016-02-04 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133783#comment-15133783 ] sandflee commented on YARN-4672: This will trigger container resource rollback logic, thoug

[jira] [Created] (YARN-4673) race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg

2016-02-04 Thread sandflee (JIRA)
sandflee created YARN-4673: -- Summary: race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg Key: YARN-4673 URL: https://issues.apache.org/jira/browse/YARN-4673 Project: Had

[jira] [Commented] (YARN-4138) Roll back container resource allocation after resource increase token expires

2016-02-10 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142052#comment-15142052 ] sandflee commented on YARN-4138: looks good to me too, thanks [~mding] > Roll back contain

[jira] [Updated] (YARN-4673) race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg

2016-02-24 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4673: --- Attachment: YARN-4673.01.patch > race condition in ResourceTrackerService#nodeHeartBeat while processing > ded

[jira] [Commented] (YARN-4673) race condition in ResourceTrackerService#nodeHeartBeat while processing deduplicated msg

2016-02-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166923#comment-15166923 ] sandflee commented on YARN-4673: Hi, [~ozawa], in ResourceTrackService we may concurrently

[jira] [Created] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-02-25 Thread sandflee (JIRA)
sandflee created YARN-4740: -- Summary: container complete msg may lost while AM restart in race condition Key: YARN-4740 URL: https://issues.apache.org/jira/browse/YARN-4740 Project: Hadoop YARN Iss

[jira] [Updated] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-02-25 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4740: --- Attachment: YARN-4740.01.patch put containers in finishedContainersSentToAM back to justFinishedContainer if

[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue

2016-02-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171341#comment-15171341 ] sandflee commented on YARN-4741: Hi,[~sjlee0], 1, does the num of FINISHED_CONTAINERS_PUL

[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue

2016-02-28 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171378#comment-15171378 ] sandflee commented on YARN-4741: one race condition may cause the "Invalid event FINISHED_

[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue

2016-02-29 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173395#comment-15173395 ] sandflee commented on YARN-4741: without the fix of YARN-3990 and YARN-3896, our rm was flo

[jira] [Updated] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-03-02 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4740: --- Attachment: YARN-4740.02.patch > container complete msg may lost while AM restart in race condition > -

[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-03-02 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176947#comment-15176947 ] sandflee commented on YARN-4740: Thanks for your suggestions; attached a new patch to fix these.

[jira] [Commented] (YARN-4740) container complete msg may lost while AM restart in race condition

2016-03-03 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178708#comment-15178708 ] sandflee commented on YARN-4740: yes, this patch ensure AM receive at least one container c

[jira] [Commented] (YARN-4763) RMApps Page crashes with NPE

2016-03-06 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182437#comment-15182437 ] sandflee commented on YARN-4763: I think a general way fix for this is we should get rmappa

[jira] [Commented] (YARN-4763) RMApps Page crashes with NPE

2016-03-08 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184713#comment-15184713 ] sandflee commented on YARN-4763: yes, thanks for pointing this > RMApps Page crashes with

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-05 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227703#comment-15227703 ] sandflee commented on YARN-4924: In YARN-4051, we also had containers leak from NEW to DONE

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-06 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229486#comment-15229486 ] sandflee commented on YARN-4924: thanks [~nroberts], another thought, seems it's not nesses

[jira] [Updated] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-08 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4924: --- Attachment: YARN-4924.01.patch remove FINISH_APP related code in NM > NM recovery race can lead to container n

[jira] [Commented] (YARN-4740) AM may not receive the container complete msg when it restarts

2016-04-08 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232971#comment-15232971 ] sandflee commented on YARN-4740: thanks [~jianhe] for reviewing and committing! > AM may n

[jira] [Updated] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-08 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4924: --- Attachment: YARN-4924.02.patch > NM recovery race can lead to container not cleaned up > --

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-08 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233085#comment-15233085 ] sandflee commented on YARN-4924: {quote} I don't think removeDeprecatedKeys is an appropria

[jira] [Created] (YARN-4936) FileInputStream should be closed explicitly in NMWebService#getLogs

2016-04-08 Thread sandflee (JIRA)
sandflee created YARN-4936: -- Summary: FileInputStream should be closed explicitly in NMWebService#getLogs Key: YARN-4936 URL: https://issues.apache.org/jira/browse/YARN-4936 Project: Hadoop YARN Is

[jira] [Updated] (YARN-4936) FileInputStream should be closed explicitly in NMWebService#getLogs

2016-04-08 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4936: --- Attachment: YARN-4936.01.patch > FileInputStream should be closed explicitly in NMWebService#getLogs >
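
The usual Java idiom for closing a `FileInputStream` explicitly is try-with-resources, which closes the stream on both normal and exceptional exit. The sketch below shows that idiom in isolation. `LogStreamer`, `copyLog`, `readAll`, and `writeTemp` are hypothetical helpers for illustration, not the contents of the attached patch or of `NMWebService#getLogs`.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper showing a stream that is always closed after use.
final class LogStreamer {
    // Copies the file to 'out'; the FileInputStream is closed when the try block exits.
    static void copyLog(File logFile, OutputStream out) throws IOException {
        try (FileInputStream in = new FileInputStream(logFile)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    // Convenience wrapper that returns the file contents as a UTF-8 string.
    static String readAll(File logFile) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyLog(logFile, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Writes 'content' to a new temporary file, for demonstration purposes only.
    static File writeTemp(String content) {
        try {
            File f = Files.createTempFile("demo-log", ".txt").toFile();
            Files.write(f.toPath(), content.getBytes(StandardCharsets.UTF_8));
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without the try-with-resources block, an exception thrown during `read` or `write` would skip any later `close()` call and leave the file descriptor open until garbage collection.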

[jira] [Created] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-11 Thread sandflee (JIRA)
sandflee created YARN-4939: -- Summary: the decommissioning Node should keep alive if NM restart Key: YARN-4939 URL: https://issues.apache.org/jira/browse/YARN-4939 Project: Hadoop YARN Issue Type: B

[jira] [Updated] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4939: --- Attachment: YARN-4939.01.patch > the decommissioning Node should keep alive if NM restart > --

[jira] [Created] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-11 Thread sandflee (JIRA)
sandflee created YARN-4940: -- Summary: yarn node -list -all failed if RM start with decommissioned node Key: YARN-4940 URL: https://issues.apache.org/jira/browse/YARN-4940 Project: Hadoop YARN Issue

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235369#comment-15235369 ] sandflee commented on YARN-4924: thanks [~jlowe], I added @Deprecated to FINISHED_APP_KEY_

[jira] [Updated] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4924: --- Attachment: YARN-4924.03.patch > NM recovery race can lead to container not cleaned up > --

[jira] [Commented] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235388#comment-15235388 ] sandflee commented on YARN-4940: seems not, they are all caused by YARN-3102 > yarn node -

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236072#comment-15236072 ] sandflee commented on YARN-4924: From the interface of DB, createWriteBatch didn't not th

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236115#comment-15236115 ] sandflee commented on YARN-4924: in case of createWriteBatch throws runtime Exception, see

[jira] [Updated] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4924: --- Attachment: YARN-4924.04.patch > NM recovery race can lead to container not cleaned up > --

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236128#comment-15236128 ] sandflee commented on YARN-4924: Thanks [~jlowe], not noticed that DBException is a RUNTIM

[jira] [Updated] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4939: --- Attachment: YARN-4939.02.patch ./bin/yarn node -list -states DECOMMISSIONING couldn't get the decommissionin

[jira] [Updated] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4939: --- Attachment: (was: YARN-4939.02.patch) > the decommissioning Node should keep alive if NM restart > ---

[jira] [Updated] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4939: --- Attachment: YARN-4939.02.patch > the decommissioning Node should keep alive if NM restart > --

[jira] [Commented] (YARN-2567) Add a percentage-node threshold for RM to wait for new allocations after restart/failover

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236459#comment-15236459 ] sandflee commented on YARN-2567: Hi [~vinodkv], could you assign this to me? I'd like to

[jira] [Commented] (YARN-2567) Add a percentage-node threshold for RM to wait for new allocations after restart/failover

2016-04-11 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236701#comment-15236701 ] sandflee commented on YARN-2567: The main idea is to lazily store NM status, if RM failover

[jira] [Commented] (YARN-2567) Add a percentage-node threshold for RM to wait for new allocations after restart/failover

2016-04-12 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236724#comment-15236724 ] sandflee commented on YARN-2567: there maybe one problem that if NM recovered as a finished

[jira] [Updated] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-12 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4940: --- Attachment: YARN-4940.01.patch > yarn node -list -all failed if RM start with decommissioned node > ---

[jira] [Updated] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-12 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4940: --- Attachment: YARN-4940.02.patch > yarn node -list -all failed if RM start with decommissioned node > ---

[jira] [Commented] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-12 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237007#comment-15237007 ] sandflee commented on YARN-4940: rather than converting UnknownNodeId, using NodeId seems

[jira] [Commented] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-12 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237387#comment-15237387 ] sandflee commented on YARN-4940: thanks [~kshukla], the test failures seem unrelated, I

[jira] [Commented] (YARN-2567) Add a percentage-node threshold for RM to wait for new allocations after restart/failover

2016-04-13 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239425#comment-15239425 ] sandflee commented on YARN-2567: Thanks [~jlowe], agree that an asynchronous state store wi

[jira] [Updated] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4939: --- Attachment: YARN-4939.03.patch > the decommissioning Node should keep alive if NM restart > --

[jira] [Updated] (YARN-4939) the decommissioning Node should keep alive if NM restart

2016-04-14 Thread sandflee (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandflee updated YARN-4939: --- Attachment: YARN-4939.04.patch > the decommissioning Node should keep alive if NM restart > --
