[jira] [Commented] (MYRIAD-153) Placeholder tasks yarn_container_* is not cleaned after yarn job is complete.

2015-12-04 Thread Santosh Marella (JIRA)

[ 
https://issues.apache.org/jira/browse/MYRIAD-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042257#comment-15042257
 ] 

Santosh Marella commented on MYRIAD-153:


Thanks [~darinj]! Please go ahead.

> Placeholder tasks yarn_container_* is not cleaned after yarn job is complete.
> -
>
> Key: MYRIAD-153
> URL: https://issues.apache.org/jira/browse/MYRIAD-153
> Project: Myriad
>  Issue Type: Bug
>Reporter: Sarjeet Singh
>Assignee: Santosh Marella
> Attachments: Mesos_UI_screeshot_placeholder_tasks_running.png
>
>
> Observed that the placeholder tasks for containers launched with FGS are still 
> in RUNNING state on Mesos. These container tasks are not cleaned up properly 
> after the job has finished completely.
> See the attached Mesos UI screenshot showing the placeholder tasks still running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MYRIAD-153) Placeholder tasks yarn_container_* is not cleaned after yarn job is complete.

2015-11-03 Thread Santosh Marella (JIRA)

[ 
https://issues.apache.org/jira/browse/MYRIAD-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987572#comment-14987572
 ] 

Santosh Marella commented on MYRIAD-153:


[~sdaingade], [~sarjeet] and I investigated this problem last week. We've 
identified the following:

a. An M/R app master typically requests slightly more containers than it needs. I 
guess the reason behind this design is to keep some containers ready to go in case 
a few tasks fail.
b. When there are no task failures, these extra containers are never used.
c. When the job finishes, these containers are released by the app master.
d. With FGS, the Myriad executor aux service doesn't seem to get a 
stopContainer callback for them, since these containers are never physically 
launched (see the sketch after this list).
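For context, here's a minimal sketch of the NodeManager aux-service hooks involved 
(class and field names below are illustrative, not Myriad's actual code): 
per-container teardown is reported only through stopContainer, and YARN invokes 
that only for containers it actually launched, so a container that goes 
ACQUIRED -> RELEASED never triggers it.

{code:java}
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
import org.apache.hadoop.yarn.server.api.AuxiliaryService;
import org.apache.hadoop.yarn.server.api.ContainerInitializationContext;
import org.apache.hadoop.yarn.server.api.ContainerTerminationContext;

// Illustrative aux service that tracks which containers were actually
// launched on this node. Class/field names are hypothetical.
public class PlaceholderTrackingAuxService extends AuxiliaryService {

  // Containers YARN physically launched on this node.
  private final Map<ContainerId, Boolean> launched = new ConcurrentHashMap<>();

  public PlaceholderTrackingAuxService() {
    super("placeholder_tracker");
  }

  @Override
  public void initializeContainer(ContainerInitializationContext ctx) {
    // Fires only when a container is really launched on this node.
    launched.put(ctx.getContainerId(), Boolean.TRUE);
  }

  @Override
  public void stopContainer(ContainerTerminationContext ctx) {
    // Fires only for launched containers. A container that went
    // ACQUIRED -> RELEASED (never launched) never reaches this hook,
    // which is why its Mesos placeholder task is left RUNNING.
    launched.remove(ctx.getContainerId());
  }

  @Override
  public void initializeApplication(ApplicationInitializationContext ctx) { }

  @Override
  public void stopApplication(ApplicationTerminationContext ctx) {
    // Fires once per application; the fix discussed below hooks in here.
  }

  @Override
  public ByteBuffer getMetaData() {
    return ByteBuffer.allocate(0);
  }
}
{code}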

For example, for container container_1442507909665_0003_01_43, no M/R task was 
actually launched (perhaps it was allocated for speculative execution).
This container's placeholder task never finished:

smarella:~/scratch/bug20530$ grep container_1442507909665_0003_01_43 testrm.646ddf2c-5d5a-11e5-9651-0cc47a587d16.stderr | grep Transitioned
15/09/17 10:26:26 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_43 Container Transitioned from NEW to RESERVED
15/09/17 10:26:31 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_43 Container Transitioned from NEW to ALLOCATED
15/09/17 10:26:32 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_43 Container Transitioned from ALLOCATED to ACQUIRED
15/09/17 10:26:33 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_43 Container Transitioned from ACQUIRED to RELEASED

By contrast, for container container_1442507909665_0003_01_42, an M/R task was 
actually launched.
This container's placeholder task finished correctly:

smarella:~/scratch/bug20530$ grep container_1442507909665_0003_01_42 testrm.646ddf2c-5d5a-11e5-9651-0cc47a587d16.stderr | grep Transitioned
15/09/17 10:26:21 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_42 Container Transitioned from NEW to RESERVED
15/09/17 10:26:25 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_42 Container Transitioned from NEW to ALLOCATED
15/09/17 10:26:25 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_42 Container Transitioned from ALLOCATED to ACQUIRED
15/09/17 10:26:26 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_42 Container Transitioned from ACQUIRED to RUNNING
15/09/17 10:26:41 INFO rmcontainer.RMContainerImpl: container_1442507909665_0003_01_42 Container Transitioned from RUNNING to COMPLETED


[~sdaingade] proposed a solution for this: the executor aux service should send a 
COMPLETE status for all the remaining placeholder tasks after an application 
completes. There is a callback in the aux-services interface that's invoked once 
an application finishes.
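
A rough sketch of what the Mesos side of that could look like (the helper and 
variable names are hypothetical; the calls are the standard ExecutorDriver / 
TaskStatus API):

{code:java}
// Hypothetical helper: when the aux service's stopApplication() fires,
// mark every placeholder task still tracked for that app as finished.
import java.util.Collection;

import org.apache.mesos.ExecutorDriver;
import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;

public final class PlaceholderCleanup {

  private PlaceholderCleanup() { }

  /**
   * Sends TASK_FINISHED for each leftover placeholder task so Mesos stops
   * showing them as RUNNING and releases their resources.
   */
  public static void finishPlaceholders(ExecutorDriver driver,
                                        Collection<TaskID> leftoverPlaceholders) {
    for (TaskID taskId : leftoverPlaceholders) {
      TaskStatus status = TaskStatus.newBuilder()
          .setTaskId(taskId)
          .setState(TaskState.TASK_FINISHED)
          .setMessage("Placeholder released: application completed without launching this container")
          .build();
      driver.sendStatusUpdate(status);
    }
  }
}
{code}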

I'll implement the fix and submit a PR.

> Placeholder tasks yarn_container_* is not cleaned after yarn job is complete.
> -
>
> Key: MYRIAD-153
> URL: https://issues.apache.org/jira/browse/MYRIAD-153
> Project: Myriad
>  Issue Type: Bug
>Reporter: Sarjeet Singh
> Attachments: Mesos_UI_screeshot_placeholder_tasks_running.png
>
>
> Observed that the placeholder tasks for containers launched with FGS are still 
> in RUNNING state on Mesos. These container tasks are not cleaned up properly 
> after the job has finished completely.
> See the attached Mesos UI screenshot showing the placeholder tasks still running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MYRIAD-153) Placeholder tasks yarn_container_* is not cleaned after yarn job is complete.

2015-11-02 Thread DarinJ (JIRA)

[ 
https://issues.apache.org/jira/browse/MYRIAD-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986651#comment-14986651
 ] 

DarinJ commented on MYRIAD-153:
---

I'm having a similar issue. Have you looked at the stderr in your sandbox and/or 
your Hadoop logs? I've noticed this line in each task that gets stuck in 
RUNNING:

{quote}
15/11/03 03:44:25 WARN containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1446520127877_0004_01_000509
{quote}
where container_X matches yarn_container_X. I haven't had a 
chance to investigate further, though.

> Placeholder tasks yarn_container_* is not cleaned after yarn job is complete.
> -
>
> Key: MYRIAD-153
> URL: https://issues.apache.org/jira/browse/MYRIAD-153
> Project: Myriad
>  Issue Type: Bug
>Reporter: Sarjeet Singh
> Attachments: Mesos_UI_screeshot_placeholder_tasks_running.png
>
>
> Observed that the placeholder tasks for containers launched with FGS are still 
> in RUNNING state on Mesos. These container tasks are not cleaned up properly 
> after the job has finished completely.
> See the attached Mesos UI screenshot showing the placeholder tasks still running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)