[ 
https://issues.apache.org/jira/browse/HADOOP-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430747#comment-16430747
 ] 

Jason Lowe commented on HADOOP-15372:
-------------------------------------

Thanks for the report, Miklos

bq. runCommand()}} can be called while/after Shell.getAllShells() returned all 
the shells to be cleaned up.  This new thread can avoid the clean up, so that 
the process held by it can be leaked causing leaked localized files/etc.

Yes, there's a small bug where Shell#runCommand should be synchronizing on 
CHILD_SHELLS as it starts the subprocess. That way it either won't be started 
before destroyAllShells is called or it will be part of the list as long as 
it's been started.  However I would argue it's outside the scope of the 
getAllShells and destoryAllShells APIs to prevent all future shells from being 
launched, as there may be a use case where someone wants to cleanup all current 
shells but still launch future ones. Its job is to kill all active ones which 
client code outside of Shell cannot do.

In the specific case of localizing, it looks like we need a second destroy pass 
after awaiting the shutdown of the executor to catch any shell that was trying 
to launch just as we destroyed the active ones.

{quote}I see another issue as well. [...] the try catch block does not cover 
all instructions after the process is started, so for example we can exit the 
thread and leak the process
{quote}
Yes that appears to be an issue as well.

[~ebadger] do you have some cycles to look into this?

> Race conditions and possible leaks in the Shell class
> -----------------------------------------------------
>
>                 Key: HADOOP-15372
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15372
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.10.0, 3.2.0
>            Reporter: Miklos Szegedi
>            Priority: Minor
>
> YARN-5641 introduced some cleanup code in the Shell class. It has a race 
> condition. {{Shell.
> runCommand()}} can be called while/after {{Shell.getAllShells()}} returned 
> all the shells to be cleaned up. This new thread can avoid the clean up, so 
> that the process held by it can be leaked causing leaked localized files/etc.
> I see another issue as well. {{Shell.runCommand()}} has a finally block with 
> a {{
> process.destroy();}} to clean up. However, the try catch block does not cover 
> all instructions after the process is started, so for example we can exit the 
> thread and leak the process, if {{
> timeOutTimer.schedule(timeoutTimerTask, timeOutInterval);}} causes an 
> exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to