[ http://issues.apache.org/jira/browse/HADOOP-563?page=all ]

Sameer Paranjpye updated HADOOP-563:
------------------------------------

    Component/s: dfs
    Description: 
In the current DFS client implementation, there is a single thread responsible 
for renewing leases. If, for whatever reason, that thread falls behind, the 
lease may expire, and the client then gets a lease expiration exception when 
writing a block. The consequence is devastating: the client can no longer 
write to the file, and all the partial results up to that point are lost! This 
is especially costly for map/reduce jobs where a reducer may take hours or 
even days to sort the intermediate results before the actual reducing work can 
start.
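
For context, here is a minimal sketch of how such a background renewer can 
fall behind (the class and method names below are illustrative stand-ins, not 
the actual DFSClient internals):

    import java.io.IOException;

    // Illustrative sketch of a periodic lease renewer. If the loop is starved
    // (GC pause, heavy load, stalled RPC) for longer than the lease expiration
    // time, every file the client holds open is at risk.
    class LeaseRenewer extends Thread {
        interface Namenode {                        // hypothetical RPC stub
            void renewLease(String clientName) throws IOException;
        }

        private static final long RENEW_INTERVAL_MS = 30 * 1000L; // half the 1 min lease

        private final Namenode namenode;
        private final String clientName;

        LeaseRenewer(Namenode namenode, String clientName) {
            this.namenode = namenode;
            this.clientName = clientName;
            setDaemon(true);
        }

        public void run() {
            while (!isInterrupted()) {
                try {
                    namenode.renewLease(clientName); // one call covers all open files
                    Thread.sleep(RENEW_INTERVAL_MS);
                } catch (InterruptedException e) {
                    return;
                } catch (IOException e) {
                    // transient RPC failure; the next iteration tries again,
                    // but meanwhile the lease keeps aging toward expiration
                }
            }
        }
    }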

The problem would be solved if the flush method of the DFS client could renew 
the lease on demand. That is, it should try to renew the lease when it catches 
a lease expiration exception. That way, even when the client is under heavy 
load and the lease renewing thread falls behind, the reducer task (or whatever 
task uses that client) can proceed. That would be a huge saving in some cases 
(where sorting the intermediate results takes a long time). We can set a limit 
on the number of retries, and may even make it configurable (or changeable at 
runtime).
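
A rough sketch of the renew-and-retry idea (everything here is illustrative: 
LeaseExpiredException, the BlockSink interface, and the renewLease call stand 
in for whatever the real client uses):

    import java.io.IOException;

    class LeaseRetryingWriter {
        // Hypothetical stand-ins for the real DFS client internals.
        interface Namenode { void renewLease(String clientName) throws IOException; }
        interface BlockSink { void writeBlock(byte[] block) throws IOException; }
        static class LeaseExpiredException extends IOException {}

        private static final int MAX_RETRIES = 3; // could be made configurable

        private final Namenode namenode;
        private final String clientName;

        LeaseRetryingWriter(Namenode namenode, String clientName) {
            this.namenode = namenode;
            this.clientName = clientName;
        }

        // Write one block; on lease expiration, renew the lease on demand and
        // retry up to MAX_RETRIES times instead of failing the whole task.
        void writeBlock(BlockSink sink, byte[] block) throws IOException {
            for (int attempt = 0; ; attempt++) {
                try {
                    sink.writeBlock(block);
                    return;
                } catch (LeaseExpiredException e) {
                    if (attempt >= MAX_RETRIES) {
                        throw e; // out of retries; fail as the client does today
                    }
                    namenode.renewLease(clientName); // renew on demand, then retry
                }
            }
        }
    }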

The namenode can use a separate, much longer expiration time than the current 
1 minute lease expiration for cleaning up abandoned, unclosed files.
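
In other words, the lease would have a short soft limit that writers must 
renew within, and a much longer hard limit after which the namenode reclaims 
abandoned files. A sketch of that distinction (the limit values and names are 
made up for illustration):

    // Sketch of a lease with separate soft and hard limits (values illustrative).
    class Lease {
        static final long SOFT_LIMIT_MS = 60 * 1000L;      // 1 minute renewal window
        static final long HARD_LIMIT_MS = 60 * 60 * 1000L; // e.g. 1 hour before cleanup

        private long lastRenewed = System.currentTimeMillis();

        void renew() { lastRenewed = System.currentTimeMillis(); }

        // Past the soft limit, a writer gets a lease expiration exception and
        // may renew on demand; only past the hard limit does the namenode
        // clean up the abandoned, unclosed file.
        boolean softExpired(long now) { return now - lastRenewed > SOFT_LIMIT_MS; }
        boolean hardExpired(long now) { return now - lastRenewed > HARD_LIMIT_MS; }
    }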

 

> DFS client should try to renew the lease if it gets a lease expiration 
> exception when it adds a block to a file
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-563
>                 URL: http://issues.apache.org/jira/browse/HADOOP-563
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Runping Qi

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly, contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
