[ 
https://issues.apache.org/jira/browse/HADOOP-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847478#action_12847478
 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-6640:
------------------------------------------------

When the FileSystem cache is enabled, FileSystem.get(..) calls
FileSystem.Cache.get(..), which is a synchronized method. If the lookup fails,
a new instance is initialized. Depending on the FileSystem subclass
implementation, the initialization may take a long time. In that case, the
FileSystem.Cache lock is held for the whole initialization, and all calls to
FileSystem.get(..) by other threads are blocked until it finishes.
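
To make the locking pattern concrete, here is a minimal sketch of the behavior described above. It is not the Hadoop source: BlockingCache and slowInit are hypothetical names standing in for FileSystem.Cache and the FileSystem initialization.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for FileSystem.Cache; illustrates the problem only. */
class BlockingCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final Function<K, V> slowInit; // stands in for FileSystem initialization

    BlockingCache(Function<K, V> slowInit) {
        this.slowInit = slowInit;
    }

    // The method is synchronized, so the cache lock is held during the
    // lookup AND during the (possibly very slow) initialization.
    synchronized V get(K key) {
        V value = map.get(key);
        if (value == null) {
            value = slowInit.apply(key); // retries and timeouts happen under the lock
            map.put(key, value);
        }
        return value;
    }
}
```

While one thread is inside slowInit for one key, a get(..) for any other key has to wait for the same lock, even if that other key is already cached.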

In particular, the DistributedFileSystem initialization may take a long time
since there are retries. It is even worse if the socket timeout is set to a
large value.

There are two possible fixes for the problem:

# (by Sanjay) Change FileSystem.Cache.get(..) so that if the lookup fails, it
first releases the lock, initializes a FileSystem instance, acquires the lock
again, and then adds the instance to the cache.  One problem is that if a user
application calls FileSystem.get(..) repeatedly for the same FileSystem within
a short period of time, many instances will be initialized.

# Change DistributedFileSystem so that it connects lazily: it defers
connecting to the server until the first RPC.  A drawback is that this only
fixes DistributedFileSystem but not other FileSystem subclasses.
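
The two options above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed patch: UnlockedInitCache, LazyClient, init and connector are hypothetical names, and keeping the first cached instance when two threads race is one possible choice that the description above does not specify.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Option 1 (hypothetical sketch): initialize outside the cache lock. */
class UnlockedInitCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final Function<K, V> init; // stands in for FileSystem initialization

    UnlockedInitCache(Function<K, V> init) {
        this.init = init;
    }

    V get(K key) {
        synchronized (this) {
            V cached = map.get(key);
            if (cached != null) {
                return cached;
            }
        } // lock released before the slow part

        V created = init.apply(key); // may be slow; other lookups are not blocked

        synchronized (this) {
            V existing = map.get(key);
            if (existing != null) {
                // Another thread initialized the same key in the meantime.
                // Assumption: keep the first instance and drop ours; a real
                // implementation would probably also need to close it.
                return existing;
            }
            map.put(key, created);
            return created;
        }
    }
}

/** Option 2 (hypothetical sketch): defer connecting until the first call. */
class LazyClient {
    private final Supplier<String> connector; // stands in for the server connection setup
    private String connection;                // null until the first call

    LazyClient(Supplier<String> connector) {
        this.connector = connector;
    }

    synchronized boolean isConnected() {
        return connection != null;
    }

    synchronized String call(String request) {
        if (connection == null) {
            connection = connector.get(); // the slow connect happens here, not in the constructor
        }
        return connection + " handled " + request;
    }
}
```

With option 1, concurrent callers for the same key may each run init, which is the duplicate-initialization cost mentioned above. With option 2, constructing the client is cheap, so even a fully synchronized cache would hold its lock only briefly.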

> FileSystem.get() does RPC retries within a static synchronized block
> --------------------------------------------------------------------
>
>                 Key: HADOOP-6640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6640
>             Project: Hadoop Common
>          Issue Type: Bug
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Priority: Critical
>
> If using FileSystem.get() in a multithreaded environment, and one get() locks
> because the NN URI is too slow or not responding and retries are in progress,
> all other get() calls (for different users or NNs) are blocked.
> The synchronized block is in the static instance of the Cache inner class.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
