[ 
https://issues.apache.org/jira/browse/HDFS-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560190#comment-13560190
 ] 

Colin Patrick McCabe commented on HDFS-4417:
--------------------------------------------

The new test works by setting up a scenario where we will have a lot of stale 
UNIX domain sockets in the PeerCache.  It does this by setting the socket 
keepalive to 1 millisecond, enlarging the cache size to 32, and setting the 
cache expiry time to several minutes.  Then it sets 
{{DFSInputStream#tcpReadsDisabledForTesting}}, which will cause an exception if 
we try to read over a TCP socket.
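For illustration, such a test configuration might look roughly like the following.  The property names here are assumptions based on Hadoop's client/datanode socket-cache settings in later releases, not taken from the patch itself:

```xml
<!-- Illustrative only: property names assumed, not from the HDFS-4417 patch. -->
<property>
  <name>dfs.datanode.socket.reuse.keepalive</name>
  <value>1</value> <!-- 1 ms: cached sockets go stale almost immediately -->
</property>
<property>
  <name>dfs.client.socketcache.capacity</name>
  <value>32</value> <!-- enlarged cache accumulates many soon-stale sockets -->
</property>
<property>
  <name>dfs.client.socketcache.expiryMsec</name>
  <value>300000</value> <!-- several minutes: the client does not expire them itself -->
</property>
```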

The idea is to catch the issue we saw before where UNIX domain sockets were 
getting stale and causing the socket path to get blacklisted.  This bad 
behavior caused us to fall back on TCP sockets in cases where we shouldn't have.
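The desired behavior can be sketched with a toy cache.  This is an illustrative stand-in, not Hadoop's actual PeerCache; the class and method names are invented.  The point is that a stale entry should simply be discarded and the next one tried, rather than treated as evidence that the local path is broken:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy sketch (not Hadoop's PeerCache): a FIFO cache whose entries can go
// stale.  An entry older than keepaliveMs is treated as dead, mirroring
// the DN closing its side of the socket after the keepalive timeout.
class ExpiringPeerCache {
    private static final class Entry {
        final String peer;
        final long insertedAtMs;
        Entry(String peer, long insertedAtMs) {
            this.peer = peer;
            this.insertedAtMs = insertedAtMs;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();
    private final int capacity;
    private final long keepaliveMs;

    ExpiringPeerCache(int capacity, long keepaliveMs) {
        this.capacity = capacity;
        this.keepaliveMs = keepaliveMs;
    }

    void put(String peer, long nowMs) {
        if (entries.size() == capacity) {
            entries.removeFirst(); // evict the oldest entry
        }
        entries.addLast(new Entry(peer, nowMs));
    }

    // Returns a live peer, silently dropping stale ones along the way.
    // Dropping (rather than failing) is the key point: a stale cached
    // socket should not disable the local path.
    String getLive(long nowMs) {
        while (!entries.isEmpty()) {
            Entry e = entries.removeFirst();
            if (nowMs - e.insertedAtMs <= keepaliveMs) {
                return e.peer;
            }
            // stale: the DN closed its end; discard and keep looking
        }
        return null;
    }
}
```

Under this sketch, a burst of puts followed by a quiet period leaves only stale entries, and the next getLive drains them without ever declaring the peer unusable.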
                
> HDFS-347: fix case where local reads get disabled incorrectly
> -------------------------------------------------------------
>
>                 Key: HDFS-4417
>                 URL: https://issues.apache.org/jira/browse/HDFS-4417
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, hdfs-client, performance
>            Reporter: Todd Lipcon
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-4417.002.patch, HDFS-4417.003.patch, 
> HDFS-4417.004.patch, hdfs-4417.txt
>
>
> In testing HDFS-347 against HBase (thanks [~jdcryans]) we ran into the 
> following case:
> - a workload is running which puts a bunch of local sockets in the PeerCache
> - the workload abates for a while, causing the sockets to go "stale" (i.e., the 
> DN side disconnects after the keepalive timeout)
> - the workload starts again
> In this case, the local socket retrieved from the cache failed the 
> newBlockReader call, and it incorrectly disabled local sockets on that host. 
> This is similar to an earlier bug, HDFS-3376, but not quite the same.
> The next issue we ran into is that, once this happened, it never tried local 
> sockets again, because the cache held lots of TCP sockets. Since we always 
> managed to get a cached socket to the local node, it didn't bother trying 
> local read again.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
