[ https://issues.apache.org/jira/browse/HDFS-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556504#comment-13556504 ]

Todd Lipcon commented on HDFS-4417:
-----------------------------------

bq. I have to think about this a little bit more. I would like to avoid 
splitting DomainSocketFactory#create into two functions if it is at all 
possible. I feel like we could be a little bit smarter in PeerCache and avoid a 
lot of these problems. And yeah, we need a unit test. Mind if I take this one?

Go for it.

bq. If we do have a mismatch between the domain socket keepalive and the length 
of time we cache sockets in PeerCache, we obviously should fix that-- trying to 
use sockets that we ought to know are stale is not smart. (Obviously, there 
will always be some mismatches-- if the server's keepalive changes and not all 
clients are updated, etc.)

The other viewpoint is that, with the current mismatch, we at least know that 
we handle mismatches correctly, since they're easy to trigger. Also, given that 
the keepalive time is a server-side config, it's tough to get the two to match 
up. We're just following in the footsteps of HTTP keepalive, where the server 
may drop the keepalive session at any time.

                
> HDFS-347: fix case where local reads get disabled incorrectly
> -------------------------------------------------------------
>
>                 Key: HDFS-4417
>                 URL: https://issues.apache.org/jira/browse/HDFS-4417
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, hdfs-client, performance
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-4417.txt
>
>
> In testing HDFS-347 against HBase (thanks [~jdcryans]) we ran into the 
> following case:
> - a workload is running which puts a bunch of local sockets in the PeerCache
> - the workload abates for a while, causing the sockets to go "stale" (i.e., 
> the DN side disconnects after the keepalive timeout)
> - the workload starts again
> In this case, the local socket retrieved from the cache failed the 
> newBlockReader call, and it incorrectly disabled local sockets on that host. 
> This is similar to an earlier bug HDFS-3376, but not quite the same.
> The next issue we ran into is that, once this happened, it never tried local 
> sockets again, because the cache held lots of TCP sockets. Since we always 
> managed to get a cached socket to the local node, it didn't bother trying 
> local read again.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
