[ https://issues.apache.org/jira/browse/HDFS-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-941:
-----------------------------

    Attachment: hdfs-941.txt

New patch with response to stack's review:

bq. ... google-collections ...

I think you were looking at a previous patch. The new patch uses Guava, which 
has been included in ivy.xml for a month or two now.

bq. If you make another version of this patch, change the mentions of getEOS in 
comments to be 'eos' to match the change of variable name.

Fixed (I only found one such mention).

bq. 469 +        sock.setTcpNoDelay(true);

Sharp eyes. The reason we need TCP_NODELAY now, whereas we didn't before, is 
that without it we hit a nasty interaction between Nagle's algorithm and 
delayed ACKs. Basically, our interaction pattern now looks like:

C -> DN: read block X
DN -> C: data
C -> DN: status OK
C -> DN: read block Y

The fact that we send two small packets in a row triggers Nagle's algorithm: 
the client won't send the read request for block Y until it has received an 
ACK for the "status OK" packet, or until a lengthy delay elapses (40ms, I 
think, on Linux). The DN isn't sending the ACK for "status OK" because it has 
gone into delayed ACK mode. As a result, throughput drops by a factor of 20x 
or more. I confirmed this with TestParallelRead - I saw 12MB/sec versus 
274MB/sec with NODELAY on.

NODELAY fixes this by forcing the "read block Y" request to be sent immediately 
even though the ACK for the prior packet is still outstanding.

To be sure that the performance gains in this patch aren't just from NODELAY, I 
added NODELAY to unpatched trunk and ran TestParallelRead. It made no 
performance difference when it wasn't combined with connection reuse.

I'll add a comment to the code explaining why NODELAY is crucial for 
performance.
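
Roughly the sort of setup and comment I have in mind (a sketch only, not the 
exact code from the patch; dnAddr and socketTimeout are illustrative names, 
and 50010 is just the default xceiver port):

{code:java}
import java.net.InetSocketAddress;
import java.net.Socket;

public class NoDelayExample {
  public static void main(String[] args) throws Exception {
    // Illustrative values; the real client derives these from the located
    // block and its configured socket timeout.
    InetSocketAddress dnAddr = new InetSocketAddress("127.0.0.1", 50010);
    int socketTimeout = 60 * 1000;

    Socket sock = new Socket();
    sock.connect(dnAddr, socketTimeout);
    sock.setSoTimeout(socketTimeout);
    // Disable Nagle's algorithm. With connection reuse the client writes two
    // small packets back to back (the "status OK" for the previous read and
    // the next read request). Without TCP_NODELAY, Nagle holds the second
    // packet until the first is ACKed, and the DN is sitting in delayed-ACK
    // mode, so every read stalls for ~40ms and throughput collapses.
    sock.setTcpNoDelay(true);
    sock.close();
  }
}
{code}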

bq. Also, old code used set timer after making connection

Fixed; I don't think the old ordering was for any particular reason.

bq. You think 16 a good number for the socket cache (doesn't seem easily 
changeable)?

16 seems like a decent default, but I made it configurable just in case it 
bites anyone in production. Added dfs.client.socketcache.capacity as an 
undocumented parameter, since we don't expect it to be tuned.
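
For anyone who does end up needing to tune it, it's an ordinary client conf 
knob. A minimal sketch (the value 32 is just an example; the default of 16 
applies when the key is unset):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SocketCacheConfExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Override the undocumented socket cache capacity added by this patch.
    conf.setInt("dfs.client.socketcache.capacity", 32);
    // Read it back, falling back to the default of 16 if unset.
    System.out.println(conf.getInt("dfs.client.socketcache.capacity", 16));
  }
}
{code}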


> Datanode xceiver protocol should allow reuse of a connection
> ------------------------------------------------------------
>
>                 Key: HDFS-941
>                 URL: https://issues.apache.org/jira/browse/HDFS-941
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: bc Wong
>         Attachments: HDFS-941-1.patch, HDFS-941-2.patch, HDFS-941-3.patch, 
> HDFS-941-3.patch, HDFS-941-4.patch, HDFS-941-5.patch, HDFS-941-6.22.patch, 
> HDFS-941-6.patch, HDFS-941-6.patch, HDFS-941-6.patch, hdfs-941.txt, 
> hdfs-941.txt, hdfs-941.txt, hdfs941-1.png
>
>
> Right now each connection into the datanode xceiver only processes one 
> operation.
> In the case that an operation leaves the stream in a well-defined state (e.g. a 
> client reads to the end of a block successfully), the same connection could be 
> reused for a second operation. This should improve random read performance 
> significantly.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
