[ https://issues.apache.org/jira/browse/HDFS-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831543#action_12831543 ]

Todd Lipcon commented on HDFS-959:
----------------------------------

bq. 2.1: Concurrency of Disk Writes: Checksum verification and writing data to 
disk can be moved to a separate thread ("Disk Write Thread"). This will allow 
the existing "network thread" to trigger faster acks to the DFSClient.

I don't quite like this. In my mind, the ack signifies that the packet has been 
successfully written. This optimization you've described should only affect 
latency, not throughput. The writer already establishes a pipeline of unacked 
writes, so the latency improvement shouldn't change throughput. However, it 
does make sense for the DN to forward the packet downstream concurrently while 
writing to disk, so long as the disk write blocks the ack back down the 
pipeline.
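
To make that ordering concrete, something along these lines - purely illustrative 
types, not the actual BlockReceiver code, and ignoring downstream-ack aggregation:

{code:java}
// Hypothetical sketch only: the interfaces below are illustrative stand-ins,
// not HDFS internals.
import java.io.IOException;

public class PacketOrderingSketch {

  interface Mirror     { void forward(byte[] packet) throws IOException; }
  interface BlockFile  { void write(byte[] packet) throws IOException; }
  interface AckChannel { void sendAck(long seqno) throws IOException; }

  /**
   * Forward the packet to the downstream node first, so the rest of the
   * pipeline can start working on it, then do the local write, and only
   * then ack upstream. The ack therefore still means "this packet has been
   * written locally", but downstream nodes are not serialized behind our disk.
   */
  static void handlePacket(byte[] packet, long seqno,
                           Mirror mirror, BlockFile disk, AckChannel acks)
      throws IOException {
    mirror.forward(packet);  // overlap downstream transfer with local disk I/O
    disk.write(packet);      // the disk write still gates the ack...
    acks.sendAck(seqno);     // ...so the ack semantics are preserved
  }
}
{code}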

The fact that you're seeing a 15-20% gain here surprises me. Maybe that 
indicates that our MAX_PACKETS value is too low. We currently allow 80 packets 
(~5MB) worth of unacked data. I can see where this would be too low - 5MB / 
(150MB/sec) = 33ms. Given that write latencies and even context switching can 
really add up on a heavily loaded system, this might be too small a window.

Let's add some instrumentation to waitAndQueuePacket to understand how much the 
writer is delayed waiting for a free packet slot, and then increase MAX_PACKETS 
and see the effect on this count. If the throughput goes up and the count goes 
down, we know we should increase MAX_PACKETS.
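
As a rough sketch of the instrumentation I have in mind - a standalone toy 
bounded queue, not a patch against DFSOutputStream#waitAndQueuePacket:

{code:java}
// Standalone sketch of writer-side queue instrumentation. The real change would
// go into DFSOutputStream#waitAndQueuePacket; the class below only illustrates
// counting how often, and for how long, the writer blocks on a full queue.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

public class InstrumentedPacketQueue<T> {
  private final Deque<T> queue = new ArrayDeque<T>();
  private final int maxPackets;                              // analogue of MAX_PACKETS (80 today)
  private final AtomicLong blockedCount = new AtomicLong();  // times the writer had to wait
  private final AtomicLong blockedNanos = new AtomicLong();  // total time spent waiting

  public InstrumentedPacketQueue(int maxPackets) {
    this.maxPackets = maxPackets;
  }

  /** Called by the writer thread; blocks while the queue is full. */
  public synchronized void waitAndQueuePacket(T packet) throws InterruptedException {
    if (queue.size() >= maxPackets) {
      blockedCount.incrementAndGet();
      long start = System.nanoTime();
      while (queue.size() >= maxPackets) {
        wait();
      }
      blockedNanos.addAndGet(System.nanoTime() - start);
    }
    queue.addLast(packet);
    notifyAll();
  }

  /** Called by the streamer thread when it is ready to send the next packet. */
  public synchronized T dequeuePacket() throws InterruptedException {
    while (queue.isEmpty()) {
      wait();
    }
    T packet = queue.removeFirst();
    notifyAll();
    return packet;
  }

  public long getBlockedCount()  { return blockedCount.get(); }
  public long getBlockedMillis() { return blockedNanos.get() / 1000000L; }
}
{code}

If the blocked count and blocked time drop sharply (and throughput rises) when 
the limit is raised, that would confirm the ~5MB window is too small.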

bq. 2.5 Checksum buffered writes

Regarding this one, you should have a look at the FSInputChecker changes made 
by HADOOP-3205. This same idea should work on FSOutputSummer and I don't see 
any reason why it wouldn't have a similar speedup - I just never got around to 
it.
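
For what it's worth, here is a standalone sketch of the batching idea on the 
write side - checksumming several 512-byte chunks at once and issuing one data 
write() and one checksum write() per batch, analogous to what HADOOP-3205 did 
for FSInputChecker on the read side. It uses plain java.util.zip.CRC32 and is 
not the FSOutputSummer API:

{code:java}
// Standalone sketch of multi-chunk checksumming on the write path; an
// illustration only, not the FSOutputSummer implementation.
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

public class BulkSummingOutputStream {
  private static final int BYTES_PER_CHECKSUM = 512;
  private static final int CHUNKS_PER_FLUSH   = 8;     // ~4KB of data per batch

  private final OutputStream dataOut;
  private final OutputStream checksumOut;
  private final byte[] dataBuf = new byte[BYTES_PER_CHECKSUM * CHUNKS_PER_FLUSH];
  private final byte[] sumBuf  = new byte[4 * CHUNKS_PER_FLUSH];  // one CRC32 per chunk
  private int dataLen = 0;
  private final CRC32 crc = new CRC32();

  public BulkSummingOutputStream(OutputStream dataOut, OutputStream checksumOut) {
    this.dataOut = dataOut;
    this.checksumOut = checksumOut;
  }

  public void write(byte[] b, int off, int len) throws IOException {
    while (len > 0) {
      int n = Math.min(len, dataBuf.length - dataLen);
      System.arraycopy(b, off, dataBuf, dataLen, n);
      dataLen += n;
      off += n;
      len -= n;
      if (dataLen == dataBuf.length) {
        flushChunks();
      }
    }
  }

  /** Checksum every buffered 512-byte chunk, then one data write and one checksum write. */
  private void flushChunks() throws IOException {
    int sumLen = 0;
    for (int pos = 0; pos < dataLen; pos += BYTES_PER_CHECKSUM) {
      int chunk = Math.min(BYTES_PER_CHECKSUM, dataLen - pos);
      crc.reset();
      crc.update(dataBuf, pos, chunk);
      int v = (int) crc.getValue();
      sumBuf[sumLen++] = (byte) (v >>> 24);
      sumBuf[sumLen++] = (byte) (v >>> 16);
      sumBuf[sumLen++] = (byte) (v >>> 8);
      sumBuf[sumLen++] = (byte) v;
    }
    dataOut.write(dataBuf, 0, dataLen);    // one buffered write for the whole batch
    checksumOut.write(sumBuf, 0, sumLen);  // instead of one per 512-byte chunk
    dataLen = 0;
  }

  public void close() throws IOException {
    if (dataLen > 0) {
      flushChunks();
    }
    dataOut.close();
    checksumOut.close();
  }
}
{code}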


Regarding all of the 2% improvements, what was your test methodology? In my 
experience, TestDFSIO's variance is too high to notice a 2% improvement without 
running it a _lot_ of times.

Also, regarding the multinode replication tests - did you notice an appreciable 
difference in CPU usage in either direction? If these changes improve repl=1 
write performance, but increase CPU usage for repl=2 and repl=3, I think that's 
a problem. Very few real use cases use repl=1.

> Performance improvements to DFSClient and DataNode for faster DFS write at 
> replication factor of 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-959
>                 URL: https://issues.apache.org/jira/browse/HDFS-959
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, hdfs client
>    Affects Versions: 0.20.2, 0.22.0
>         Environment: RHEL5 on Dual CPU quad-core Intel servers, 16 GB RAM, 4 
> SATA disks.
>            Reporter: Naredula Janardhana Reddy
>             Fix For: 0.20.2, 0.22.0
>
>
> The following improvements are suggested to DFSClient and DataNode to improve 
> DFS write throughput, based on experimental verification with a replication 
> factor of 1.
> The changes are useful in principle for replication factors of 2 and 3 as 
> well, but they do not currently demonstrate noticeable performance 
> improvement in our test-bed because of a network throughput bottleneck that 
> hides the benefit of these changes. 
> All changes are applicable to 0.20.2. Some of them are applicable to trunk, 
> as noted below. I have not verified applicability to 0.21.
> List of Improvements
> -----------------------------
> Item 1: DFSClient: Finer-grained locks in writeChunk(). Currently the lock is 
> held per 512-byte chunk. It can be moved to the packet level (64 KB), to 
> lower the frequency of locking.
>  This optimization applies to 20.2. It already appears in trunk.
> Item 2: Misc. improvements to DataNode
>  2.1:  Concurrency of Disk Writes: Checksum verification and writing data to 
> disk can be moved to a separate thread ("Disk Write Thread"). This will allow 
> the existing "network thread" to trigger faster acks to the DFSClient. It 
> will also allow the packet to be transmitted to the replication node faster. 
> In effect, this will allow the DataNode to consume packets at a higher rate.
>  This optimization applies to 20.2 and trunk.
>  2.2:  Bulk Receive and Bulk Send: This optimization is enabled by 2.1. The 
> DataNode can now receive more than one packet at a time, since we have added 
> a buffer between the (existing) network thread and the (newly added) Disk 
> Write thread.
>  This optimization applies to 20.2 and trunk.
>  2.3: Early Ack: The proposed optimization is to send acks to the client as 
> soon as possible instead of waiting for the disk write. Note that the last 
> ack is an exception: it will be sent only after the data has been flushed to 
> the OS.
>  This optimization applies to 20.2. It already appears in trunk.
>  2.4: lseek optimization: Currently lseek (the system call) is issued before 
> every disk write, which is not necessary when the write is sequential. The 
> proposed optimization calls lseek only when necessary.
>  This optimization applies to 20.2. I was unable to tell if it is already in 
> trunk.
>  2.5 Checksum buffered writes: Currently the checksum is written through a 
> buffered stream of size 512 bytes. This can be increased to a larger size - 
> such as 4 KB - to lower the number of write() system calls and save 
> context-switch overhead.
>  This optimization applies to 20.2. I was unable to tell if it is already in 
> trunk.
> Item 3: Applying HADOOP-6166 - PureJavaCrc32() - from trunk to 20.2
>  This is applicable to 20.2.  It already appears in trunk.
> Performance Experiments Results
> -----------------------------------------------
> Performance experiments showed the following numbers:
> Hadoop Version: 0.20.2
> Server Configs: RHEL5, Quad-core dual-CPU, 16GB RAM, 4 SATA disks
>  $ uname -a
>  Linux gsbl90324.blue.ygrid.yahoo.com 2.6.18-53.1.13.el5 #1 SMP Mon Feb 11 
> 13:27:27 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>  $ cat /proc/cpuinfo
>  model name   : Intel(R) Xeon(R) CPU           L5420  @ 2.50GHz
>  $ cat /etc/issue
>  Red Hat Enterprise Linux Server release 5.1 (Tikanga)
>  Kernel \r on an \m
> Benchmark Details
> --------------------------
> Benchmark Name: DFSIO
> Benchmark Configuration:
>  a) # of maps (writers to DFS) per node. Tried the following values: 1, 2, 3
>  b) # of nodes: Single-node test and 15-node cluster test
> Results Summary
> --------------------------
> a) With all the above optimizations turned on
> All these tests were done with a replication factor of 1. Tests with 
> replication factors of 2 and 3 showed no noticeable improvement, because the 
> gains are hidden by the network bandwidth bottleneck noted above.
> What was measured: Write throughput per client (in MB/s)
> | Test Description                               | Baseline (MB/s) | With improvements (MB/s) | % improvement |
> | 15-node cluster with 1 map (writer) per node    | 103             | 147                      | ~43%          |
> | Single node test with 1 map (writer) per node   | 102             | 148                      | ~45%          |
> | Single node test with 2 maps (writers) per node | 86              | 101                      | ~16%          |
> | Single node test with 3 maps (writers) per node | 67              | 76                       | ~13%          |
>     
> b) With the above optimizations turned on individually
> I ran some experiments by adding and removing items individually to 
> understand the approximate contribution of each item. These are the numbers 
> I got (they are approximate).
> | ITEM     | Title                                          | Improvement in 0.20 | Improvement in trunk |
> | Item 1   | DFSClient: finer-grained locks in writeChunk() | 30%                 | Already in trunk     |
> | Item 2.1 | Concurrency of Disk Writes                     | 25%                 | 15-20%               |
> | Item 2.2 | Bulk Receive and Bulk Send                     | 2%                  | (Have not yet tried) |
> | Item 2.3 | Early Ack                                      | 2%                  | Already in trunk     |
> | Item 2.4 | lseek optimization                             | 2%                  | (Have not yet tried) |
> | Item 2.5 | Checksum buffered writes                       | 2%                  | (Have not yet tried) |
> | Item 3   | Applying HADOOP-6166 - PureJavaCrc32()         | 15%                 | Already in trunk     |
> Patches
> -----------
> I will submit a patch for 0.20.2 shortly (within a day).
> I expect to submit a patch for trunk after review comments on the above patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
