[
https://issues.apache.org/jira/browse/HDFS-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831452#action_12831452
]
Zlatin Balevsky commented on HDFS-959:
--------------------------------------
How does item 1 relate to HDFS-916 and HDFS-918?
+1 on items 2.*
> Performance improvements to DFSClient and DataNode for faster DFS write at
> replication factor of 1
> --------------------------------------------------------------------------------------------------
>
> Key: HDFS-959
> URL: https://issues.apache.org/jira/browse/HDFS-959
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node, hdfs client
> Affects Versions: 0.20.2, 0.22.0
> Environment: RHEL5 on Dual CPU quad-core Intel servers, 16 GB RAM, 4
> SATA disks.
> Reporter: Naredula Janardhana Reddy
> Fix For: 0.20.2, 0.22.0
>
>
> The following improvements are suggested to DFSClient and DataNode to improve
> DFS write throughput, based on experimental verification with a replication
> factor of 1.
> The changes are useful in principle for replication factors of 2 and 3 as
> well, but they do not currently demonstrate noticeable performance
> improvement in our test-bed because of a network throughput bottleneck that
> hides the benefit of these changes.
> All changes are applicable to 0.20.2. Some of them are applicable to trunk,
> as noted below. I have not verified applicability to 0.21.
> List of Improvements
> -----------------------------
> Item 1: DFSClient: finer-grain locks in writeChunk(). Currently the lock is
> held at the chunk level (512 bytes). It can be moved to the packet level
> (64 KB) to lower the frequency of locking.
> This optimization applies to 20.2. It already appears in trunk.
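> As illustration, here is a minimal, self-contained sketch of the Item 1 idea
> (class and field names are hypothetical, not the actual DFSClient code):
> chunks are staged lock-free into a packet-sized buffer, and the shared queue
> lock is taken once per 64 KB packet instead of once per 512-byte chunk.
>
> import java.util.ArrayDeque;
> import java.util.Queue;
>
> /** Hypothetical sketch of Item 1: lock per 64 KB packet, not per 512-byte chunk. */
> class PacketBufferingWriter {
>     private static final int PACKET_SIZE = 64 * 1024;        // one packet = 64 KB
>     private final Queue<byte[]> dataQueue = new ArrayDeque<>();
>     private final byte[] packetBuf = new byte[PACKET_SIZE];
>     private int packetLen = 0;
>
>     /** Called once per 512-byte chunk; locks only on packet boundaries. */
>     void writeChunk(byte[] chunk, int off, int len) {
>         System.arraycopy(chunk, off, packetBuf, packetLen, len); // lock-free staging
>         packetLen += len;
>         if (packetLen == PACKET_SIZE) {                // 64 KB is a multiple of 512
>             byte[] packet = packetBuf.clone();
>             synchronized (dataQueue) {                 // one lock per packet
>                 dataQueue.add(packet);
>                 dataQueue.notifyAll();                 // wake the sender thread
>             }
>             packetLen = 0;
>         }
>     }
> }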
> Item 2: Misc. improvements to DataNode
> 2.1: Concurrency of Disk Writes: Checksum verification and writing data to
> disk can be moved to a separate thread ("Disk Write Thread"). This allows
> the existing "network thread" to send acks to the DFSClient sooner, and it
> also lets the packet be forwarded to the replication node faster. In
> effect, DataNode can consume packets at a higher rate.
> This optimization applies to 20.2 and trunk.
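> A minimal sketch of the 2.1 threading model, assuming a bounded queue
> between the network thread and the proposed Disk Write Thread (all names
> here are illustrative):
>
> import java.util.concurrent.ArrayBlockingQueue;
> import java.util.concurrent.BlockingQueue;
>
> /** Hypothetical sketch of Item 2.1: decouple packet receipt from
>  *  checksum verification and disk writes via a queue and a thread. */
> class DiskWritePipeline {
>     private final BlockingQueue<byte[]> toDisk = new ArrayBlockingQueue<>(128);
>
>     DiskWritePipeline() {
>         Thread diskWriter = new Thread(() -> {
>             try {
>                 while (true) {
>                     byte[] packet = toDisk.take();   // blocks until a packet arrives
>                     verifyChecksumAndWrite(packet);  // slow work, off the network thread
>                 }
>             } catch (InterruptedException e) {
>                 Thread.currentThread().interrupt();  // exit on shutdown
>             }
>         }, "Disk Write Thread");
>         diskWriter.setDaemon(true);
>         diskWriter.start();
>     }
>
>     /** Called by the network thread: enqueue and return immediately, so the
>      *  ack and the forward to the replication node are not delayed by the disk. */
>     void onPacketReceived(byte[] packet) throws InterruptedException {
>         toDisk.put(packet);                          // back-pressure if the disk lags
>     }
>
>     private void verifyChecksumAndWrite(byte[] packet) {
>         // checksum verification and the FileOutputStream write would go here
>     }
> }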
> 2.2: Bulk Receive and Bulk Send: This optimization is enabled by 2.1: since
> a buffer now sits between the (existing) network thread and the (newly
> added) Disk Write Thread, DataNode can receive more than one packet at a
> time.
> This optimization applies to 20.2 and trunk.
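> A sketch of the bulk-receive side of 2.2, reusing the queue from the 2.1
> sketch: BlockingQueue.drainTo lets the Disk Write Thread pick up every
> packet that has accumulated, so a whole batch can go to disk in one write
> (names are again illustrative):
>
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.BlockingQueue;
>
> class BulkDrain {
>     /** Drain a whole batch so many small writes become one bulk write. */
>     static List<byte[]> nextBatch(BlockingQueue<byte[]> toDisk) throws InterruptedException {
>         List<byte[]> batch = new ArrayList<>();
>         batch.add(toDisk.take());   // block for the first packet
>         toDisk.drainTo(batch);      // then grab everything else already queued
>         return batch;               // caller writes the batch with one disk write
>     }
> }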
> 2.3: Early Ack: The proposed optimization is to send out acks to the client
> as soon as possible instead of waiting for the disk write. Note that the
> last ack is an exception: it is sent only after the data has been flushed to
> the OS.
> This optimization applies to 20.2. It already appears in trunk.
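> A sketch of the 2.3 ack ordering, under the same queue assumption;
> sendAck() and awaitFlush() are hypothetical stand-ins for the real socket
> and flush plumbing:
>
> import java.util.concurrent.ArrayBlockingQueue;
> import java.util.concurrent.BlockingQueue;
>
> /** Hypothetical sketch of Item 2.3 (early ack). */
> class EarlyAckReceiver {
>     private final BlockingQueue<byte[]> toDisk = new ArrayBlockingQueue<>(128);
>
>     void handlePacket(byte[] packet, boolean lastPacketInBlock) throws InterruptedException {
>         if (!lastPacketInBlock) {
>             sendAck();          // early ack: the client is unblocked before the disk write
>             toDisk.put(packet);
>         } else {
>             toDisk.put(packet);
>             awaitFlush();       // the exception: the last ack waits for the OS flush
>             sendAck();
>         }
>     }
>
>     private void sendAck()    { /* write the ack back to the upstream socket */ }
>     private void awaitFlush() { /* block until the disk thread has flushed */ }
> }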
> 2.4: lseek optimization: Currently lseek (the system call) is issued before
> every disk write, which is unnecessary when the write is sequential. The
> proposed optimization calls lseek only when necessary.
> This optimization applies to 20.2. I was unable to tell if it is already in
> trunk.
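> A sketch of the 2.4 bookkeeping, assuming block writes go through a
> RandomAccessFile (names are illustrative): track the expected file position
> and seek only when an incoming write is non-sequential.
>
> import java.io.IOException;
> import java.io.RandomAccessFile;
>
> /** Hypothetical sketch of Item 2.4: skip the seek for sequential writes. */
> class SequentialBlockWriter {
>     private final RandomAccessFile out;
>     private long expectedPos;                    // where the file pointer already is
>
>     SequentialBlockWriter(RandomAccessFile out) throws IOException {
>         this.out = out;
>         this.expectedPos = out.getFilePointer();
>     }
>
>     void writeAt(long offset, byte[] buf, int off, int len) throws IOException {
>         if (offset != expectedPos) {
>             out.seek(offset);                    // lseek only on non-sequential writes
>             expectedPos = offset;
>         }
>         out.write(buf, off, len);                // sequential writes skip the seek
>         expectedPos += len;
>     }
> }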
> 2.5: Checksum buffered writes: Currently the checksum is written through a
> buffered stream with a 512-byte buffer. This can be increased - to 4 KB, for
> example - to lower the number of write() system calls and save context-switch
> overhead.
> This optimization applies to 20.2. I was unable to tell if it is already in
> trunk.
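> For 2.5 the change amounts to a larger buffer on the checksum (.meta)
> stream; a sketch (the path argument and factory name are hypothetical):
>
> import java.io.BufferedOutputStream;
> import java.io.FileOutputStream;
> import java.io.IOException;
> import java.io.OutputStream;
>
> class ChecksumStreamFactory {
>     /** A 4 KB buffer instead of 512 bytes cuts write() system calls ~8x. */
>     static OutputStream openChecksumStream(String metaFilePath) throws IOException {
>         return new BufferedOutputStream(new FileOutputStream(metaFilePath), 4096);
>     }
> }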
> Item 3: Applying HADOOP-6166 - PureJavaCrc32() - from trunk to 20.2
> This is applicable to 20.2. It already appears in trunk.
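> For reference, PureJavaCrc32 (added by HADOOP-6166) implements
> java.util.zip.Checksum, so it is a drop-in replacement for
> java.util.zip.CRC32 wherever chunk checksums are computed; a small usage
> sketch:
>
> import java.util.zip.Checksum;
> import org.apache.hadoop.util.PureJavaCrc32;
>
> class Crc32Demo {
>     public static void main(String[] args) {
>         Checksum crc = new PureJavaCrc32();   // avoids CRC32's JNI call overhead
>         byte[] chunk = new byte[512];         // one checksum chunk
>         crc.update(chunk, 0, chunk.length);
>         System.out.println(Long.toHexString(crc.getValue()));
>     }
> }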
> Performance Experiments Results
> -----------------------------------------------
> Performance experiments showed the following numbers:
> Hadoop Version: 0.20.2
> Server Configs: RHEL5, Quad-core dual-CPU, 16GB RAM, 4 SATA disks
> $ uname -a
> Linux gsbl90324.blue.ygrid.yahoo.com 2.6.18-53.1.13.el5 #1 SMP Mon Feb 11
> 13:27:27 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
> $ cat /proc/cpuinfo
> model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz
> $ cat /etc/issue
> Red Hat Enterprise Linux Server release 5.1 (Tikanga)
> Kernel \r on an \m
> Benchmark Details
> --------------------------
> Benchmark Name: DFSIO
> Benchmark Configuration:
> a) # maps (writers to DFS per node). Tried the following values: 1,2,3
> b) # of nodes: Single-node test and 15-node cluster test
> Results Summary
> --------------------------
> a) With all the above optimizations turned on
> All these tests were done with replication factor of 1. Tests with
> replication factors of 2 and 3 showed no noticeable improvement, because
> the gains are hidden by the network bandwidth bottleneck, as noted above.
> What was measured: Write throughput per client (in MB/s)
> | Test Description | Baseline (MB/s) | With improvements (MB/s) | % improvement |
> | 15-node cluster with 1 map (writer) per node | 103 | 147 | ~43% |
> | Single-node test with 1 map (writer) per node | 102 | 148 | ~45% |
> | Single-node test with 2 maps (writers) per node | 86 | 101 | ~16% |
> | Single-node test with 3 maps (writers) per node | 67 | 76 | ~13% |
>
> b) With the above optimizations turned on individually
> I ran some experiments, adding and removing items individually, to
> understand the approximate range of performance contribution from each item.
> These are the numbers I got (they are approximate):
> | Item | Title | Improvement in 0.20 | Improvement in trunk |
> | Item 1 | DFSClient: finer-grain locks in writeChunk() | 30% | Already in trunk |
> | Item 2.1 | Concurrency of Disk Writes | 25% | 15-20% |
> | Item 2.2 | Bulk Receive and Bulk Send | 2% | (Have not yet tried) |
> | Item 2.3 | Early Ack | 2% | Already in trunk |
> | Item 2.4 | lseek optimization | 2% | (Have not yet tried) |
> | Item 2.5 | Checksum buffered writes | 2% | (Have not yet tried) |
> | Item 3 | Applying HADOOP-6166 - PureJavaCrc32() | 15% | Already in trunk |
> Patches
> -----------
> I will submit a patch for 0.20.2 shortly (in a day).
> I expect to submit a patch for trunk after review comments on the above patch.