[jira] [Commented] (HDFS-6865) Byte array native checksumming on client side (HDFS changes)

James Thomas (JIRA) Fri, 22 Aug 2014 13:10:45 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107404#comment-14107404
 ]


James Thomas commented on HDFS-6865:
------------------------------------

I ran the tests that [~tlipcon] suggested and have some results. I created 
buffers of various sizes and repeatedly wrote them using 
FSDataOutputStream.write(). For each buffer size, I also ran a variant with the 
FSDataOutputStream wrapped in a BufferedOutputStream. I made sure the packet 
and block sizes were large enough that no actual writes to DataNodes occurred, 
so the times shown here primarily cover data buffering, checksumming, and 
packet construction on the client side.

The following times are all in milliseconds. Each test involved writing 8 MB of 
data to the stream. I only did one run for each of these data points, so there 
are a few unreproducible outliers (e.g. the 130ms in the 2^8 row), but the 
results are generally good enough that I didn't think averaging over a large 
number of runs was necessary.
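
For readers who want to try something similar, here is a rough sketch of the 
shape of such a benchmark. It is not the code I ran: it uses only java.io and 
java.util.zip, with a CheckedOutputStream over a discarding sink standing in 
for FSDataOutputStream, so the absolute numbers will not match the table. The 
names WriteBenchSketch, NullSink and timeWrites are made up for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

public class WriteBenchSketch {
    // 8 MB per data point, as in the tests described above.
    static final int TOTAL_BYTES = 8 * 1024 * 1024;

    /** Discards all bytes; a stand-in (assumption) for a stream that never reaches a DataNode. */
    static final class NullSink extends OutputStream {
        @Override public void write(int b) {}
        @Override public void write(byte[] b, int off, int len) {}
    }

    /** Writes TOTAL_BYTES to out in chunks of bufSize and returns the elapsed milliseconds. */
    static long timeWrites(OutputStream out, int bufSize) {
        byte[] buf = new byte[bufSize];
        long start = System.nanoTime();
        try {
            for (int written = 0; written < TOTAL_BYTES; written += bufSize) {
                out.write(buf, 0, bufSize);
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Buffer sizes 2^0 .. 2^10 bytes, each with and without a BufferedOutputStream wrapper.
        for (int log2 = 0; log2 <= 10; log2++) {
            int bufSize = 1 << log2;
            OutputStream direct = new CheckedOutputStream(new NullSink(), new CRC32());
            OutputStream buffered = new BufferedOutputStream(
                new CheckedOutputStream(new NullSink(), new CRC32()));
            System.out.printf("2^%d: direct=%d ms, buffered=%d ms%n",
                log2, timeWrites(direct, bufSize), timeWrites(buffered, bufSize));
        }
    }
}
```

As with my numbers, a single run per data point is noisy (JIT warm-up, GC), so 
treat the output as indicative only.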

Some interpretation of the results: Naturally the time goes down with bigger 
buffers since we have fewer instructions (less method call overhead) per byte. 
At smaller buffer sizes the time for the checksum becomes more and more 
negligible compared to the other overheads per byte (after all, the checksum is 
a handful of instructions per byte even for the Java code), so we don't see 
much of a difference between the pre- and post-change code. The main case I was 
worried about was for input buffers (in the non-BufferedOutputStream case) 
larger than the original FSOutputSummer buffer (512 bytes) and smaller than the 
current FSOutputSummer buffer (5120 bytes), because these incur a buffer copy 
in the new FSOutputSummer (since there is now space for them in the 
FSOutputSummer's buffer) but were sent directly to the DFSOutputStream (to be 
copied into a packet) in the old FSOutputSummer. But the data shows that this 
case (rows 9 and 10 in the table, i.e. 2^9 and 2^10 bytes) is not problematic: 
the extra buffer copies appear to be offset by the time saved by faster 
checksumming.
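
To make the buffer-copy case concrete, here is a simplified model of the 
decision, not the actual FSOutputSummer code. It assumes a write bypasses the 
local buffer only when that buffer is empty and the input is at least as large 
as it; the class and method names are hypothetical, and the two buffer sizes 
are the ones quoted above.

```java
public class SummerBufferSketch {
    static final int OLD_BUFFER_BYTES = 512;   // original FSOutputSummer buffer, per the text above
    static final int NEW_BUFFER_BYTES = 5120;  // post-change FSOutputSummer buffer, per the text above

    /**
     * Assumed rule (illustrative): the input is checksummed and passed on directly
     * only if nothing is buffered yet and the input fills the local buffer;
     * otherwise it is first copied into the local buffer.
     */
    static boolean incursBufferCopy(int bytesAlreadyBuffered, int inputLen, int localBufferBytes) {
        boolean sentDirectly = bytesAlreadyBuffered == 0 && inputLen >= localBufferBytes;
        return !sentDirectly;
    }

    public static void main(String[] args) {
        // Print, for each input size, whether a copy happens before and after the change.
        for (int log2 = 8; log2 <= 13; log2++) {
            int len = 1 << log2;
            System.out.printf("2^%d bytes: copy pre-change=%b, copy post-change=%b%n", log2,
                incursBufferCopy(0, len, OLD_BUFFER_BYTES),
                incursBufferCopy(0, len, NEW_BUFFER_BYTES));
        }
    }
}
```

Under this model, inputs of 2^9 and 2^10 bytes are the ones that go from "no 
copy" to "copy" across the change, which is why those rows matter.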

||log2(Buffer Size)||pre-change||pre-change w/ BufferedStream||post-change||post-change w/ BufferedStream||
|0|463|258|449|261|
|1|249|125|213|118|
|2|133|61|112|62|
|3|42|16|56|22|
|4|32|21|22|8|
|5|15|14|18|8|
|6|19|9|7|6|
|7|18|28|11|5|
|8|14|15|5|130|
|9|12|12|4|4|
|10|15|8|5|4|


> Byte array native checksumming on client side (HDFS changes)
> ------------------------------------------------------------
>
>                 Key: HDFS-6865
>                 URL: https://issues.apache.org/jira/browse/HDFS-6865
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client, performance
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: HDFS-6865.2.patch, HDFS-6865.3.patch, HDFS-6865.4.patch, 
> HDFS-6865.5.patch, HDFS-6865.patch
>
>
> Refactor FSOutputSummer to buffer data and use the native checksum 
> calculation functionality introduced in HADOOP-10975.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
