[ https://issues.apache.org/jira/browse/HDFS-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107404#comment-14107404 ]
James Thomas commented on HDFS-6865:
------------------------------------

I ran the tests that [~tlipcon] suggested and have some results. I created buffers of various sizes and repeatedly wrote them using FSDataOutputStream.write(). For each buffer size, I also ran the same test with the FSDataOutputStream wrapped in a BufferedOutputStream. I made sure the packet size and block size were large enough that no actual writes to DataNodes occurred, so the times shown here primarily cover data buffering, checksumming, and packet construction on the client side.

The following times are all in milliseconds. Each test involved writing 8 MB of data to the stream. I only did one run for each of these data points, so there are a few unreproducible outliers (e.g. the 130ms in the 2^8 row), but the results are generally good enough that I didn't think averaging over a large number of runs was necessary.

Some interpretation of the results:

Naturally the time goes down with bigger buffers, since we have fewer instructions (less method call overhead) per byte. At smaller buffer sizes the time for the checksum becomes more and more negligible compared to the other per-byte overheads (after all, the checksum is a handful of instructions per byte even for the Java code), so we don't see much of a difference between the pre- and post-change code.

The main case I was worried about was input buffers (in the non-BufferedOutputStream case) larger than the original FSOutputSummer buffer (512 bytes) and smaller than the current FSOutputSummer buffer (5120 bytes). These incur a buffer copy in the new FSOutputSummer (since there is now space for them in the FSOutputSummer's buffer), but were sent directly to the DFSOutputStream (to be copied into a packet) in the old FSOutputSummer. But the data shows that this case (rows 2^9 and 2^10) is not problematic: clearly the extra buffer copies are offset by the time saved by faster checksumming.
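For anyone who wants to repeat a measurement of this shape without a cluster, the loop described above can be sketched roughly as follows. This is only an illustration of the methodology, not the code used for the numbers in this comment: ChecksumSink and WriteBench are hypothetical names, the sink is a plain java.util.zip.CRC32 stand-in rather than FSOutputSummer or DFSOutputStream, and it does no internal buffering or packet construction, so absolute timings will not match the table.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

/** Hypothetical stand-in sink: checksums whatever it is handed. NOT HDFS code. */
class ChecksumSink extends OutputStream {
    private final CRC32 crc = new CRC32();
    private long bytes;

    @Override
    public void write(int b) {
        crc.update(b);
        bytes++;
    }

    @Override
    public void write(byte[] b, int off, int len) {
        crc.update(b, off, len);
        bytes += len;
    }

    long bytesWritten() { return bytes; }
    long checksum() { return crc.getValue(); }
}

/** Hypothetical harness: writes 8 MB per run using caller buffers of 2^0..2^10 bytes. */
class WriteBench {
    static final int TOTAL_BYTES = 8 * 1024 * 1024;

    /** Writes TOTAL_BYTES to out in bufSize-byte calls; returns elapsed milliseconds. */
    static long timeWrites(OutputStream out, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        long start = System.nanoTime();
        // TOTAL_BYTES is a multiple of every power of two used here.
        for (int written = 0; written < TOTAL_BYTES; written += bufSize) {
            out.write(buf, 0, bufSize);
        }
        out.flush(); // push any bytes still held by a wrapping BufferedOutputStream
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("log2(size)\traw ms\tbuffered ms");
        for (int log = 0; log <= 10; log++) {
            int size = 1 << log;
            long raw = timeWrites(new ChecksumSink(), size);
            long buffered = timeWrites(
                new BufferedOutputStream(new ChecksumSink()), size);
            System.out.println(log + "\t" + raw + "\t" + buffered);
        }
    }
}
```

As with the numbers in this comment, a single run per data point is noisy (JIT warm-up and GC both show up), so repeating each run a few times would be a reasonable refinement.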
||log2(Buffer Size)||pre-change||pre-change w/ BufferedStream||post-change||post-change w/ BufferedStream||
|0|463|258|449|261|
|1|249|125|213|118|
|2|133|61|112|62|
|3|42|16|56|22|
|4|32|21|22|8|
|5|15|14|18|8|
|6|19|9|7|6|
|7|18|28|11|5|
|8|14|15|5|130|
|9|12|12|4|4|
|10|15|8|5|4|

> Byte array native checksumming on client side (HDFS changes)
> ------------------------------------------------------------
>
> Key: HDFS-6865
> URL: https://issues.apache.org/jira/browse/HDFS-6865
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client, performance
> Reporter: James Thomas
> Assignee: James Thomas
> Attachments: HDFS-6865.2.patch, HDFS-6865.3.patch, HDFS-6865.4.patch,
> HDFS-6865.5.patch, HDFS-6865.patch
>
>
> Refactor FSOutputSummer to buffer data and use the native checksum
> calculation functionality introduced in HADOOP-10975.

--
This message was sent by Atlassian JIRA
(v6.2#6252)