[ 
https://issues.apache.org/jira/browse/HDFS-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037675#comment-14037675
 ] 

James Thomas commented on HDFS-6561:
------------------------------------

It seems like the best approach here would be to add some buffering to 
FSOutputSummer. I did some testing, and going to native code is strictly 
faster than using the incremental summer as long as flushes happen no more 
often than every 100 bytes. The native code has no option for incremental 
checksumming (it could be added, but I don't think it's necessary). So if we 
compute the checksum of a partial chunk, we have to recompute the checksum 
over the entire chunk, rather than over just the newly added bytes, when the 
chunk either 1) fills up or 2) grows by a few bytes and is flushed again. But 
the native checksum on 512 bytes (the chunk size, and therefore the maximum 
size of a partial chunk) takes at most 7 microseconds on my machine, while 
the incremental Java checksum needs around 7 microseconds for 100 bytes. So 
as long as it's reasonable to assume flushes won't happen more often than 
every 100 bytes, I propose that we eliminate the incremental summer from 
FSOutputSummer and always use the native code for byte arrays.
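
To make the two strategies concrete, here is a minimal sketch of "keep the 
checksum state and feed only the new bytes" versus "recompute over the whole 
partial chunk". It is only an illustration: java.util.zip.CRC32 stands in for 
both the incremental summer and the native code, and the class and method 
names are made up for this example, not taken from FSOutputSummer.

```java
import java.util.zip.CRC32;

// Hypothetical illustration class; not part of Hadoop.
class PartialChunkChecksum {
    static final int CHUNK_SIZE = 512; // chunk size quoted in the comment

    // Incremental style: the CRC state survives the earlier flush, so the
    // second update only has to touch the newly added bytes.
    static long incremental(byte[] chunk, int alreadySummed, int total) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, alreadySummed);                      // earlier flush
        crc.update(chunk, alreadySummed, total - alreadySummed);  // new bytes only
        return crc.getValue();
    }

    // Recompute style: no state is kept, so the whole partial chunk
    // (at most CHUNK_SIZE bytes) is checksummed again from the start.
    static long recompute(byte[] chunk, int total) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, total);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[CHUNK_SIZE];
        for (int i = 0; i < chunk.length; i++) {
            chunk[i] = (byte) (i * 31);
        }
        // Both strategies must agree on the checksum; only the cost differs.
        System.out.println(incremental(chunk, 100, 300) == recompute(chunk, 300));
    }
}
```

Both paths give the same value, so the choice is purely about cost: the 
recompute path touches at most 512 bytes per flush, which is where the 
"at most 7 microseconds" bound above comes from.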

Buffering FSOutputSummer will mean that writes on the wire are more bursty and 
data is kept in client memory longer, but Todd mentioned that many clients 
already wrap DFSOutputStream in a BufferedWriter anyway. To achieve maximum 
performance in the native code, we need the buffer to be at least a few 
thousand bytes. The larger the buffer, the better, since we can amortize away 
the cost of crossing the JNI boundary (which seems to be quite small actually, 
on the order of a microsecond at most).

What do people think about this?

> Byte array native checksumming on client side
> ---------------------------------------------
>
>                 Key: HDFS-6561
>                 URL: https://issues.apache.org/jira/browse/HDFS-6561
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, hdfs-client, performance
>            Reporter: James Thomas
>            Assignee: James Thomas
>




