[
https://issues.apache.org/jira/browse/HDFS-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037675#comment-14037675
]
James Thomas commented on HDFS-6561:
------------------------------------
It seems like the best approach here would be to add some buffering to
FSOutputSummer. I did some testing, and going to native code appears to be
strictly faster than using the incremental summer as long as flushes happen no
more often than every 100 bytes. More specifically, the native code has no
option for incremental checksumming (it could be added, but I don't think it's
necessary). So if we compute a checksum of a partial chunk, we will need to
recompute the entire checksum (rather than just the checksum of the newly added
bytes) when the chunk either 1) fills up or 2) grows by a few bytes and is then
flushed again. But the native checksum on 512 bytes (the chunk
size and the maximum possible size of a partial chunk) takes at most 7
microseconds on my machine, and the incremental Java checksum requires around 7
microseconds for 100 bytes. So as long as it's reasonable to assume flushes
won't happen more often than every 100 bytes, I propose that we eliminate the
incremental summer from FSOutputSummer and always use the native code for byte
arrays.
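To make the trade-off concrete, here is a minimal sketch of the two strategies
for a partial chunk. It uses java.util.zip.CRC32 as a stand-in for both the
incremental Java summer and the native checksum, and the class and method names
are hypothetical, not taken from FSOutputSummer. It only shows that recomputing
from the start of the chunk yields the same value as feeding the new bytes into
running state; it does not reproduce the timings above.

```java
import java.util.zip.CRC32;

// Hypothetical illustration only; not the actual FSOutputSummer code.
final class PartialChunkChecksum {
    static final int CHUNK_SIZE = 512; // chunk size quoted in this comment

    // Incremental strategy: keep running state and feed only the new bytes.
    static long incremental(CRC32 state, byte[] chunk, int off, int len) {
        state.update(chunk, off, len);
        return state.getValue();
    }

    // Recompute strategy: checksum the whole partial chunk from byte 0,
    // which is what a checksum call without incremental support must do.
    static long recompute(byte[] chunk, int filled) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, filled);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[CHUNK_SIZE];
        for (int i = 0; i < chunk.length; i++) {
            chunk[i] = (byte) i;
        }
        CRC32 running = new CRC32();
        int filled = 0;
        // Three flushes that together fill one 512-byte chunk.
        for (int flushSize : new int[] {100, 100, 312}) {
            long inc = incremental(running, chunk, filled, flushSize);
            filled += flushSize;
            long full = recompute(chunk, filled);
            System.out.println(filled + " bytes: match=" + (inc == full));
        }
    }
}
```

The recompute path touches at most 512 bytes per flush, which is the quantity
the 7 microsecond figure above bounds.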
Buffering FSOutputSummer will mean that writes on the wire are more bursty and
data is kept in client memory longer, but Todd mentioned that many clients
already wrap DFSOutputStream in a BufferedWriter anyway. To achieve maximum
performance in the native code, we need the buffer to be at least a few
thousand bytes. The larger the buffer, the better, since we can amortize away
the cost of crossing the JNI boundary (which seems to be quite small actually,
on the order of a microsecond at most).
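A rough sketch of the buffering idea follows. Writes accumulate in a buffer
spanning several chunks, and all chunks are checksummed in one batch when the
buffer fills or the caller flushes. Everything here is an assumption for
illustration: the class name, the chunksPerBuffer parameter, and the use of
java.util.zip.CRC32 in place of the native call. It also simplifies partial
chunks by checksumming a trailing partial chunk as-is on flush, whereas a real
stream would have to recompute that chunk once more bytes arrive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch of a buffered summer; not the actual HDFS classes.
final class BufferedSummerSketch {
    static final int CHUNK_SIZE = 512;

    private final byte[] buf;
    private int count = 0;

    final List<Long> checksums = new ArrayList<>(); // one entry per chunk
    int batchCalls = 0; // stands in for the number of JNI crossings

    BufferedSummerSketch(int chunksPerBuffer) {
        buf = new byte[chunksPerBuffer * CHUNK_SIZE];
    }

    // Copy bytes into the buffer; checksum only when the buffer is full.
    void write(byte[] src, int off, int len) {
        while (len > 0) {
            int n = Math.min(len, buf.length - count);
            System.arraycopy(src, off, buf, count, n);
            count += n;
            off += n;
            len -= n;
            if (count == buf.length) {
                flush();
            }
        }
    }

    // Checksum every buffered chunk in one batch, then reset the buffer.
    void flush() {
        if (count == 0) {
            return;
        }
        batchCalls++;
        for (int pos = 0; pos < count; pos += CHUNK_SIZE) {
            CRC32 crc = new CRC32();
            crc.update(buf, pos, Math.min(CHUNK_SIZE, count - pos));
            checksums.add(crc.getValue());
        }
        count = 0;
    }
}
```

With an 8-chunk (4096-byte) buffer, 64 writes of 64 bytes each trigger a single
batch covering 8 chunks, so the per-call overhead is paid once rather than 64
times.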
What do people think about this?
> Byte array native checksumming on client side
> ---------------------------------------------
>
> Key: HDFS-6561
> URL: https://issues.apache.org/jira/browse/HDFS-6561
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, hdfs-client, performance
> Reporter: James Thomas
> Assignee: James Thomas
>