[
https://issues.apache.org/jira/browse/HADOOP-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249035#comment-13249035
]
Todd Lipcon commented on HADOOP-8258:
-------------------------------------
In current versions of Hadoop, the read path for applications like HBase often
looks like:
- allocate a byte array for an HFile block (~64 KB)
- call read() into that byte array:
-- copy 1: read() packets from the socket into a direct buffer provided by the
DirectBufferPool
-- copy 2: copy from the direct buffer pool into the provided byte[]
- call setInput on a decompressor
-- copy 3: copy from the byte[] back to a direct buffer inside the codec
implementation
- call decompress:
-- JNI code accesses the input buffer and writes to the output buffer
-- copy 4: from the output buffer back into the byte[] for the uncompressed
hfile block
-- inefficiency: HBase now does its own checksumming. Since it has to checksum
the byte[], it can't easily use the SSE-enabled checksum path.
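To make the four copies above concrete, here is a minimal sketch in plain Java. It is not the Hadoop or HBase code: java.util.zip.Inflater stands in for a native codec, a ReadableByteChannel stands in for the DataNode socket, and the class and method names (ByteArrayReadPathSketch, readBlock, deflate) are made up for illustration. It assumes Java 11 or later for the ByteBuffer overloads of Inflater.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of the byte[]-based read path; not Hadoop code.
final class ByteArrayReadPathSketch {

    static byte[] readBlock(ReadableByteChannel socket, int compressedLen, int uncompressedLen)
            throws IOException, DataFormatException {
        // copy 1: socket -> direct buffer (stand-in for the DirectBufferPool buffer)
        ByteBuffer pooled = ByteBuffer.allocateDirect(compressedLen);
        while (pooled.hasRemaining()) {
            if (socket.read(pooled) < 0) {
                throw new IOException("unexpected end of stream");
            }
        }
        pooled.flip();

        // copy 2: direct buffer -> the caller's byte[]
        byte[] compressed = new byte[compressedLen];
        pooled.get(compressed);

        // copy 3: byte[] -> direct buffer inside the codec (what setInput does)
        ByteBuffer codecIn = ByteBuffer.allocateDirect(compressedLen);
        codecIn.put(compressed);
        codecIn.flip();

        // decompress: native code reads codecIn and writes codecOut
        ByteBuffer codecOut = ByteBuffer.allocateDirect(uncompressedLen);
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(codecIn);
            while (!inflater.finished() && codecOut.hasRemaining()) {
                int n = inflater.inflate(codecOut);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop instead of spinning
                }
            }
        } finally {
            inflater.end();
        }
        codecOut.flip();

        // copy 4: codec output buffer -> byte[] for the uncompressed block
        byte[] block = new byte[codecOut.remaining()];
        codecOut.get(block);
        return block;
    }

    // Test helper: zlib-compress a byte[] so there is something to read back.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = new byte[64 * 1024];
        Arrays.fill(raw, (byte) 7);
        byte[] compressed = deflate(raw);
        byte[] block = readBlock(
                Channels.newChannel(new ByteArrayInputStream(compressed)),
                compressed.length, raw.length);
        System.out.println("round trip ok: " + Arrays.equals(raw, block));
    }
}
```

Every block read pays for two byte[] allocations' worth of copying on top of the socket read and the decompression itself.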
Given the new direct-buffer read support introduced by HDFS-2834, we can remove
copies #2 and #3:
- allocate a DirectBuffer for the compressed hfile block, and one for the
uncompressed block (we know the size from the hfile block header)
- call read() into the direct buffer using the HDFS-2834 API
-- copy 1: read() packets from the socket into that buffer
- call setInput() with that buffer. No copies are necessary
- call decompress:
-- JNI code accesses the input buffer and writes directly to the output buffer,
with no copies
- HBase now has the uncompressed block as a direct buffer. It can use the
SSE-enabled checksum for better efficiency
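The direct-buffer flow above could look roughly like the following sketch. Again this is plain Java under stated assumptions, not the proposed Hadoop interface: java.util.zip.Inflater (Java 11+ ByteBuffer overloads) stands in for the codec, a ReadableByteChannel stands in for the HDFS-2834 read(ByteBuffer) call, java.util.zip.CRC32C stands in for the SSE-enabled checksum, and all class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.Arrays;
import java.util.zip.CRC32C;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of the direct-buffer read path; not Hadoop code.
final class DirectBufferReadPathSketch {

    static ByteBuffer readBlock(ReadableByteChannel socket, int compressedLen, int uncompressedLen)
            throws IOException, DataFormatException {
        // Both sizes are assumed known up front (e.g. from the block header).
        ByteBuffer compressed = ByteBuffer.allocateDirect(compressedLen);
        ByteBuffer block = ByteBuffer.allocateDirect(uncompressedLen);

        // copy 1: socket -> the caller's direct buffer
        while (compressed.hasRemaining()) {
            if (socket.read(compressed) < 0) {
                throw new IOException("unexpected end of stream");
            }
        }
        compressed.flip();

        // setInput + decompress: the codec works on the buffers it is handed,
        // with no intermediate byte[] staging
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            while (!inflater.finished() && block.hasRemaining()) {
                int n = inflater.inflate(block);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input; stop instead of spinning
                }
            }
        } finally {
            inflater.end();
        }
        block.flip();
        return block; // uncompressed block, still a direct buffer
    }

    // Checksum the block in place; duplicate() leaves the caller's position untouched.
    static long checksum(ByteBuffer block) {
        CRC32C crc = new CRC32C();
        crc.update(block.duplicate());
        return crc.getValue();
    }

    // Test helper: zlib-compress a byte[] so there is something to read back.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = new byte[64 * 1024];
        Arrays.fill(raw, (byte) 7);
        byte[] compressed = deflate(raw);
        ByteBuffer block = readBlock(
                Channels.newChannel(new ByteArrayInputStream(compressed)),
                compressed.length, raw.length);
        System.out.println("bytes: " + block.remaining() + ", crc32c: " + checksum(block));
    }
}
```

The caller never materializes a byte[]; the only bulk copy left in this sketch is the socket read itself.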
This should improve the performance of HBase significantly. We may also be able
to use the new API from within SequenceFile and other compressible file formats
to avoid two copies from the read path.
The same applies to the write path, but in my experience the write path is less
often CPU-constrained, so I'd prefer to concentrate on the read path first.
> Add interfaces for compression codecs to use direct byte buffers
> ----------------------------------------------------------------
>
> Key: HADOOP-8258
> URL: https://issues.apache.org/jira/browse/HADOOP-8258
> Project: Hadoop Common
> Issue Type: New Feature
> Components: io, native, performance
> Affects Versions: 3.0.0
> Reporter: Todd Lipcon
>
> Currently, the codec interface only provides input/output functions based on
> byte arrays. Given that most of the codecs are implemented in native code,
> this necessitates two extra copies - one to copy the input data to a direct
> buffer, and one to copy the output data back to a byte array. We should add
> interfaces to Decompressor/Compressor that can work directly with direct byte
> buffers to avoid these copies.