Possible performance enhancement in Hadoop compress module
----------------------------------------------------------

                 Key: HADOOP-4196
                 URL: https://issues.apache.org/jira/browse/HADOOP-4196
             Project: Hadoop Core
          Issue Type: Improvement
          Components: io
    Affects Versions: 0.18.0
            Reporter: Hong Tang


There are several performance problems in the implementation of the current Hadoop 
compression module. Generally, the opportunities all come from the fact that 
the granularities of I/O operations from the CompressionStream and 
DecompressionStream are not controllable by the user, and thus users are 
forced to attach a BufferedInputStream or BufferedOutputStream to both ends of 
the CompressionStream and DecompressionStream (a sketch of this double 
buffering follows the list below):
- ZlibCompressor: always returns false from needsInput() after setInput(), which 
leads to a native deflateBytesDirect() call for almost every write() 
operation from CompressorStream. This becomes problematic when applications 
call write() on the CompressorStream with small write sizes (e.g. one byte at a 
time). It would be better to follow a code path similar to LzoCompressor's and 
append to an internal uncompressed data buffer.
- CompressorStream: whenever the compressor produces some compressed data, it 
directly issues write() calls to the downstream. This could be improved by 
continuing to append to the internal byte[] until it is full (or half full) 
before writing to the downstream. Otherwise, applications have to use a 
BufferedOutputStream as the downstream in case the output sizes from 
CompressorStream are too small, which generally causes double buffering.
- BlockCompressorStream: a similar issue to the one described above.
- BlockDecompressorStream: getCompressedData() reads only one compressed chunk 
at a time. It would be better to read a full buffer and then obtain compressed 
chunks from that buffer (similar to what DecompressorStream does, but admittedly 
a bit more complicated).

In general, the following could serve as guidelines for the design and 
implementation of Compressor/Decompressor and CompressorStream/DecompressorStream 
that can give users some performance guarantees:
- Compressor and Decompressor keep two DirectByteBuffers, whose sizes should be 
tuned to be optimal for the specific compression/decompression algorithm. Ensure 
that Compressor.compress() is always called with a full (or nearly full) 
uncompressed-data DirectByteBuffer.
- CompressorStream and DecompressorStream maintain a byte[] to read data from 
(or write data to) the downstream. The size of the byte[] should be user 
customizable (add a bufferSize parameter to CompressionCodec's createInputStream 
and createOutputStream interfaces; see the sketch after this list). Ensure that 
I/O to and from the downstream happens at or near the granularity of the size of 
the byte[], so that applications can simply rely on the buffering inside 
CompressorStream and DecompressorStream (in the case of LZO: 
BlockCompressorStream and BlockDecompressorStream).
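
As a purely hypothetical sketch of the second guideline, the bufferSize 
parameter might be exposed along these lines; these signatures are not part of 
the existing CompressionCodec interface and are only meant to illustrate the 
proposal:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.io.compress.CompressionInputStream;
    import org.apache.hadoop.io.compress.CompressionOutputStream;

    // Hypothetical companion to CompressionCodec: the same factory methods,
    // plus an explicit buffer size so callers need not wrap the streams
    // in Buffered*Stream themselves.
    public interface BufferedCompressionCodec {
      // The returned stream would issue write() calls to 'out' at (or near)
      // bufferSize granularity.
      CompressionOutputStream createOutputStream(OutputStream out, int bufferSize)
          throws IOException;

      // The returned stream would issue read() calls to 'in' at (or near)
      // bufferSize granularity.
      CompressionInputStream createInputStream(InputStream in, int bufferSize)
          throws IOException;
    }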

A more radical change would be to let the downstream InputStream directly 
deposit data into a ByteBuffer, or the downstream OutputStream accept input data 
from a ByteBuffer. We might call these ByteBufferInputStream and 
ByteBufferOutputStream. CompressorStream and DecompressorStream could simply 
test whether the downstream indeed implements such an interface and bypass 
their own byte[] buffer if so.
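
A minimal sketch of how that test-and-bypass might look on the output side (all 
names below are hypothetical, not existing Hadoop classes, and the fallback 
buffer size is arbitrary):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.ByteBuffer;

    // Hypothetical marker interface: a stream that can consume a ByteBuffer
    // directly, without an intermediate byte[] copy.
    interface ByteBufferWritable {
      void write(ByteBuffer buf) throws IOException;
    }

    // Sketch of the compressed-data flush path inside a compressor stream.
    class CompressedChunkWriter {
      private final OutputStream out;
      private final byte[] copyBuf = new byte[64 * 1024];

      CompressedChunkWriter(OutputStream out) {
        this.out = out;
      }

      void writeChunk(ByteBuffer compressed) throws IOException {
        if (out instanceof ByteBufferWritable) {
          // Fast path: hand the (possibly direct) buffer to the downstream as is.
          ((ByteBufferWritable) out).write(compressed);
        } else {
          // Fallback: drain through a reusable byte[] as is done today.
          while (compressed.hasRemaining()) {
            int n = Math.min(copyBuf.length, compressed.remaining());
            compressed.get(copyBuf, 0, n);
            out.write(copyBuf, 0, n);
          }
        }
      }
    }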
