[ https://issues.apache.org/jira/browse/HADOOP-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736256#comment-16736256 ]

BELUGA BEHR commented on HADOOP-16022:
--------------------------------------

[[email protected]] Thanks, Steve, for the interest.

I looked at the test failures and found the entire setup a bit wonky.

In particular, see 
[here|https://github.com/apache/hadoop/blob/7b57f2f71fbaa5af4897309597cca70a95b04edd/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFile.java#L659]:

 
{code:java|title=TFile.java}
void finishDataBlock(boolean bForceFinish) throws IOException {
...
    // exceeded the size limit, do the compression and finish the block
    if (bForceFinish || blkAppender.getCompressedSize() >= sizeMinBlock) {
...

{code}
As I understand it:

The general flow of this code is that a bunch of small records are serialized 
into bytes and written out to a stream. Once a certain threshold of bytes has 
been compressed, the stream is stopped, flushed, and written out as a single 
block. I believe the current logic is flawed because, as we can see here, the 
block boundary is based on the size of the compressed bytes and not on the 
total number of bytes written into the stream.

What is happening here is that as bytes are written to the stream, they are 
first buffered (in the {{BufferedOutputStream}} this patch touches) before 
being passed to the compressor. The bytes only reach the compressor once the 
buffer has filled and is forced to flush.
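
To make that concrete, here is a small, self-contained analogue using plain JDK 
streams (not the Hadoop {{Compression}} classes; the class and variable names 
are just for illustration): a 4K {{BufferedOutputStream}} sits in front of a 
deflater, and a counter records how many raw bytes actually reach the compressor.

{code:java|title=BufferedCompressionDemo.java (illustration only)}
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

public class BufferedCompressionDemo {

  // Counts the bytes that make it past the write buffer to the compressor.
  static class CountingOutputStream extends FilterOutputStream {
    long count = 0;
    CountingOutputStream(OutputStream out) { super(out); }
    @Override public void write(int b) throws IOException { count++; out.write(b); }
    @Override public void write(byte[] b, int off, int len) throws IOException {
      count += len;
      out.write(b, off, len);
    }
  }

  public static void main(String[] args) throws IOException {
    CountingOutputStream toCompressor = new CountingOutputStream(
        new DeflaterOutputStream(new ByteArrayOutputStream()));
    // 4K write buffer in front of the compressor, mirroring DATA_OBUF_SIZE
    OutputStream out = new BufferedOutputStream(toCompressor, 4 * 1024);

    byte[] record = new byte[512];
    for (int i = 1; i <= 10; i++) {
      out.write(record);
      // Nothing reaches the compressor until the 4K buffer fills, so a
      // block-finish check keyed off the compressed size sees no progress.
      System.out.println("raw bytes written: " + (i * 512)
          + ", bytes handed to the compressor: " + toCompressor.count);
    }
    out.close();
  }
}
{code}
With a 4K buffer the counter stays at zero for the first eight 512-byte writes 
and then jumps; double the buffer and the jump happens twice as late, which is 
exactly why the block counts in the tests moved.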

So, in the current implementation, 4K bytes are written to the 
{{BufferedOutputStream}}, the buffer is flushed, the bytes are compressed, the 
compressed size is reported by {{getCompressedSize()}}, and the data is flushed 
out as a block. When I changed the buffer to 8K, twice as much data was 
buffered before compression and written to each block. This is confusing to say 
the least: the number of blocks written out depends on the arbitrary size of 
the internal write buffer returned by the {{Compression}} class, and that makes 
the behavior hard to test. The person crafting a unit test must know how big 
this internal, non-configurable write buffer is in order to write an effective 
test. Also, if we use the default JDK buffer size (as recommended), these tests 
may fail depending on the JDK implementation. I think it is better to change 
the code to finish blocks based on the number of raw bytes written into the 
stream, not the number of bytes in their compressed form. In this way, writing 
{{n}} bytes will always yield {{y}} blocks, no matter how big the write buffer 
is.
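
Concretely, what I have in mind is something like the sketch below, assuming 
{{BlockAppender}} exposes the raw (uncompressed) byte count via its 
{{getRawSize()}} accessor; if it does not, the writer would need to track that 
count itself:

{code:java|title=TFile.java (sketch of the proposed change)}
void finishDataBlock(boolean bForceFinish) throws IOException {
...
    // finish the block once enough *raw* (uncompressed) bytes have been
    // written, so block boundaries no longer depend on how large the
    // intermediate write buffer happens to be
    if (bForceFinish || blkAppender.getRawSize() >= sizeMinBlock) {
...
{code}
This would make {{sizeMinBlock}} a raw-byte threshold rather than a 
compressed-byte one, but a test could then compute the expected number of 
blocks from the record sizes alone, without knowing anything about the buffer 
that {{Compression}} hands back.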

 

Thoughts?

> Increase Compression Buffer Sizes - Remove Magic Numbers
> --------------------------------------------------------
>
>                 Key: HADOOP-16022
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16022
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 2.10.0, 3.2.0
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Minor
>         Attachments: HADOOP-16022.1.patch
>
>
> {code:java|title=Compression.java}
>     // data input buffer size to absorb small reads from application.
>     private static final int DATA_IBUF_SIZE = 1 * 1024;
>     // data output buffer size to absorb small writes from application.
>     private static final int DATA_OBUF_SIZE = 4 * 1024;
> {code}
> These hard-coded buffer sizes exist in the Compression code.  Instead, use 
> the JVM default sizes, which, these days, are usually 8K.
>  


