[
https://issues.apache.org/jira/browse/HADOOP-16022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736256#comment-16736256
]
BELUGA BEHR commented on HADOOP-16022:
--------------------------------------
[[email protected]] Thanks, Steve, for the interest.
I looked at the test failures and found the entire setup a bit wonky.
In particular...
[Here|https://github.com/apache/hadoop/blob/7b57f2f71fbaa5af4897309597cca70a95b04edd/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFile.java#L659]
{code:java|title=TFile.java}
void finishDataBlock(boolean bForceFinish) throws IOException {
...
// exceeded the size limit, do the compression and finish the block
if (bForceFinish || blkAppender.getCompressedSize() >= sizeMinBlock) {
...
{code}
As I understand it:
The general flow of this code is that a bunch of small records are serialized
into bytes and written out to a stream. After a certain threshold of bytes from
the stream has been compressed, the stream is stopped, flushed, and written out
as a single block. The current logic is a bit flawed, I believe, because, as we
can see here, the decision to finish a block is based on the size of the
compressed bytes and not on the total number of raw bytes written into the
stream.
What is happening is that as the bytes are written to the stream, they are
first buffered (into the {{BufferedOutputStream}} I touched) before being
passed to the compressor. The bytes only reach the compressor once the buffer
has filled and been forced to flush.
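To make that buffering delay concrete, here is a minimal, self-contained
sketch. It is not the real {{Compression}}/TFile wiring: a plain
{{ByteArrayOutputStream}} stands in for the compressor, and the 4K is
hard-coded only to mirror {{DATA_OBUF_SIZE}}.
{code:java|title=BufferDelayDemo.java (illustrative sketch only)}
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferDelayDemo {
  public static void main(String[] args) throws IOException {
    // Stand-in for the compressor side of the pipeline; size() tells us how
    // many bytes have actually been handed downstream.
    ByteArrayOutputStream downstream = new ByteArrayOutputStream();
    // 4K mirrors DATA_OBUF_SIZE; bump it to 8K and the point at which bytes
    // first reach the "compressor" moves with it.
    BufferedOutputStream buffered = new BufferedOutputStream(downstream, 4 * 1024);

    buffered.write(new byte[3 * 1024]);     // 3K written by the application...
    System.out.println(downstream.size());  // ...but 0 bytes downstream so far

    buffered.write(new byte[2 * 1024]);     // crossing 4K forces a flush
    System.out.println(downstream.size());  // 3072: only now does data move on

    buffered.close();
  }
}
{code}
The same writes reach the downstream side at different points depending solely
on the buffer size, which is exactly what moves the block boundaries around.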
So, in the current implementation, 4K of data is written to the
{{BufferedOutputStream}}, the buffer fills and is flushed, the bytes are
compressed, the compressed size is reported by {{getCompressedSize()}}, and the
data is flushed out as a block. When I changed the buffer to 8K, twice as much
data was buffered before compression and written into each block. To say the
least, this is confusing: the number of blocks written out depends on the
arbitrary size of the {{BufferedOutputStream}} returned by the {{Compression}}
class, which makes the behavior hard to test. Anyone crafting a unit test must
know how big this internal, non-configurable write buffer is in order to write
an effective test. Also, if we use the default JDK buffer size (as
recommended), the tests may fail depending on the JDK implementation. I think
it is better to change the code so that blocks are based on the number of raw
bytes written into the stream, not on their compressed size. In this way,
writing {{n}} bytes will always yield the same number of blocks, no matter how
big the write buffer is.
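Roughly, the change I have in mind looks like the following. This is a sketch
only: {{getRawSize()}} is my assumption for an accessor on {{blkAppender}} that
reports the uncompressed byte count; I have not verified the exact name.
{code:java|title=TFile.java (proposed direction, sketch only)}
void finishDataBlock(boolean bForceFinish) throws IOException {
...
  // Finish the block based on the raw bytes fed into it, not on how many
  // compressed bytes the compressor happens to have emitted so far. Writing
  // n raw bytes then always yields the same number of blocks, regardless of
  // the internal write-buffer size.
  if (bForceFinish || blkAppender.getRawSize() >= sizeMinBlock) {
...
{code}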
Thoughts?
> Increase Compression Buffer Sizes - Remove Magic Numbers
> --------------------------------------------------------
>
> Key: HADOOP-16022
> URL: https://issues.apache.org/jira/browse/HADOOP-16022
> Project: Hadoop Common
> Issue Type: Improvement
> Components: io
> Affects Versions: 2.10.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Minor
> Attachments: HADOOP-16022.1.patch
>
>
> {code:java|title=Compression.java}
> // data input buffer size to absorb small reads from application.
> private static final int DATA_IBUF_SIZE = 1 * 1024;
> // data output buffer size to absorb small writes from application.
> private static final int DATA_OBUF_SIZE = 4 * 1024;
> {code}
> These hard-coded buffer sizes exist in the Compression code. Instead, use
> the JVM default sizes, which, these days, are usually 8K.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]