[
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629690#action_12629690
]
Hong Tang commented on HADOOP-3315:
-----------------------------------
* Is Util#memcmp different from WritableComparator#compareBytes()?
[Hong] Looks like they are the same. My oversight.
* Shouldn't BoundedByteArrayOutputStream extend ByteArrayOutputStream?
[Hong] No, it should not. ByteArrayOutputStream does not bound the number of
bytes written to the output stream; it automatically grows the internal buffer,
which is not what we want. The two classes share little beyond the buffer and
count fields.
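To illustrate the bounding behavior described above, here is a minimal, hypothetical sketch (the class name and error message are assumptions, not the actual BoundedByteArrayOutputStream in the patch): unlike ByteArrayOutputStream, it refuses writes past a fixed limit instead of growing the buffer.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch only: a stream over a fixed-size buffer that fails
// once the limit is reached, rather than reallocating a larger buffer.
class BoundedSketchOutputStream extends OutputStream {
    private final byte[] buffer;
    private int count;

    BoundedSketchOutputStream(int limit) {
        this.buffer = new byte[limit];
    }

    @Override
    public void write(int b) throws IOException {
        if (count >= buffer.length) {
            // ByteArrayOutputStream would grow the buffer here instead.
            throw new EOFException("Write past the bound of the buffer.");
        }
        buffer[count++] = (byte) b;
    }

    int size() { return count; }
}
```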
* the VLong code duplicates code in WritableUtils, no?
[Hong] No. The new VLong format enlarges the range of integers that can be
encoded with 2-4 bytes (at the expense of a reduced range of negative integers
in the 1-byte case). The new format can represent -32 to 127 with 1B, -5120 to
5119 with 2B, -1M to 1M-1 with 3B, and -128M to 128M-1 with 4B. Compare
WritableUtils' VLong: 1B: -112 to 127, 2B: -256 to 255, 3B: -64K to 64K-1, 4B:
-16M to 16M-1. This encoding scheme is more efficient for TFile, where we may
have lots of small integers but never small negative integers.
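The ranges quoted above imply the following byte lengths; this hypothetical helper only encodes that arithmetic (it is not the actual TFile or WritableUtils encoder, and values outside the quoted 4B ranges are lumped together):

```java
// Byte lengths implied by the ranges quoted above. Sketch only; the real
// encoders live in the TFile patch and in o.a.h.io.WritableUtils.
final class VLongRanges {
    // TFile VLong: 1B -32..127, 2B -5120..5119, 3B -1M..1M-1, 4B -128M..128M-1.
    static int tfileSize(long n) {
        if (n >= -32 && n <= 127) return 1;
        if (n >= -5120 && n <= 5119) return 2;
        if (n >= -(1L << 20) && n <= (1L << 20) - 1) return 3;
        if (n >= -(1L << 27) && n <= (1L << 27) - 1) return 4;
        return 5; // longer encodings, not covered by the quoted ranges
    }

    // WritableUtils VLong: 1B -112..127, 2B -256..255, 3B -64K..64K-1,
    // 4B -16M..16M-1.
    static int writableSize(long n) {
        if (n >= -112 && n <= 127) return 1;
        if (n >= -256 && n <= 255) return 2;
        if (n >= -(1L << 16) && n <= (1L << 16) - 1) return 3;
        if (n >= -(1L << 24) && n <= (1L << 24) - 1) return 4;
        return 5;
    }
}
```

For example, a small positive integer such as 1000 takes 2 bytes in the new format but 3 bytes under WritableUtils, which is the win for TFile's workload of small non-negative integers.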
* readString/writeString duplicates Text methods.
[Hong] Sort of, but because Text.readString and writeString use
WritableUtils' VInt, using those methods directly would force us to document
WritableUtils' VInt/VLong encoding as well, and defining two VInt/VLong
standards in one spec would be confusing.
* should the Compression enum be simply a new method on
CompressionCodecFactory? If not, shouldn't it go in the io.compress package?
[Hong] This part is a quick implementation of what should become a more
extensible compression-algorithm management scheme in the future. The reason we
did not directly use CompressionCodecFactory is that
CompressionCodecFactory.getCodec() expects a path and finds the codec from the
path's suffix based on some configuration. Using it directly would break the
TFile spec's requirement for language/implementation neutrality. On the other
hand, it may be nice to include a standard string-name-to-compression-codec
mapping in Hadoop.
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Attachments: HADOOP-3315_TFILE_PREVIEW.patch,
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.