[ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629690#action_12629690 ]

Hong Tang commented on HADOOP-3315:
-----------------------------------

    * Is Util#memcmp different from WritableComparator#compareBytes()?
[Hong] Looks like they are the same. My oversight.

    * Shouldn't BoundedByteArrayOutputStream extend ByteArrayOutputStream?
[Hong] No, it should not. ByteArrayOutputStream does not bound the number of 
bytes written to the stream; it automatically grows its internal buffer, which 
is not what we want. The two classes share little beyond the buffer and count 
fields.
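
For illustration, a minimal sketch of the intended behavior (the class name, 
method set, and overflow handling below are assumptions, not the patch's 
actual code):

{code}
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch only: a stream backed by a fixed-size buffer that
// fails on overflow instead of growing, which is the behavior
// ByteArrayOutputStream cannot provide. The name, method set, and
// overflow handling are assumptions, not the patch's actual code.
public class BoundedByteArrayOutputStreamSketch extends OutputStream {
  private final byte[] buffer; // fixed-capacity backing buffer
  private int count;           // number of valid bytes written so far

  public BoundedByteArrayOutputStreamSketch(int capacity) {
    this.buffer = new byte[capacity];
  }

  @Override
  public void write(int b) throws IOException {
    if (count >= buffer.length) {
      throw new IOException("write past the end of the bounded buffer");
    }
    buffer[count++] = (byte) b;
  }

  public byte[] getBuffer() { return buffer; }
  public int size() { return count; }
}
{code}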

    * the VLong code duplicates code in WritableUtils, no?
[Hong] No. The new VLong format enlarges the range of integers that can be 
encoded in 2-4 bytes (at the expense of a reduced negative range in the 1-byte 
case). The new format represents -32 to 127 in 1 byte, -5120 to 5119 in 2 
bytes, -1M to 1M-1 in 3 bytes, and -128M to 128M-1 in 4 bytes. Compare 
WritableUtils' VLong: -112 to 127 in 1 byte, -256 to 255 in 2 bytes, -64K to 
64K-1 in 3 bytes, and -16M to 16M-1 in 4 bytes. The new scheme is more 
efficient for TFile, where we may have lots of small integers but never small 
negative integers.
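
The size classes above translate directly into a range check. A small sketch, 
derived only from the ranges quoted above (the byte-level layout is defined in 
the TFile spec and is not reproduced here):

{code}
// Sketch: how many bytes the new VLong encoding needs for a value,
// derived from the ranges quoted above. The byte-level layout itself
// is defined by the TFile spec and is not reproduced here.
public class VLongSizeSketch {
  static int encodedSize(long n) {
    if (n >= -32 && n <= 127) return 1;                    // 1B: -32 .. 127
    if (n >= -5120 && n <= 5119) return 2;                 // 2B: -5120 .. 5119
    if (n >= -(1L << 20) && n <= (1L << 20) - 1) return 3; // 3B: -1M .. 1M-1
    if (n >= -(1L << 27) && n <= (1L << 27) - 1) return 4; // 4B: -128M .. 128M-1
    return 5;                                              // larger values: 5+ bytes
  }

  public static void main(String[] args) {
    System.out.println(encodedSize(100));    // 1
    System.out.println(encodedSize(4000));   // 2 (3 under WritableUtils)
    System.out.println(encodedSize(500000)); // 3 (4 under WritableUtils)
  }
}
{code}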

    * readString/writeString duplicates Text methods.
[Hong] Sort of, but Text.readString and Text.writeString use WritableUtils' 
VInt. If we used those methods directly, we would also have to document 
WritableUtils' VInt/VLong encoding, and defining two VInt/VLong standards in 
one spec would be confusing.
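
As a rough sketch of what a spec-local writeString looks like, assuming a 
writeVInt helper that implements the new TFile encoding (the helper name and 
signature are hypothetical):

{code}
import java.io.DataOutput;
import java.io.IOException;

// Sketch only: a spec-local writeString that length-prefixes UTF-8
// bytes with the new TFile VInt, so the spec never has to reference
// WritableUtils' encoding. writeVInt is a hypothetical stand-in for
// the TFile encoding routine, not an actual method in the patch.
public class StringIOSketch {
  static void writeString(DataOutput out, String s) throws IOException {
    byte[] utf8 = s.getBytes("UTF-8");
    writeVInt(out, utf8.length); // the new TFile VInt, not WritableUtils'
    out.write(utf8);
  }

  // Placeholder: the real encoding follows the VInt/VLong scheme above.
  static void writeVInt(DataOutput out, int n) throws IOException {
    throw new UnsupportedOperationException("see the TFile spec");
  }
}
{code}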

    * should the Compression enum be simply a new method on 
CompressionCodecFactory? If not, shouldn't it go in the io.compress package?
[Hong] This part is a quick implementation of what should eventually become a 
more extensible compression-algorithm management scheme. We did not use 
CompressionCodecFactory directly because CompressionCodecFactory.getCodec() 
expects a path and looks up the codec from the path's suffix based on 
configuration; using it directly would break the TFile spec's requirement for 
language/implementation neutrality. On the other hand, it may be nice for 
Hadoop to include a standard mapping from string names to compression codecs.
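
A rough sketch of such a mapping (the enum shape, names, and codec classes 
below are assumptions for illustration, not the patch's actual Compression 
class):

{code}
// Illustrative sketch: a fixed, implementation-neutral mapping from
// compression-algorithm names to codec classes, in contrast to
// CompressionCodecFactory's file-suffix lookup. Names and classes
// below are assumptions for illustration.
public enum CompressionAlgorithmSketch {
  NONE("none", null),
  GZ("gz", "org.apache.hadoop.io.compress.DefaultCodec"),
  LZO("lzo", "org.apache.hadoop.io.compress.LzoCodec");

  private final String name;       // neutral name recorded in the file
  private final String codecClass; // Java codec bound to the name, if any

  CompressionAlgorithmSketch(String name, String codecClass) {
    this.name = name;
    this.codecClass = codecClass;
  }

  public String getName() { return name; }
  public String getCodecClassName() { return codecClass; }

  // Resolve a name read back from a TFile into an algorithm constant.
  public static CompressionAlgorithmSketch byName(String name) {
    for (CompressionAlgorithmSketch a : values()) {
      if (a.name.equals(name)) {
        return a;
      }
    }
    throw new IllegalArgumentException("Unknown compression name: " + name);
  }
}
{code}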

> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_TFILE_PREVIEW.patch, 
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 
