[jira] Commented: (HADOOP-3315) New binary file format

Hong Tang (JIRA) Tue, 09 Sep 2008 21:24:10 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629699#action_12629699
 ]


Hong Tang commented on HADOOP-3315:
-----------------------------------

Just checked the code (CodedInputStream.java). The protocol buffer VInt (or 
VLong) is pretty interesting. It first transforms the integer through ZigZag 
encoding, which essentially turns the long into leading 000+[n]+[sign]. 
It then encodes the n+1 bits using ceiling((n+1)/7) bytes (7 bits per byte, in 
little-endian style). So effectively, 1B can represent -64 to 63, 2B: -8K to 
8K-1, 3B: -1M to 1M-1, 4B: -128M to 128M-1. Compared to my encoding scheme, I 
basically traded off some 2B encoding space for expanded 1B coverage. 
Additionally, protocol buffer's decoding requires you to read byte after byte, 
while both WritableUtils and my VLong can detect the length of the whole 
encoding after the first byte.
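
To make the byte ranges above concrete, here is a minimal sketch of ZigZag 
plus 7-bits-per-byte little-endian encoding. It is not the actual 
CodedInputStream code: the class name ZigZagVarintSketch and the method names 
are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;

public class ZigZagVarintSketch {
    // ZigZag: move the sign into the lowest bit so that values of small
    // magnitude (positive or negative) have many leading zero bits.
    public static long zigZagEncode(long n) {
        return (n << 1) ^ (n >> 63);
    }

    public static long zigZagDecode(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Emit 7 payload bits per byte, least significant group first.
    // The high bit of each byte says whether another byte follows.
    public static byte[] encode(long value) {
        long z = zigZagEncode(value);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // Decoding has to inspect every byte: the total length is only known
    // once a byte with a clear high bit is reached.
    public static long decode(byte[] buf) {
        long z = 0;
        int shift = 0;
        for (byte b : buf) {
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return zigZagDecode(z);
    }

    public static void main(String[] args) {
        // Boundary values for the 1B..4B ranges quoted above.
        long[] samples = {-64, 63, 64, -8192, 8191, 8192,
                          -(1L << 20), (1L << 20) - 1,
                          -(1L << 27), (1L << 27) - 1, 1L << 27};
        for (long v : samples) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

The loop in decode() shows the point about reading byte after byte: unlike a 
format whose first byte carries the length, the reader cannot know how many 
bytes to fetch up front.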



> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_TFILE_PREVIEW.patch, 
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
