[
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630426#action_12630426
]
Owen O'Malley commented on HADOOP-3315:
---------------------------------------
I believe the primary advantages over SequenceFile are (see the sketch after
the list):
* Support for large values (> heap size)
* 1 Codec for compression/decompression instead of 4, for a lower memory footprint
* TFile doesn't require the value to be entirely buffered in RAM before being
written
* no required scanning for sync/block boundaries
* number of columns included in header
* metadata at end so can include data summary
* replaces map files and sequence files with a single format
* can support seek to row #
* can sample keys very efficiently
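To make the streaming point concrete, here is a minimal sketch of a
large-value append. The API names here (TFile.Writer, prepareAppendKey,
prepareAppendValue, the "memcmp" comparator, and the package) are assumptions
about how the format could be exposed, not a committed interface:

{code:java}
import java.io.DataOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.file.tfile.TFile;

public class TFileStreamingWrite {
  public static void write(Configuration conf, Path path, InputStream hugeValue)
      throws Exception {
    FileSystem fs = path.getFileSystem(conf);
    FSDataOutputStream out = fs.create(path);
    // 64k min block size, gzip compression, raw-byte key comparator
    // (all names assumed, per the lead-in above).
    TFile.Writer writer = new TFile.Writer(out, 64 * 1024, "gz", "memcmp", conf);

    // Keys are small and written whole; -1 means length unknown up front.
    DataOutputStream k = writer.prepareAppendKey(-1);
    k.write("row-00000001".getBytes("UTF-8"));
    k.close();

    // The value is streamed in fixed-size chunks through an OutputStream,
    // so a value larger than the heap never has to be buffered in RAM.
    DataOutputStream v = writer.prepareAppendValue(-1);
    byte[] buf = new byte[4096];
    int n;
    while ((n = hugeValue.read(buf)) > 0) {
      v.write(buf, 0, n);
    }
    v.close();

    writer.close();
    out.close();
  }
}
{code}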
I agree with Doug's comments on the Util class, which should be broken up into
separate classes.
I would either stick with Hadoop vints or use protocol buffer vints. If we use
protocol buffer vints, they should be named differently. My preference would be
to stick with the current Hadoop vints.
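For concreteness, a small sketch contrasting the two encodings.
WritableUtils.writeVInt is the existing Hadoop vint; the protobuf-style
encoder is hand-rolled here to avoid a library dependency and handles
non-negative values only, and the class name is just for illustration:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;

public class VIntComparison {
  // Hadoop vint: values in [-112, 127] fit in one byte; otherwise a
  // sign/length prefix byte is followed by the value's big-endian bytes.
  static byte[] hadoopVInt(int value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    WritableUtils.writeVInt(new DataOutputStream(bytes), value);
    return bytes.toByteArray();
  }

  // Protocol-buffer-style varint: little-endian base-128 groups, with the
  // high bit of each byte marking "more bytes follow". Non-negative only.
  static byte[] protobufVarint(int value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      bytes.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    bytes.write(value);
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    for (int v : new int[] {0, 127, 128, 300, 1 << 20}) {
      System.out.printf("%8d -> hadoop %d bytes, protobuf %d bytes%n",
          v, hadoopVInt(v).length, protobufVarint(v).length);
    }
  }
}
{code}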
Since a lot of the focus on TFile has been on making it performant, it would be
nice to see a benchmark that uses a 100 byte Text key and a 5k byte Text value
and compares SequenceFile against TFile. I'd suggest running both with no
compression and possibly with lzo.
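A rough sketch of the SequenceFile half of such a benchmark (the TFile half
would be analogous once its API settles); the record count, class name, and
local-filesystem path are illustrative only:

{code:java}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWriteBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/bench.seq");

    // 100-byte key, 5k value, as suggested above.
    byte[] k = new byte[100];
    byte[] v = new byte[5 * 1024];
    Arrays.fill(k, (byte) 'k');
    Arrays.fill(v, (byte) 'v');
    Text key = new Text(k);
    Text value = new Text(v);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, Text.class,
        SequenceFile.CompressionType.NONE);

    long start = System.currentTimeMillis();
    int records = 100 * 1000; // illustrative count
    for (int i = 0; i < records; i++) {
      writer.append(key, value);
    }
    writer.close();
    long millis = System.currentTimeMillis() - start;
    System.out.println(records + " records in " + millis + " ms");
  }
}
{code}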
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Attachments: HADOOP-3315_TFILE_PREVIEW.patch,
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs a single codec.