[
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630426#action_12630426
]
Owen O'Malley commented on HADOOP-3315:
---------------------------------------
I believe the primary advantages over SequenceFile are (see the sketch after
the list):
* Support for large values (> heap size)
* 1 Codec for compression/decompression instead of 4, for a lower memory footprint
* TFile doesn't require the value to be entirely buffered in RAM before being
written
* no required scanning for sync/block boundaries
* number of columns included in header
* metadata at end so can include data summary
* replaces map files and sequence files with a single format
* can support seek to row #
* can sample keys very efficiently
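To make the streaming point concrete, here is a minimal sketch of a
large-value append. The API names here (TFile.Writer, prepareAppendKey,
prepareAppendValue, the "memcmp" comparator, and the package) are assumptions
about how the format could be exposed, not a committed interface:

{code:java}
import java.io.DataOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.file.tfile.TFile;

public class TFileStreamingWrite {
  public static void write(Configuration conf, Path path, InputStream hugeValue)
      throws Exception {
    FileSystem fs = path.getFileSystem(conf);
    FSDataOutputStream out = fs.create(path);
    // 64k min block size, gzip compression, raw-byte key comparator
    // (all names assumed, per the lead-in above).
    TFile.Writer writer = new TFile.Writer(out, 64 * 1024, "gz", "memcmp", conf);

    // Keys are small and written whole; -1 means length unknown up front.
    DataOutputStream k = writer.prepareAppendKey(-1);
    k.write("row-00000001".getBytes("UTF-8"));
    k.close();

    // The value is streamed in fixed-size chunks through an OutputStream,
    // so a value larger than the heap never has to be buffered in RAM.
    DataOutputStream v = writer.prepareAppendValue(-1);
    byte[] buf = new byte[4096];
    int n;
    while ((n = hugeValue.read(buf)) > 0) {
      v.write(buf, 0, n);
    }
    v.close();

    writer.close();
    out.close();
  }
}
{code}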
I agree with Doug's comments on the Util class, which should be broken up into
separate classes.
I would either stick with Hadoop vints or use protocol buffer vints. If we use
protocol buffer vints, they should be named differently. My preference would be
to stick with the current Hadoop vints.
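For concreteness, a small sketch contrasting the two encodings.
WritableUtils.writeVInt is the existing Hadoop vint; the protobuf-style
encoder is hand-rolled here to avoid a library dependency and handles
non-negative values only, and the class name is just for illustration:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;

public class VIntComparison {
  // Hadoop vint: values in [-112, 127] fit in one byte; otherwise a
  // sign/length prefix byte is followed by the value's big-endian bytes.
  static byte[] hadoopVInt(int value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    WritableUtils.writeVInt(new DataOutputStream(bytes), value);
    return bytes.toByteArray();
  }

  // Protocol-buffer-style varint: little-endian base-128 groups, with the
  // high bit of each byte marking "more bytes follow". Non-negative only.
  static byte[] protobufVarint(int value) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      bytes.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    bytes.write(value);
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    for (int v : new int[] {0, 127, 128, 300, 1 << 20}) {
      System.out.printf("%8d -> hadoop %d bytes, protobuf %d bytes%n",
          v, hadoopVInt(v).length, protobufVarint(v).length);
    }
  }
}
{code}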
Since a lot of the focus on TFile has been on making it performant, it would be
nice to see a benchmark that uses a 100 byte Text key and a 5k byte Text value
and compares SequenceFile against TFile. I'd suggest running both with no
compression and possibly with lzo.
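A rough sketch of the SequenceFile half of such a benchmark (the TFile half
would be analogous once its API settles); the record count, class name, and
local-filesystem path are illustrative only:

{code:java}
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWriteBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/bench.seq");

    // 100-byte key, 5k value, as suggested above.
    byte[] k = new byte[100];
    byte[] v = new byte[5 * 1024];
    Arrays.fill(k, (byte) 'k');
    Arrays.fill(v, (byte) 'v');
    Text key = new Text(k);
    Text value = new Text(v);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, Text.class,
        SequenceFile.CompressionType.NONE);

    long start = System.currentTimeMillis();
    int records = 100 * 1000; // illustrative count
    for (int i = 0; i < records; i++) {
      writer.append(key, value);
    }
    writer.close();
    long millis = System.currentTimeMillis() - start;
    System.out.println(records + " records in " + millis + " ms");
  }
}
{code}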
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Attachments: HADOOP-3315_TFILE_PREVIEW.patch,
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs a single codec.