[ 
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629916#action_12629916
 ] 

Doug Cutting commented on HADOOP-3315:
--------------------------------------

Looking at the specification document, I see the major stated goals are (1) 
language neutrality, (2) extensibility, and (3) compatibility.  I assume these 
are relative to SequenceFile.

Language neutrality without an implementation in another language seems a risky 
claim.  SequenceFile's only language dependence is in the naming of key and 
value classes, but implementations of these classes are not required to process 
a SequenceFile.  SequenceFile, like TFile, lacks implementations in other 
languages, so I don't yet see a clear advantage there.

(2) and (3) are closely related.  SequenceFile has proven extensible and 
back-compatible.  Many features have been added without breaking 
back-compatibility.  I don't see a qualitative advantage here to the TFile 
format.

Perhaps you should include a section specifically addressing the advantages of 
TFile over SequenceFile, how they are achieved and how they can be measured.

I suspect there may be other unstated goals in TFile.  The case for TFile 
should be clearly made, as it adds a lot of code to Hadoop that must now be 
supported.  If it has demonstrable advantages over SequenceFile and the case can 
be made that we will be able to retire SequenceFile after it is added, then 
TFile should go forward.  Or if it is significantly simpler than SequenceFile 
while providing the same features, that might make the case that it will be 
easier to reimplement in other languages.  But if it is equivalently complex 
and supports more-or-less the same features then it only adds baggage to the 
project.


> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_TFILE_PREVIEW.patch, 
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
