[ 
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630476#action_12630476
 ] 

Hong Tang commented on HADOOP-3315:
-----------------------------------

Adding to Owen's comments.
- One of the hidden assumptions of the TFile design is that once data are 
stored in TFile format, they will stay there for a long time and may need to 
survive software changes and improvements. That is where the requirements of 
extensibility and compatibility come from. Note that we are not only talking 
about backward compatibility (new software reading old data), but also forward 
compatibility (old software reading new data). Various mechanisms have been 
built into the design to facilitate this goal (named meta blocks, 
length-guarded array entries, <offset, length> region specification, etc.); a 
short sketch of the length-guard idea follows this list. This is also why we 
are required to provide a storage spec that goes down to the details of every 
bit.
- Another unstated design requirement is that the design must be performant. 
This not only means that scanning and writing TFiles should run at close to 
bulk I/O throughput; it also means that we need to support reading and writing 
many TFiles concurrently, and thus we need to keep the memory footprint small. 
The internal requirement is to support 100+ open TFiles with a single-digit-MB 
memory footprint per TFile, regardless of the settings of minimum block size 
and/or value size.
- We also want to make sure that data stored in TFile can be exchanged among 
different groups who may have different preferences in programming languages. 
Admittedly, we cannot claim language neutrality until we actually implement it 
in more than one programming language. However, having this objective from the 
very beginning and reviewing it from time to time certainly helps. And the 
fact that we are able to spec out the storage format down to every bit without 
referring to Java or any other library gives us confidence that we can achieve 
this goal.
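
To make the forward-compatibility point concrete, here is a minimal sketch 
(not from the patch) of how a length-guarded entry lets an old reader skip 
fields appended by a newer writer. The record layout and field names below 
are hypothetical:

  import java.io.DataInputStream;
  import java.io.IOException;

  // A length-guarded entry carries its total length up front, so a
  // reader that only understands the first two fields can still skip
  // past fields added by a newer format revision.
  class GuardedEntryReader {
    static void readEntry(DataInputStream in) throws IOException {
      int entryLength = in.readInt();     // guard: bytes that follow
      long offset = in.readLong();        // field known to this reader
      long regionLength = in.readLong();  // field known to this reader
      int consumed = 8 + 8;
      // Skip any trailing fields from a newer writer; a robust
      // implementation would loop until all bytes are skipped.
      in.skipBytes(entryLength - consumed);
    }
  }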

With regard to the question of why the implementation needs so many lines, I 
did a quick run-down of the code:
- Over 1100 lines are comments.
- Over 500 lines are blank lines and import statements.
- Many classes and methods in Utils are fairly general and could be reused, 
such as the chunk encoder/decoder, lowerBound(), upperBound(), 
BoundedByteArrayOutputStream, BoundedRangeFileInputStream, etc. (see the 
sketch after this item). These take about 400 lines of code (excluding blank 
lines and comments).
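
To give a flavor of these utilities, here is a minimal sketch in the spirit 
of BoundedRangeFileInputStream: a stream that confines reads to an <offset, 
length> region so a block reader cannot run past the end of its block. The 
class in the patch wraps an FSDataInputStream and handles more cases:

  import java.io.FilterInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  // Sketch only: restrict reads to the next `length` bytes of the
  // underlying stream (assumed already positioned at the region start).
  class BoundedInputStream extends FilterInputStream {
    private long remaining;

    BoundedInputStream(InputStream in, long length) {
      super(in);
      this.remaining = length;
    }

    @Override
    public int read() throws IOException {
      if (remaining <= 0) return -1;  // region exhausted
      int b = in.read();
      if (b >= 0) --remaining;
      return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      if (remaining <= 0) return -1;
      int n = in.read(buf, off, (int) Math.min(len, remaining));
      if (n > 0) remaining -= n;
      return n;
    }
  }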

So that brings the actual TFile-specific code to about 1200 lines. Among 
these:
- Supporting streaming-style key-value appending and reading. This adds some 
adapter classes around DataInputStream and DataOutputStream to perform setup 
and finalization, about 120 lines of code (a usage sketch follows below). 
Streaming not only helps performance by avoiding unnecessary buffering; it is 
also easier to use than asking users to write adapter classes implementing 
the Writable interface.
- Defending against misuse of the API (enforcing state transitions, etc.) and 
resource cleanup take about 100 lines of code.
- Serialization and deserialization of the various TFile metadata take about 
250 lines of code.
So the code that deals with core TFile read/write logic is about 600-800 lines.
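
To illustrate the streaming-style append interface, here is a hypothetical 
usage sketch; the method names follow the preview patch as I understand it 
and may not match it exactly:

  import java.io.DataOutputStream;
  import java.io.IOException;

  // Hypothetical usage: the writer hands back a DataOutputStream for
  // the key and then for the value; closing each stream performs the
  // finalization step. No intermediate byte[] buffering or Writable
  // adapter is needed.
  void appendEntry(TFile.Writer writer, byte[] payload) throws IOException {
    DataOutputStream key = writer.prepareAppendKey(-1); // -1: length unknown
    key.writeUTF("row-00042");
    key.close();

    DataOutputStream value = writer.prepareAppendValue(payload.length);
    value.write(payload);  // large values could be written in chunks
    value.close();
  }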


> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_TFILE_PREVIEW.patch, 
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.