[ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630476#action_12630476 ]
Hong Tang commented on HADOOP-3315:
-----------------------------------
Adding to Owen's comments.
- One of the hidden assumptions of the TFile design is that once data are stored
in TFile format, they will stay there for a long time, and may need to survive
software changes/improvements. That is where the requirements of extensibility
and compatibility come from. Note that we are not only talking about backward
compatibility (new software reading old data), but also forward compatibility
(old software reading new data). Various mechanisms have been built into our
design to facilitate this goal (named meta blocks, length-guarded array entries,
<offset, length> region specification, etc.); a small sketch of length-guarded
reading is given after this list. This is also the reason we are required to
provide a storage spec that goes down to the details of every bit.
- Another unstated design requirement is that the design must be performant.
This means not only that scanning/writing TFiles should proceed at close to bulk
I/O throughput, but also that we need to support reading/writing many TFiles
concurrently, and thus must keep our memory footprint small. The internal
requirement is to support 100+ TFiles with a single-digit-MB memory footprint per
TFile, regardless of the settings of minimum block size and/or value size.
- We also want to make sure data stored in TFile may be exchanged among
different groups who may have different preferences in programming languages.
Admittedly, we cannot claim language neutrality until we actually implement it
in more than one programming language. However, having this objective from the
very beginning and reviewing it from time to time certainly helps. And the fact
that we are able to spec out the storage format down to every bit without
referring to Java or any other library gives us confidence in achieving this
goal.
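To illustrate the forward-compatibility mechanism mentioned above, here is a
simplified sketch of reading a length-guarded entry: each entry is prefixed with
its byte length, so an old reader consumes the fields it knows about and skips
any trailing bytes added by a newer writer. The field layout and names are
illustrative only, not the actual TFile spec.

{code:java}
import java.io.DataInputStream;
import java.io.IOException;

public class LengthGuardedReader {
  // Reads one length-guarded entry holding an <offset, length> region.
  public static void readRegion(DataInputStream in) throws IOException {
    int entryLength = in.readInt();   // bytes in the entry body
    long offset = in.readLong();      // fields known to this (old) reader
    long length = in.readLong();
    int consumed = 8 + 8;             // bytes consumed so far
    if (entryLength > consumed) {
      in.skipBytes(entryLength - consumed);  // skip fields added by a newer
    }                                        // writer: forward compatibility
    System.out.println("region: offset=" + offset + ", length=" + length);
  }
}
{code}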
With regard to the question of why we need so many lines for the implementation,
I did a quick run-down of the code:
- Over 1100 lines are comments.
- Over 500 lines are blank lines and import lines.
- Many classes and methods in Utils are fairly general and could be reused,
such as the chunk encoder/decoder, lowerBound(), upperBound(),
BoundedByteArrayOutputStream, BoundedRangeFileInputStream, etc. They take about
400 lines of code (excluding blank lines and comments); a sketch of lowerBound()
is given after this run-down.
So that brings the actual TFile-specific code to about 1200 lines. Among them:
- Support for streaming-style key-value appending and reading. This adds some
adaptor classes around the DataInputStream and DataOutputStream classes to
perform setup and finalization, and accounts for about 120 lines of code.
Streaming not only helps performance by avoiding unnecessary buffering; it is
also easier to use than asking users to write adaptor classes implementing the
Writable interface. A sketch of such an adaptor appears below.
- Defense against misuse of the API (enforcing state transitions, etc.) and
resource cleanup take about 100 lines of code; the state-check idea is also
sketched below.
- Serialization and deserialization of the various pieces of TFile metadata
take about 250 lines of code; a small example appears below as well.
So the code that deals with core TFile read/write logic is about 600-800 lines.
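To make the run-down concrete, here are a few simplified sketches of the pieces
mentioned above. They illustrate the ideas only; names and signatures are mine
for illustration and not necessarily those in the patch. First, the kind of
general-purpose routine counted under Utils: a generic lowerBound() that returns
the index of the first element greater than or equal to the key in a sorted
list.

{code:java}
import java.util.Comparator;
import java.util.List;

public final class SearchUtils {
  // Classic binary search for the lower bound in a sorted list; returns
  // list.size() if every element compares less than the key.
  public static <T> int lowerBound(List<? extends T> list, T key,
                                   Comparator<? super T> cmp) {
    int low = 0;
    int high = list.size();
    while (low < high) {
      int mid = (low + high) >>> 1;    // unsigned shift avoids overflow
      if (cmp.compare(list.get(mid), key) < 0) {
        low = mid + 1;                 // everything up to mid is < key
      } else {
        high = mid;                    // mid could still be the answer
      }
    }
    return low;
  }
}
{code}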
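Second, the streaming-append idea: the writer hands the user a
DataOutputStream-style adaptor; setup happens when it is created, and the entry
is finalized when the stream is closed, so values never need to be fully
buffered in memory. The class and method names here are hypothetical.

{code:java}
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class ValueAppendStream extends DataOutputStream {
  private boolean closed = false;

  ValueAppendStream(OutputStream blockStream) {
    super(blockStream);
    // setup: e.g., remember the start offset so the entry's length
    // can be recorded when the stream is closed
  }

  @Override
  public void close() throws IOException {
    if (closed) {
      return;
    }
    closed = true;
    flush();
    // finalization: e.g., record the entry's length and update the
    // writer's state; the underlying block stream is deliberately
    // left open, since more entries will follow in the same block
  }
}
{code}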
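Third, the defense against API misuse boils down to explicit state checks at
the top of every public method, so that errors like appending after close()
fail fast. The state names are illustrative.

{code:java}
class GuardedWriter {
  private enum State { READY, CLOSED }
  private State state = State.READY;

  void append(byte[] key, byte[] value) {
    expect(State.READY);               // refuse appends after close()
    // ... write the key-value pair ...
  }

  void close() {
    expect(State.READY);               // refuse double close
    state = State.CLOSED;
    // ... flush and release resources ...
  }

  private void expect(State expected) {
    if (state != expected) {
      throw new IllegalStateException(
          "Illegal operation in state " + state + ", expected " + expected);
    }
  }
}
{code}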
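Finally, a taste of the metadata (de)serialization, using a version stamp as an
example. The field layout here is illustrative and not the bit-level spec.

{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

final class Version {
  private final short major;
  private final short minor;

  Version(short major, short minor) {
    this.major = major;
    this.minor = minor;
  }

  // deserialization: read the stamp back from a stream
  Version(DataInput in) throws IOException {
    major = in.readShort();
    minor = in.readShort();
  }

  // serialization: write the stamp to a stream
  void write(DataOutput out) throws IOException {
    out.writeShort(major);
    out.writeShort(minor);
  }

  // same major version means the layout is understood; a newer minor
  // version may only add fields that older readers can skip
  boolean compatibleWith(Version other) {
    return major == other.major;
  }
}
{code}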
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Attachments: HADOOP-3315_TFILE_PREVIEW.patch,
> HADOOP-3315_TFILE_PREVIEW_WITH_LZO_TESTS.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs