[
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633976#action_12633976
]
Hong Tang commented on HADOOP-3315:
-----------------------------------
bq. + How about defines for at least the common compression types and
comparator name(s) at least for the common case where TFile is used by java?
Agree.
bq. + If no compression, does that mean there is only one block in a file or do
we still make blocks of size minBlockSize (raw size == compressed size)?
Multiple blocks, raw size == compressed size.
bq. + If I wanted to ornament the index - say, I wanted to add a metadata block
per BCFile block that had in it the offset of every added key (or the offset of
every 'row' in hbase) in the name of improving random access speeds - it looks
like I would override prepareAppendKey and then do my own KeyRegister class
that keeps up the per-block index? KeyRegister is currently private. Can it be
made subclassable?
That is a use case I have not thought about. A (slightly) less performant way
I can recommend is to write your own key appender and value appender classes as
filter streams on top of the key/value appending streams returned by TFile, and
add customized actions in close() (before/after calling close() on the
downstream).
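
A rough sketch of the filter-stream idea, using only java.io. The class name
IndexingKeyStream and the KeyListener callback are hypothetical and not part of
TFile; the wrapped stream is assumed to be whatever key-appending stream TFile
hands back.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical wrapper around a key-appending stream. It keeps a copy of the
 * bytes written for one key and hands them to a listener once the downstream
 * stream has been closed.
 */
class IndexingKeyStream extends FilterOutputStream {
    /** Hypothetical callback, e.g. for maintaining a per-block key index. */
    interface KeyListener {
        void keyAppended(byte[] key);
    }

    private final ByteArrayOutputStream copy = new ByteArrayOutputStream();
    private final KeyListener listener;
    private boolean closed;

    IndexingKeyStream(OutputStream downstream, KeyListener listener) {
        super(downstream);
        this.listener = listener;
    }

    @Override
    public void write(int b) throws IOException {
        copy.write(b);
        out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        copy.write(b, off, len);
        out.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return; // make repeated close() calls harmless
        }
        closed = true;
        // A "before close" action would go here.
        super.close(); // flushes and closes the downstream
        // "After close" action: the key is complete, so report it.
        listener.keyAppended(copy.toByteArray());
    }
}
```

The extra copy of each key is where the (slight) performance cost comes from; a
value appender could be wrapped the same way.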
bq. advanceCursorInBlock is also private which doesn't help if I want to
exploit my ancillary-index info. Or what would you suggest if I want to make a
more-involved index (I can't use the BCFile block index since key/values might
be of variable size - or, maybe I can set the blocksize to zero and index every
element?).
Hmm, it is not clear how intercepting advanceCursorInBlock would help. Would it
suffice for you to know how many <key, value> pairs a seek() call moves forward
or backward?
bq. To add support for alternate comparators and for exposing the index at
least to subclasses, should we add a patch atop your patch or just wait till
whats here gets committed?
I'd say let's wait until it gets committed. Also, to keep in line with the original
design objective, language-specific customized comparators should have string
names like "jclass:path/to/java/package/ClassName", or
"clib:path/to/C/library/functionName".
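
To illustrate the naming scheme, here is a minimal sketch of how a reader might
split such a name into its language prefix and target. ComparatorName and its
methods are hypothetical, not part of the patch, and the sketch assumes every
customized name has the form "<scheme>:<target>"; names without a prefix are
outside its scope.

```java
/** Hypothetical parsed form of a comparator name such as "jclass:...". */
final class ComparatorName {
    final String scheme; // e.g. "jclass" or "clib"
    final String target; // everything after the first ':'

    private ComparatorName(String scheme, String target) {
        this.scheme = scheme;
        this.target = target;
    }

    /** Splits on the first ':' and rejects names missing either part. */
    static ComparatorName parse(String name) {
        int colon = name.indexOf(':');
        if (colon <= 0 || colon == name.length() - 1) {
            throw new IllegalArgumentException(
                "expected <scheme>:<target>, got: " + name);
        }
        return new ComparatorName(
            name.substring(0, colon), name.substring(colon + 1));
    }

    /** True when the name refers to a Java class. */
    boolean isJavaClass() {
        return "jclass".equals(scheme);
    }
}
```

A Java reader could load the class for a "jclass" name and refuse to sort or
seek by key when it meets a scheme it cannot resolve, such as "clib".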
bq. It looks like I could do an in-memory TFile if I wanted since I provide the
stream? Is that so? If so, thats sweet!
Guess so. (We haven't thought about this usage though.)
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch,
> HADOOP-3315_20080915_TFILE.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs