[ 
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633976#action_12633976
 ] 

Hong Tang commented on HADOOP-3315:
-----------------------------------

bq. + How about defines for at least the common compression types and 
comparator name(s) at least for the common case where TFile is used by java?
Agree.

bq. + If no compression, does that mean there is only one block in a file or do 
we still make blocks of size minBlockSize (raw size == compressed size)?
Multiple blocks, raw size == compressed size.

bq. + If I wanted to ornament the index - say, I wanted to add a metadata block 
per BCFile block that had in it the offset of every added key (or the offset of 
every 'row' in hbase) in the name of improving random access speeds - it looks 
like I would override prepareAppendKey and then do my own KeyRegister class 
that keeps up the per-block index? KeyRegister is currently private. Can it be 
made subclassable? 
That is a use case I have not thought about. A (slightly) less performant way 
I can recommend is to write your own key appender and value appender classes as 
filter streams on top of the key/value appending streams returned by TFile, and 
add customized actions in close() (before/after calling close() on the 
downstream stream).
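
A minimal sketch of that filter-stream idea, using only java.io. The class 
HookedAppender and its CloseHook callback are hypothetical names, not part of 
TFile, and a ByteArrayOutputStream stands in for the key or value appending 
stream that TFile would return:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper (not a TFile class): counts bytes, runs hooks around close(). */
class HookedAppender extends FilterOutputStream {
    /** Hypothetical callback type; receives the number of bytes written so far. */
    interface CloseHook {
        void run(long bytesWritten) throws IOException;
    }

    private final CloseHook beforeClose;
    private final CloseHook afterClose;
    private long bytesWritten = 0;
    private boolean closed = false;

    HookedAppender(OutputStream downstream, CloseHook beforeClose, CloseHook afterClose) {
        super(downstream);
        this.beforeClose = beforeClose;
        this.afterClose = afterClose;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Forward the whole range at once instead of byte by byte.
        out.write(buf, off, len);
        bytesWritten += len;
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return; // make repeated close() calls harmless
        }
        closed = true;
        beforeClose.run(bytesWritten); // e.g. record an offset in a side index
        out.close();                   // close the downstream appending stream
        afterClose.run(bytesWritten);  // e.g. flush the side index
    }

    long getBytesWritten() {
        return bytesWritten;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream TFile would hand back (assumption).
        ByteArrayOutputStream standIn = new ByteArrayOutputStream();
        try (HookedAppender key = new HookedAppender(standIn,
                n -> System.out.println("about to close after " + n + " bytes"),
                n -> System.out.println("downstream closed"))) {
            key.write("row-0001".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

The extra indirection on every write is where the (slight) performance cost 
would come from.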

bq. advanceCursorInBlock is also private which doesn't help if I want to 
exploit my ancillary-index info. Or what would you suggest if I want to make a 
more-involved index (I can't use the BCFile block index since key/values might 
be of variable size - or, maybe I can set the blocksize to zero and index every 
element?).
Hmm, it is not clear how intercepting advanceCursorInBlock would help. Would it 
suffice for you to know how many <key, value> pairs a seek() call moves 
forward or backward?

bq. To add support for alternate comparators and for exposing the index at 
least to subclasses, should we add a patch atop your patch or just wait till 
whats here gets committed?
I'd say let's wait until it gets committed. Also, to keep in line with the original 
design objective, language-specific customized comparators should have string 
names like "jclass:path/to/java/package/ClassName", or
"clib:path/to/C/library/functionName".

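To illustrate the naming scheme above, here is a small sketch that splits such 
a string into its language prefix and target. ComparatorName is a hypothetical 
helper, not a TFile class. How the target maps to an actual Java class or C 
function is left open, and treating a name without a prefix as having an empty 
scheme is an assumption of this sketch:

```java
/** Hypothetical helper (not part of TFile): splits "scheme:target" names. */
final class ComparatorName {
    final String scheme; // e.g. "jclass" or "clib"; empty if the name has no prefix
    final String target; // everything after the first ':', kept verbatim

    private ComparatorName(String scheme, String target) {
        this.scheme = scheme;
        this.target = target;
    }

    static ComparatorName parse(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("empty comparator name");
        }
        int colon = name.indexOf(':');
        if (colon < 0) {
            // Assumption: unprefixed names are passed through unchanged.
            return new ComparatorName("", name);
        }
        if (colon == 0 || colon == name.length() - 1) {
            throw new IllegalArgumentException("malformed comparator name: " + name);
        }
        return new ComparatorName(name.substring(0, colon), name.substring(colon + 1));
    }

    boolean isJavaClass() {
        return "jclass".equals(scheme);
    }

    boolean isCLibrary() {
        return "clib".equals(scheme);
    }
}
```

A reader in another language could check the prefix and reject comparators it 
cannot load, without having to interpret the target.
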
bq. It looks like I could do an in-memory TFile if I wanted since I provide the 
stream? Is that so? If so, thats sweet!
Guess so. (We haven't thought about this usage though.)

> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch, 
> HADOOP-3315_20080915_TFILE.patch, TFile Specification Final.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 
