[ 
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592878#action_12592878
 ] 

Srikanth Kakani commented on HADOOP-3315:
-----------------------------------------

> Doug Cutting - 28/Apr/08 09:27 AM
> Do you expect to make this a drop-in replacement for SequenceFile and 
> MapFile, or rather something that we expect code to migrate to? I'm guessing 
> the latter.

I think it will be the latter as well.

> So we should have a magic number there too, but, if it doesn't harm things, 
> I'd prefer leaving an (frequently unread) magic number in the header too.

When would the header magic be read? If it is always read, wouldn't that result 
in two seeks anyway? If not, why do we have to complicate the format?

> Jim Kellerman - 26/Apr/08 10:19 AM
> Dropping the record length would seriously slow down random reads unless the 
> index is 'complete', i.e., every key/offset is represented. If the index is 
> sparse like MapFile's, you would only get an approximate location of the 
> desired record and then have to do a lot of work to seek forward to the 
> desired one.

Each block in this file would be loadable into memory, so it doesn't really 
matter (much) whether we store the key length or not: the total cost is bounded 
by the seek and the read. Even walking the index with variable-sized keys is 
linear. Maybe bounding the index to one read block makes sense as well.
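
To make the "bounded by seek and read" point concrete, here is a minimal 
sketch of a linear scan over a block that is already in memory. The layout 
(4-byte key length, 4-byte value length, key, value) and the names 
encode_block and find_in_block are assumptions made for illustration only; 
they are not the format proposed in this issue.

```python
import struct

# Hypothetical record layout: two big-endian 4-byte lengths, then key, then value.
HEADER = struct.Struct(">II")

def encode_block(records):
    """Serialize (key, value) byte pairs into one contiguous block."""
    out = bytearray()
    for key, value in records:
        out += HEADER.pack(len(key), len(value))
        out += key
        out += value
    return bytes(out)

def find_in_block(block, target_key):
    """Scan the in-memory block record by record; return the value or None."""
    pos = 0
    while pos < len(block):
        key_len, value_len = HEADER.unpack_from(block, pos)
        pos += HEADER.size
        key = block[pos:pos + key_len]
        pos += key_len
        if key == target_key:
            return block[pos:pos + value_len]
        pos += value_len  # skip the value and move on to the next record
    return None
```

Once the block has been read, the scan touches only memory, so its cost is 
linear in the block size and no further seeks are needed.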

The only case where this changes is if we have metadata saying the keys are 
fixed size, in which case every seek-to-key is O(1).

One more thought: an index based purely on record ids (with a fixed-size 
encoding) might keep the index skippable/seekable.
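
As a rough sketch of why a fixed-size encoding keeps such an index seekable: 
if every entry has the same width, the entry for a given record id sits at a 
computable position and can be read without scanning the entries before it. 
The 8-byte offset width and the names build_index and offset_for_record are 
assumptions for illustration, not part of the proposal.

```python
import struct

# Hypothetical entry: one 8-byte big-endian file offset per record id.
ENTRY = struct.Struct(">Q")

def build_index(offsets):
    """Pack record offsets so that entry i describes record id i."""
    return b"".join(ENTRY.pack(offset) for offset in offsets)

def offset_for_record(index, record_id):
    """Look up a record's offset by position arithmetic, with no scan."""
    count = len(index) // ENTRY.size
    if not 0 <= record_id < count:
        raise IndexError("record id out of range")
    return ENTRY.unpack_from(index, record_id * ENTRY.size)[0]
```

The same arithmetic works against a file: seek to record_id * ENTRY.size 
within the index and read one entry, which is what makes the index skippable.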




> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
