[ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592839#action_12592839 ]

Doug Cutting commented on HADOOP-3315:
--------------------------------------

> Owen: I can't see any cases where we need the key length

I think we use it when sorting.  We read raw keys into memory and call raw 
comparators on them, passing the key length to the raw comparator.
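For illustration, the key length matters because the raw keys are packed into a shared buffer, so the comparator needs explicit offsets and lengths to know where each serialized key ends. A minimal self-contained sketch of that compare (the real method in Hadoop is WritableComparator.compareBytes, with this same signature):

```java
// Sketch of a raw byte-range comparator: without explicit key lengths
// we could not delimit serialized keys packed into one buffer.
public class RawCompareSketch {

    // Lexicographic compare of two byte ranges, in the shape of
    // WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2).
    public static int compareBytes(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;  // compare as unsigned bytes
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a - b;
        }
        return l1 - l2;  // on a common prefix, the shorter key sorts first
    }

    public static void main(String[] args) {
        // Two keys packed into one buffer; the lengths 5 and 6 delimit them.
        byte[] buf = "applebanana".getBytes();
        System.out.println(compareBytes(buf, 0, 5, buf, 5, 6));  // negative: "apple" < "banana"
    }
}
```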

The sorting code pre-dates MapReduce, though, from back when we manually 
partitioned, shuffled, sorted, merged, etc., in Nutch.  Perhaps we no 
longer need to sort and merge files directly, since folks can use MapReduce for 
that.  Does any application code still use SequenceFile#Sorter?

> Owen: my desire is to make applications not read the header and only read the 
> tail

So we should have a magic number there too, but, if it doesn't harm things, I'd 
prefer leaving a (frequently unread) magic number in the header too.
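A tail-only reader would seek to the end and validate the trailing magic without touching the header. A rough sketch of that check (the magic value and trailer layout here are hypothetical, not a proposed format):

```java
import java.io.RandomAccessFile;
import java.util.Arrays;

// Sketch: a reader that validates only the trailing magic number,
// never reading the file header. MAGIC is a made-up placeholder value.
public class TailMagicSketch {
    static final byte[] MAGIC = { 'S', 'E', 'Q', '2' };  // hypothetical

    // Pure check on the last bytes of a buffer.
    public static boolean endsWithMagic(byte[] tail) {
        if (tail.length < MAGIC.length) return false;
        byte[] last = Arrays.copyOfRange(tail, tail.length - MAGIC.length, tail.length);
        return Arrays.equals(last, MAGIC);
    }

    // Read just the tail of the file and check it.
    public static boolean hasTailMagic(String path) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            if (f.length() < MAGIC.length) return false;
            byte[] tail = new byte[MAGIC.length];
            f.seek(f.length() - MAGIC.length);  // skip everything before the trailer
            f.readFully(tail);
            return endsWithMagic(tail);
        }
    }

    public static void main(String[] args) {
        System.out.println(endsWithMagic(new byte[]{ 1, 2, 'S', 'E', 'Q', '2' }));  // true
    }
}
```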

Do you expect to make this a drop-in replacement for SequenceFile and MapFile, 
or rather something that we expect code to migrate to?  I'm guessing the latter.

> Alejandro: Our use case is specifically the example he mentions, the record 
> count.

I think we can include record-count as a base feature of this file format.  But 
permitting a Map<String,String> of other metadata might also be good.
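One possible on-disk shape for such metadata, with the record count carried as an ordinary entry alongside arbitrary user pairs (the layout and key names are illustrative only, not a committed design):

```java
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: serialize a Map<String,String> of file metadata as an
// entry count followed by UTF-8 key/value pairs. Hypothetical layout.
public class MetaSketch {

    public static byte[] writeMeta(Map<String, String> meta) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(meta.size());
            for (Map.Entry<String, String> e : meta.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);  // in-memory streams don't actually throw
        }
    }

    public static Map<String, String> readMeta(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            Map<String, String> meta = new LinkedHashMap<>();
            int n = in.readInt();
            for (int i = 0; i < n; i++) meta.put(in.readUTF(), in.readUTF());
            return meta;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("record.count", "12345");   // record count as a base metadata entry
        meta.put("created.by", "example");   // arbitrary user metadata
        System.out.println(readMeta(writeMeta(meta)));
    }
}
```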



> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
