[ 
https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629296#action_12629296
 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

I just wanted to post pseudo-code for this design that actually addresses this 
JIRA :) rather than only self-describing files like SequenceFile and Thrift's 
TRecordStream.

In the case of this JIRA, the file's metadata is stored in some external store 
or dictionary.  The only way to look up the file would be through its 
filename/path, so I think it's fair that on job submission the mapping is 
put in the JobConf.

Given this use case, and looking at line 43 of SequenceFileRecordReader 
( this.in = new SequenceFile.Reader(fs, path, conf); ), the TypeFile 
interface should be changed:

- public void initialize(Configuration conf, InputStream in);
+ public void initialize(FileSystem fs, Path path, Configuration conf);

Obviously it has to open the input stream anyway ( :) ).
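
To make the idea concrete, here is a rough sketch of what an implementation of the changed initialize could look like. It is only an illustration under stated assumptions, not the actual patch: java.nio.file.Path and java.util.Properties stand in for Hadoop's Path and Configuration (and the FileSystem argument is dropped) so that it runs without Hadoop on the classpath, and the names TypedFileSketch, FlatBinaryFileSketch and the "typedfile.record.class." key prefix are made up for this example.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical interface mirroring the proposed path-based initialize().
// Path and Properties are JDK stand-ins for Hadoop's Path and Configuration.
interface TypedFileSketch extends Closeable {
    void initialize(Path path, Properties conf) throws IOException;
    String recordClassName();
    InputStream stream();
}

class FlatBinaryFileSketch implements TypedFileSketch {
    // Hypothetical key prefix: the mapping path -> record class is assumed
    // to have been put into the configuration at job submission.
    static final String KEY_PREFIX = "typedfile.record.class.";

    private String recordClassName;
    private InputStream in;

    @Override
    public void initialize(Path path, Properties conf) throws IOException {
        // The path is the only handle on the external metadata, so use it
        // to look up which class the file contains.
        String name = conf.getProperty(KEY_PREFIX + path);
        if (name == null) {
            throw new IOException("No record class configured for " + path);
        }
        recordClassName = name;
        // The implementation opens the stream itself instead of being handed one.
        in = Files.newInputStream(path);
    }

    @Override
    public String recordClassName() {
        return recordClassName;
    }

    @Override
    public InputStream stream() {
        return in;
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}

class TypedFileSketchDemo {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("records", ".bin");
        Files.write(path, new byte[] {1, 2, 3});

        // Done once, at "job submission" time.
        Properties conf = new Properties();
        conf.setProperty(FlatBinaryFileSketch.KEY_PREFIX + path, "com.example.MyRecord");

        try (TypedFileSketch file = new FlatBinaryFileSketch()) {
            file.initialize(path, conf);
            System.out.println(file.recordClassName() + " first byte: " + file.stream().read());
        } finally {
            Files.delete(path);
        }
    }
}
```

With the old signature the caller would have had to open the stream and the implementation would have had no path to key the lookup on; passing the path covers both.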

And a typo: SequenceFile would not implement SplittableTypedFile; 
SequenceFile.Reader would.

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>
> like textinputformat - looking for a concrete implementation to read binary 
> records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set 
> splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration 
> somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not 
> have the deserializer throw an exception (which is hard to distinguish from 
> an exception due to corruption)). this is easy for non-compressed streams - 
> for compressed streams - DecompressorStream has a useful looking 
> getAvailable() call - except the class is marked package private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
