[ 
https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Description: 
Implement a generic FlatFileDeserializationRecordReader that assumes a 
Serialization implementation is specified in the JobConf and that, once 
instantiated, that Serialization implementation can determine from the JobConf 
the actual class being deserialized. For example, the JobConf specifies 
RecordIOSerialization, and the specific class is LogRecordObject. 

Another way to do this is to use the SerializationFactory to look up the 
Serialization implementation; however, this requires all Deserializers to be 
known a priori and registered, which goes against the spirit of a very generic 
FlatFileDeserializeRecordReader.

To ensure it is generic, I propose implementing the following Serialization 
implementations:

1. RecordIOSerialization
2. LineReaderSerialization
3. ThriftSerialization

The first two should go in io/serialization and the Thrift one somewhere in 
contrib. 
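
As a rough illustration of the lookup described above (this is not the attached 
patch): the sketch below uses a plain Map in place of the JobConf and a 
hypothetical RecordSerialization interface in place of Hadoop's Serialization. 
The configuration key names and every class name in it are made up for the 
example; only the two-step idea (the reader instantiates the configured 
implementation, and the implementation then resolves the record class from the 
same configuration) comes from the description.

```java
import java.util.Map;

// Hypothetical stand-in for a Serialization implementation; not Hadoop's interface.
interface RecordSerialization {
    // Called once after instantiation so the implementation can read its own settings.
    void configure(Map<String, String> conf);

    // The concrete record class this serialization will deserialize.
    Class<?> recordClass();
}

// Toy implementation that resolves the record class from the configuration.
class ConfiguredClassSerialization implements RecordSerialization {
    // Hypothetical key naming the record class (e.g. a LogRecordObject-like type).
    static final String RECORD_CLASS_KEY = "example.flatfile.record.class";

    private Class<?> recordClass;

    @Override
    public void configure(Map<String, String> conf) {
        String name = conf.get(RECORD_CLASS_KEY);
        if (name == null) {
            throw new IllegalArgumentException("missing " + RECORD_CLASS_KEY);
        }
        try {
            recordClass = Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("unknown record class: " + name, e);
        }
    }

    @Override
    public Class<?> recordClass() {
        return recordClass;
    }
}

// The reader side: it only knows the key that names the implementation.
class FlatFileReaderSketch {
    // Hypothetical key naming the serialization implementation class.
    static final String SERIALIZATION_KEY = "example.flatfile.serialization.class";

    static RecordSerialization loadSerialization(Map<String, String> conf) {
        String implName = conf.get(SERIALIZATION_KEY);
        if (implName == null) {
            throw new IllegalArgumentException("missing " + SERIALIZATION_KEY);
        }
        try {
            // No registry lookup: the class named in the configuration is
            // instantiated directly through its no-argument constructor.
            RecordSerialization s = Class.forName(implName)
                    .asSubclass(RecordSerialization.class)
                    .getDeclaredConstructor()
                    .newInstance();
            s.configure(conf);
            return s;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + implName, e);
        }
    }
}
```

Because nothing is registered ahead of time, adding a new format in this sketch 
means writing one more RecordSerialization class and naming it in the 
configuration; the reader itself does not change.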



  was:
Like TextInputFormat: looking for a concrete implementation to read binary 
records from a flat file (that may be compressed).

It's assumed that Hadoop can't split such a file, so the InputFormat can set 
splittable to false.

Tricky aspects are:
- How to know what class the file contains (this has to be in a configuration 
somewhere).
- How to determine EOF. It would be nice if Hadoop could determine EOF rather 
than have the deserializer throw an exception, which is hard to distinguish 
from an exception due to corruption. This is easy for non-compressed streams. 
For compressed streams, DecompressorStream has a useful-looking getAvailable() 
call, except the class is marked package private.
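
One possible way to separate a clean EOF from a truncated record, sketched here 
under assumptions that are not in the issue: records are framed with a 4-byte 
big-endian length prefix, and the names RecordBoundaryReader, readRecord and 
MAX_RECORD_BYTES are hypothetical. The sketch relies only on 
InputStream.read() returning -1 at end of stream, so it does not need the 
file position or any method on the decompressing stream.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

class RecordBoundaryReader {
    // Hypothetical sanity limit so a corrupt length cannot trigger a huge allocation.
    static final int MAX_RECORD_BYTES = 64 * 1024 * 1024;

    // Returns the next record body, or null when the stream ends exactly on a
    // record boundary. Throws if the stream ends inside a record.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int b1 = in.read();
        if (b1 < 0) {
            return null; // clean EOF: no bytes of a new record were present
        }
        int b2 = in.read();
        int b3 = in.read();
        int b4 = in.read();
        if ((b2 | b3 | b4) < 0) {
            throw new EOFException("stream ended inside a record length header");
        }
        int len = (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
        if (len < 0 || len > MAX_RECORD_BYTES) {
            throw new IOException("implausible record length: " + len);
        }
        byte[] body = new byte[len];
        // readFully throws EOFException if fewer than len bytes remain.
        in.readFully(body);
        return body;
    }
}
```

With this framing the caller loops until readRecord returns null, and any 
exception means the data is damaged rather than finished. Formats without a 
length prefix would need a different boundary check.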


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
