[ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633566#action_12633566 ]
Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

This is what HADOOP-3566 looks like as an instance of a FlatFileRecordReader (with signature <Void, String>, not <String, Void>). It assumes that a StringSerialization implementation (based on LineRecordReader) exists and that HADOOP-1230 is implemented. Even so, it should demonstrate that FlatFileRecordReader can be used for non-binary records. Without this, it can still be used for anything that implements the Serialization interface.

{code:title=StringInputFormat.java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
// FlatFileRecordReader is the class proposed in this issue; import it from
// wherever the patch places it.

public class StringInputFormat extends FileInputFormat<Void, String>
    implements JobConfigurable {

  private CompressionCodecFactory compressionCodecs = null;

  public void configure(JobConf conf) {
    compressionCodecs = new CompressionCodecFactory(conf);
  }

  // A file is splittable only if no compression codec matches it.
  // FileInputFormat spells this method isSplitable (one "t").
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return compressionCodecs.getCodec(file) == null;
  }

  public RecordReader<Void, String> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());

    //
    // Set this so the SerializerFromConf can lookup our deserializer.
    //
    job.setClass(FlatFileRecordReader.SerializationContextFromConf.SerializationImplKey,
        org.apache.hadoop.contrib.serialization.string.StringSerialization.class,
        org.apache.hadoop.io.serializer.Serialization.class);
    job.setClass(FlatFileRecordReader.SerializationContextFromConf.SerializationSubclassKey,
        java.lang.String.class, java.lang.String.class);

    return new FlatFileRecordReader<String>(job, (FileSplit) split);
  }
}
{code}

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/serialization, mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary
> records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set
> splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration
> somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not
> have the deserializer throw an exception (which is hard to distinguish from
> an exception due to corruption?)). this is easy for non-compressed streams -
> for compressed streams - DecompressorStream has a useful looking
> getAvailable() call - except the class is marked package private.
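
On the EOF question in the issue description, one possible approach is to peek a single byte before each record. A -1 at that point is a clean end of stream on a record boundary, while an EOFException raised by the deserializer afterwards means the stream ended inside a record. The sketch below uses only the JDK. The class FlatRecordStream, its methods, and the writeUTF record encoding are hypothetical stand-ins for a real deserializer and are not taken from the attached patches.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper: reads writeUTF-encoded records until a clean end of stream. */
final class FlatRecordStream {
    private FlatRecordStream() {}

    /**
     * Reads every record from the stream.
     * Returns normally when the stream ends on a record boundary;
     * an EOFException propagates if it ends inside a record.
     */
    static List<String> readAll(InputStream raw) throws IOException {
        // One byte of pushback is enough to peek at the next record.
        PushbackInputStream in = new PushbackInputStream(raw, 1);
        // DataInputStream does not buffer, so it sees the pushed-back byte.
        DataInputStream data = new DataInputStream(in);
        List<String> records = new ArrayList<>();
        while (true) {
            int first = in.read();
            if (first == -1) {
                return records;          // clean EOF between records
            }
            in.unread(first);            // put the peeked byte back
            records.add(data.readUTF()); // EOFException here: truncated record
        }
    }

    /** Assumed test fixture: gzip-compresses the given records. */
    static byte[] gzipRecords(String... records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            for (String record : records) {
                out.writeUTF(record);
            }
        }
        return bytes.toByteArray();
    }
}
```

Because the check relies only on read() returning -1, it behaves the same when the input is wrapped in a decompressing stream such as java.util.zip.GZIPInputStream, and it does not need access to DecompressorStream internals.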