[ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632058#action_12632058 ]
Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

Hi Owen - the motivation was based on different families of binary (or even non-binary) data embedded within flat files (by which I mean they are unsplittable and not self-describing, except for compression).
- We should be able to write one concrete implementation that covers the no-splits and compression-related code
- One should be able to plug in different deserializers for different binary formats

The desire was that once this is written, different deserializers can be plugged in easily. In that sense, this does not follow the general pattern that you observed of having to write custom code to deal with splitting (since there's no splitting here).

Existing interfaces should not have to be changed (although things got pretty complicated in the intermediate discussion) - and I don't think they are, although I am going to send feedback on the code separately. The code should be really simple, I would think.

Do you think this is a reasonable thing to add?

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception (which is hard to distinguish from an exception due to corruption?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.
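
The EOF concern in the description can be illustrated with a minimal sketch. It is not the attached patch and it uses no Hadoop classes: `RecordDeserializer`, `FlatFileRecordReader` and `IntDeserializer` are hypothetical names invented here. The sketch assumes the reader is handed an already-opened (and, if needed, already-decompressing) `InputStream` whose `read()` returns -1 at the end of the data. It peeks one byte through `java.io.PushbackInputStream` so that a clean end of file is reported by the reader, and an `EOFException` from the deserializer can only mean a truncated or corrupt record.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical plug-in point: decodes exactly one record from the stream. */
interface RecordDeserializer<T> {
    T readRecord(InputStream in) throws IOException;
}

/** Hypothetical reader for an unsplittable flat file of binary records. */
final class FlatFileRecordReader<T> {
    private final PushbackInputStream in;
    private final RecordDeserializer<T> deserializer;

    FlatFileRecordReader(InputStream raw, RecordDeserializer<T> deserializer) {
        // A pushback buffer of one byte is enough to peek for end of stream.
        this.in = new PushbackInputStream(raw, 1);
        this.deserializer = deserializer;
    }

    /** Returns the next record, or null at a clean end of stream. */
    T next() throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;          // EOF on a record boundary: not an error
        }
        in.unread(b);             // put the peeked byte back for the deserializer
        return deserializer.readRecord(in);
    }
}

/** Example deserializer (an assumption): each record is a 4-byte big-endian int. */
final class IntDeserializer implements RecordDeserializer<Integer> {
    @Override
    public Integer readRecord(InputStream in) throws IOException {
        // DataInputStream does not buffer, so it consumes exactly four bytes;
        // it throws EOFException if the stream ends mid-record.
        return new DataInputStream(in).readInt();
    }
}
```

Because the peek only relies on `read()` returning -1, the same reader would work over a decompressing stream without needing access to `DecompressorStream.getAvailable()`. Which deserializer class to instantiate would still have to come from configuration, as the description notes.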