[ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632529#action_12632529 ]

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

Yes - given that this has no dependency on core Hadoop now, I really don't 
care - we can put this into Hive. The generic ThriftDeserializer is trivial - 
we could duplicate the code for now and then remove the copy once 3787 
provides those classes as well.
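
For reference, here is a minimal sketch of what that generic deserializer 
looks like - it just hands the raw stream to Thrift and reads consecutive 
records (this assumes the Apache Thrift Java bindings and TBinaryProtocol 
encoding; the class name is illustrative):

    import java.io.InputStream;
    import org.apache.thrift.TBase;
    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TIOStreamTransport;

    // Reads consecutive binary-encoded thrift records from a flat
    // (possibly already-decompressed) input stream.
    public class ThriftStreamDeserializer<T extends TBase> {
      private final TProtocol protocol;

      public ThriftStreamDeserializer(InputStream in) throws TException {
        this.protocol = new TBinaryProtocol(new TIOStreamTransport(in));
      }

      // Fills the reusable record with the next record from the stream.
      // Throws TException at EOF or on corruption - see the EOF
      // discussion in the issue description below.
      public T next(T record) throws TException {
        record.read(protocol);
        return record;
      }
    }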

By the way - we also don't store data in this manner, and I agree with all 
your observations. However:
- This request originated from outside Hive/Facebook. I get the impression 
(perhaps wrongly) that quite a few people just dump thrift logs into a flat 
file (just like people dump apache logs into a flat file). This is also 
because Thrift does not (so far) have a good framed file format.
- The same counter-argument can be made for TextInputFormat. The general 
observation is that data originates outside the Hadoop ecosystem, and the 
format it generally originates in is flat files. We should strive to make it 
as easy as possible to absorb this data and transform it into a better 
format (like SequenceFile).

That is the general effort with Hive, at least. We expect users to create 
temporary tables by pointing at flat files, quickly do some transformations 
(using SQL and potentially scripts), and load the result into tables in a 
SequenceFile-like format for longer-term storage. Being able to point at 
thrift flat files (and potentially other binary files) is part of that data 
integration story.

> Furthermore, since the types have to be configured, you can't use multiple 
> ones in different contexts. 

I'm not sure what you mean - but this is not true. The deserializer is 
obtained from a combination of the file name and file-name->deserializer 
metadata from an external source. Different files can be read using 
different deserializers and then operated on in the same map-reduce program - 
the application decides which class to use based on the file name (see the 
sketch below). We would be only too happy to demonstrate a join of two 
different thrift classes (in different files/tables) using Hive and a 
generic flat file reader like this.
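
To make that concrete, a sketch of the lookup (the registry here is 
hypothetical - in practice Hive supplies the file-name->class metadata; 
"map.input.file" is the configuration property the framework sets to the 
path of the split currently being processed):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical registry mapping an input file name to the thrift
    // class whose records it contains.
    public class DeserializerRegistry {
      private final Map<String, Class<?>> classByFile =
          new HashMap<String, Class<?>>();

      public void register(String fileName, Class<?> recordClass) {
        classByFile.put(fileName, recordClass);
      }

      // Each mapper can pick the right record class even when the job
      // spans several files/tables.
      public Class<?> lookup(JobConf job) {
        String fileName = new Path(job.get("map.input.file")).getName();
        return classByFile.get(fileName);
      }
    }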

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, 
> HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> Like TextInputFormat - looking for a concrete implementation to read binary 
> records from a flat file (that may be compressed).
> It's assumed that Hadoop can't split such a file, so the InputFormat can 
> set splittable to false.
> Tricky aspects are:
> - how to know what class the file contains (it has to be in a configuration 
> somewhere).
> - how to determine EOF (it would be nice if Hadoop could determine EOF 
> rather than have the deserializer throw an exception, which is hard to 
> distinguish from an exception due to corruption). This is easy for 
> non-compressed streams; for compressed streams, DecompressorStream has a 
> useful-looking getAvailable() call - except that the class is marked 
> package-private.
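
On the EOF question above: for non-compressed streams, one way to let the 
reader (rather than the deserializer) detect a clean end of file is to peek 
one byte before each record - a minimal sketch, assuming a 
PushbackInputStream wrapper:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    // Wraps the record stream so the reader can test for EOF before
    // invoking the deserializer; a clean end of file then never
    // surfaces as a thrift exception (which would be hard to tell
    // apart from real corruption).
    public class EofAwareStream {
      private final PushbackInputStream in;

      public EofAwareStream(InputStream raw) {
        this.in = new PushbackInputStream(raw, 1);
      }

      public boolean atEof() throws IOException {
        int b = in.read();
        if (b == -1) {
          return true;
        }
        in.unread(b); // put the byte back for the deserializer
        return false;
      }

      public InputStream stream() {
        return in;
      }
    }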

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
