RE: Hive Deserializer Interface

Alex Rovner Mon, 19 Jul 2010 10:42:56 -0700

Zheng,


Thank you for your reply. I do not fully agree with your statement.

 

Here is our situation:

 

In each partition there is more than one file. Each file has all the 
information that Hive needs, so as far as Hive is concerned the schema is the 
same.  What lies in the files is just the column headers.

 

For example:

 

Hive table schema:  AccountId, CategoryId, Impressions

 

Partition 1:

File 1  (Same as schema so the mapping is easy): 

AccountId, CategoryId, Impressions

100, 5, 1

120, 3, 1

 

File 2  (Same columns but in a different order): 

CategoryId, AccountId, Impressions

5, 100, 1

3, 120, 1

 

File 3  (CategoryId is missing, but we can use Hive's default):

AccountId, Impressions 

100, 1

120, 1

 

So technically each file can have a “different” schema but still be usable. I 
don’t think the limitation should be that the schema in each file should be the 
same. That is why Avro includes the schema in each file just like we do.
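
A minimal sketch of the remapping described above, not actual Hive SerDe code. 
It assumes each file is comma-separated text whose first line is the column 
header, and it uses None to stand in for Hive's default when a column is 
missing. The names TABLE_SCHEMA and read_rows are hypothetical and exist only 
for this illustration.

```python
import csv
import io

# Table schema from the example above (assumed to be known up front).
TABLE_SCHEMA = ["AccountId", "CategoryId", "Impressions"]


def read_rows(file_obj, table_schema=TABLE_SCHEMA, default=None):
    """Yield rows in table-schema order from a file whose first line is its header."""
    reader = csv.reader(file_obj, skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    # Position of each column inside this particular file.
    index = {name: i for i, name in enumerate(header)}
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Columns the file does not carry fall back to the default value.
        yield [fields[index[col]].strip() if col in index else default
               for col in table_schema]


# File 2 and File 3 from the example above.
file2 = io.StringIO("CategoryId, AccountId, Impressions\n5, 100, 1\n3, 120, 1\n")
file3 = io.StringIO("AccountId, Impressions\n100, 1\n120, 1\n")
print(list(read_rows(file2)))
print(list(read_rows(file3)))
```

Both files come out in the table's column order, with CategoryId filled by the 
default for File 3. The open question remains where this logic can live in 
Hive, since the header has to be read from the file itself.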

 

Any further ideas would be appreciated.

 

-- 

Thank You

Alex Rovner

 

From: Zheng Shao [mailto:[email protected]] 
Sent: Sunday, July 18, 2010 2:18 PM
To: [email protected]
Cc: <[email protected]>
Subject: Re: Hive Deserializer Interface

 

In Hive (and all relational databases), the schema of different rows in the 
same table is the same.

 

As a result, we should not put files with different schemas into the same table 
(or partition).


Sent from my iPhone


On Jul 17, 2010, at 9:33 PM, "Alex Rovner" <[email protected]> wrote:

        Hello,
        
        I was wondering if anyone can help me out with Hive InputFormat / 
        Deserializer.
        
        I am trying to implement a custom file format which is similar to 
        Avro: Each file will have the "schema" in the header.
        
        The issue I am having is that Hive's Deserializer interface doesn't 
        have a way to read this "schema" because it doesn't have access to 
        the input file.
        
        Some approaches that I have seen used by others but which do not work 
        for me:
        
        1. Set SerDe properties on partition (This doesn't work as there is 
           more than one file in each partition and they will have different 
           schemas)
        2. Use config.get("map.input.file") in initialize method to read the 
           schema (This will only work for mapreduce jobs. Simple queries in 
           CLI will fail as this property will not be set)
        
        
        Does anyone have an idea on how this should be done?
        
        Thank You
        Alex Rovner
