Zheng,

Thank you for your reply. I do not fully agree with your statement. Here is our situation:

In each partition there is more than one file. Each file has all the information that Hive needs, so as far as Hive is concerned the schema is the same. What the files contain is just the column headers. For example:

Hive table schema: AccountId, CategoryId, Impressions

Partition 1:

File 1 (same as the schema, so the mapping is easy):
AccountId, CategoryId, Impressions
100, 5, 1
120, 3, 1

File 2 (same columns but in a different order):
CategoryId, AccountId, Impressions
5, 100, 1
3, 120, 1

File 3 (CategoryId is missing, but we can use Hive's default):
AccountId, Impressions
100, 1
120, 1

So technically each file can have a “different” schema but still be usable. I don’t think the schema in each file should be required to be the same. That is why Avro includes the schema in each file, just like we do.

Any further ideas would be appreciated.

--
Thank You
Alex Rovner

From: Zheng Shao [mailto:[email protected]]
Sent: Sunday, July 18, 2010 2:18 PM
To: [email protected]
Cc: <[email protected]>
Subject: Re: Hive Deserializer Interface

In Hive (and all relational databases), the schema of different rows in the same table is the same. As a result, we should not put files with different schemas into the same table (or partition).

Sent from my iPhone

On Jul 17, 2010, at 9:33 PM, "Alex Rovner" <[email protected]> wrote:

Hello,

I was wondering if anyone can help me out with the Hive InputFormat / Deserializer. I am trying to implement a custom file format which is similar to Avro: each file will have the "schema" in the header. The issue I am having is that Hive's Deserializer interface doesn't have a way to read this "schema", because it doesn't have access to the input file.

Some approaches that I have seen used by others but which do not work for me:

1. Set SerDe properties on the partition. (This doesn't work, as there is more than one file in each partition and they will have different schemas.)
2. Use config.get("map.input.file") in the initialize method to read the schema. (This will only work for MapReduce jobs. Simple queries in the CLI will fail, as this property will not be set.)

Does anyone have an idea on how this should be done?

Thank You
Alex Rovner
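
For reference, the per-file header mapping from the File 1 to File 3 example in this thread could be sketched roughly as below. This is only a language-neutral illustration in Python, not Hive's Deserializer API: the names TABLE_SCHEMA and read_rows are hypothetical, values are left as strings, and None is an assumed stand-in for whatever default Hive would supply for a missing column.

```python
import csv
import io

# Table schema from the example in this thread.
TABLE_SCHEMA = ["AccountId", "CategoryId", "Impressions"]


def read_rows(file_text, table_schema=TABLE_SCHEMA, default=None):
    """Return rows in table-schema order, driven by the file's own header line.

    Hypothetical helper: `default` stands in for the table's default value
    when a column is absent from the file.
    """
    reader = csv.reader(io.StringIO(file_text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]

    # For each table column, find its position in this file (None if missing).
    positions = [
        header.index(column) if column in header else None
        for column in table_schema
    ]

    rows = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        rows.append(
            [record[p].strip() if p is not None else default for p in positions]
        )
    return rows


# File 2 from the example: same columns, different order.
file2 = "CategoryId, AccountId, Impressions\n5, 100, 1\n3, 120, 1\n"
# File 3 from the example: CategoryId is missing.
file3 = "AccountId, Impressions\n100, 1\n120, 1\n"

rows2 = read_rows(file2)
rows3 = read_rows(file3)
```

With this mapping, every file yields rows in the table's column order regardless of its own header, which is the sense in which the files are all usable under one table schema.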
