Re: HCatOutputFormat schema issues

Thomas Weise Tue, 01 Nov 2011 10:43:37 -0700

We should fix the documentation then?

http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html

On 11/1/11 9:13 AM, "Ashutosh Chauhan" <[email protected]> wrote:

Hey Charles,

After you have called HCatOutputFormat.setOutput(), you can call 
HCatOutputFormat.getTableSchema(), which returns the schema of the table. You 
can then use that schema directly instead of constructing one manually.
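
For illustration, here is a minimal sketch of the name-to-position mapping 
Charles asked about. It is plain Java and uses no HCatalog classes: the 
hard-coded column list is an assumption standing in for the column names you 
would read from the schema returned by getTableSchema(), and ColumnMapping, 
positions() and toRow() are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnMapping {

    /** Maps each column name to its position in the table's declared order. */
    public static Map<String, Integer> positions(List<String> tableColumns) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < tableColumns.size(); i++) {
            index.put(tableColumns.get(i), i);
        }
        return index;
    }

    /** Builds a row in table order from values keyed by column name. */
    public static List<Object> toRow(List<String> tableColumns, Map<String, Object> values) {
        Map<String, Integer> index = positions(tableColumns);
        // Start with one null slot per table column, then fill by name.
        List<Object> row =
                new ArrayList<>(Collections.<Object>nCopies(tableColumns.size(), null));
        for (Map.Entry<String, Object> entry : values.entrySet()) {
            Integer pos = index.get(entry.getKey());
            if (pos == null) {
                throw new IllegalArgumentException("Unknown column: " + entry.getKey());
            }
            row.set(pos, entry.getValue());
        }
        return row;
    }

    public static void main(String[] args) {
        // Assumption: stand-in for the column names taken from the table schema.
        List<String> tableColumns = Arrays.asList("word", "author", "frequency");

        // Values supplied by name, in an arbitrary order.
        Map<String, Object> values = new HashMap<>();
        values.put("frequency", 42);
        values.put("word", "whale");
        values.put("author", "melville");

        System.out.println(toRow(tableColumns, values)); // prints [whale, melville, 42]
    }
}
```

With a mapping like this, the job code refers to columns by name only, so the 
column positions do not have to be hardcoded.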

Hope it helps,
Ashutosh

On Mon, Oct 31, 2011 at 20:18, Charles Menguy <[email protected]> 
wrote:
Hi Ashutosh,

Thank you very much for your answer.

I can certainly understand your argument. Is there, however, a way to get the 
schema from the output table, so that we could build a dynamic mapping between 
the fields we want to write and the actual schema? If not, is there a standard 
way to accomplish what I described, other than hardcoding the positions of the 
columns in the code (which is bad for code reusability)? Any alternative would 
be helpful as well.

Thanks in advance!

Charles

On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <[email protected]> wrote:
Hey Charles,

Yes, you need to call setSchema() on HCatOutputFormat explicitly. We could 
assume defaults, but we don't, for the following reason. The rows being 
written may or may not contain partition columns. HCatOutputFormat 
transparently weeds out partition columns if they are present in the row. If 
we assumed defaults, we would have to assume that the data does not contain 
partition columns (we don't store partition columns in the data), which is a 
dangerous assumption that would cause problems when the data is read back. So 
instead we ask the user to set the schema. You are also correct that the order 
of columns should be the same as the one you declared when creating the table.
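
To make the ambiguity concrete, here is a rough sketch in plain Java. It is 
not HCatalog's implementation: PartitionColumnFilter and dataColumns() are 
hypothetical names, and the "dt" partition column is an assumption added for 
illustration (the inventory table in this thread is not partitioned). It shows 
that once the row's columns are declared, both row shapes reduce to the same 
stored columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionColumnFilter {

    /** Returns the declared row columns minus any partition columns, order preserved. */
    public static List<String> dataColumns(List<String> rowColumns,
                                           List<String> partitionColumns) {
        Set<String> partitions = new HashSet<>(partitionColumns);
        List<String> kept = new ArrayList<>();
        for (String column : rowColumns) {
            if (!partitions.contains(column)) {
                kept.add(column);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumption: a hypothetical partition column named "dt".
        List<String> partitionColumns = Arrays.asList("dt");

        // The same logical record, declared with and without the partition column.
        List<String> withPartition = Arrays.asList("word", "author", "frequency", "dt");
        List<String> withoutPartition = Arrays.asList("word", "author", "frequency");

        // Both declarations yield the same columns to store.
        System.out.println(dataColumns(withPartition, partitionColumns));
        System.out.println(dataColumns(withoutPartition, partitionColumns));
    }
}
```

Without the declared schema, a writer that receives four values has no way to 
know whether the last one is a partition value or a data column.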

Hope it helps,
Ashutosh

On Mon, Oct 31, 2011 at 14:54, Charles Menguy <[email protected]> 
wrote:
Hi,

I've been playing with HCatalog for the past couple weeks now, and I have a few 
questions regarding schemas in MR jobs.

From what I read in the documentation, schemas are optional, and if not 
specified it defaults to the table level schemas. Here are some extracts from 
the documentation:

"You can use the setOutputSchema method to include a projection schema, to 
specify specific output fields. If a schema is not specified, this default to 
the table level schema."

"The schema for the data being written out is specified by the setSchema method. 
If this is not called on the HCatOutputFormat, then by default it is assumed 
that the the partition has the same schema as the current table level schema"

Now when I try to omit the schema for HCatInputFormat, it works fine and 
assumes the default.
But when I try to omit the schema for HCatOutputFormat, I get the following 
error:

org.apache.hcatalog.common.HCatException : 9001 : Exception occurred 
while processing HCat request : It seems that setSchema() is not called on 
HCatOutputFormat. Please make sure that method is called.

From what I read, it expects me to define the schema explicitly with 
HCatOutputFormat.setSchema(...), but this is exactly the call I would like to 
omit so that the defaults are assumed.

This is actually important because it seems that, to define the schema, you 
have to know the order of your table's columns so that you can build your 
List<HCatFieldSchema> in that same order, which may not always be obvious.

Here is how I create my output table in Hive, which works fine when I'm 
manipulating it while specifying the schema:

hive> create table inventory(word STRING, author STRING, frequency INT) stored as RCFILE;

I would like to know if I'm doing something wrong, or if this is simply 
something not yet implemented in 0.2? Any thoughts would be useful.

Thanks,

Charles
