Re: HCatOutputFormat schema issues

Charles Menguy Mon, 31 Oct 2011 20:19:24 -0700

Hi Ashutosh,

Thank you very much for your answer.


I can certainly understand your argument. Is there, however, a way to get
the schema from the output table, so that we could build a dynamic mapping
between the fields we want to write and the actual schema? If not, is there
any standard way to accomplish what I described, other than hardcoding the
positions of the columns in the code (bad for code reusability)? Any
alternative would be helpful as well.
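
To illustrate the kind of dynamic mapping I have in mind, here is a minimal
sketch in plain Java. It assumes the column names are already available in
table order (how to obtain them from HCatalog is exactly the open question),
and that names can be matched case-insensitively. ColumnMapper and toRow are
hypothetical names for illustration, not part of HCatalog.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: places named values at the position of their column. */
public class ColumnMapper {
    // Column name (lower-cased) -> position in the table schema.
    private final Map<String, Integer> positions = new HashMap<>();
    private final int width;

    /** columnNames is assumed to be in the same order as the table definition. */
    public ColumnMapper(List<String> columnNames) {
        for (int i = 0; i < columnNames.size(); i++) {
            positions.put(columnNames.get(i).toLowerCase(Locale.ROOT), i);
        }
        this.width = columnNames.size();
    }

    /** Builds a row in table order; columns without a value are left null. */
    public Object[] toRow(Map<String, Object> values) {
        Object[] row = new Object[width];
        for (Map.Entry<String, Object> entry : values.entrySet()) {
            Integer pos = positions.get(entry.getKey().toLowerCase(Locale.ROOT));
            if (pos == null) {
                throw new IllegalArgumentException("Unknown column: " + entry.getKey());
            }
            row[pos] = entry.getValue();
        }
        return row;
    }
}
```

With the inventory table from my first mail (word, author, frequency), a
mapper built from those three names would let the job write values by name
and get them back in declaration order, so the positions would live in one
place instead of being scattered through the code.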

Thanks in advance!

Charles

On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <[email protected]> wrote:

> Hey Charles,
>
> Yeah, you need to call setSchema() on HCatOutputFormat explicitly.
> We could assume defaults, but we don't, for the following reason.
> Rows being written may or may not contain partition columns.
> HCatOutputFormat will transparently weed out partition columns if they
> are present in the row. If we assumed defaults, we would have to assume
> that the data does not contain partition columns (we don't store partition
> columns in the data), which is a dangerous assumption that would screw
> things up when we read the data back. So, instead, we ask the user to set
> the schema. You are also correct that the order of columns should be the
> same as the one you declared when creating the table.
>
> Hope it helps,
> Ashutosh
>
>
> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <
> [email protected]> wrote:
>
>> Hi,
>>
>> I've been playing with HCatalog for the past couple of weeks now, and I
>> have a few questions regarding schemas in MR jobs.
>>
>> From what I read in the documentation, schemas are optional, and if one
>> is not specified, the table-level schema is used by default. Here are
>> some extracts from the documentation:
>> You can use the setOutputSchema method to include a projection schema,
>> to specify specific output fields. If a schema is not specified, this
>> defaults to the table level schema.
>> The schema for the data being written out is specified by the setSchema
>> method.
>> If this is not called on the HCatOutputFormat, then by default it is
>> assumed that the partition has the same schema as the current table
>> level schema.
>>
>> Now when I try to omit the schema for HCatInputFormat, it works fine and
>> assumes the default.
>> But when I try to omit the schema for HCatOutputFormat, I get the
>> following error: org.apache.hcatalog.common.HCatException : 9001 :
>> Exception occurred while processing HCat request : It seems that
>> setSchema() is not called on HCatOutputFormat. Please make sure that method
>> is called.
>> From what I read, it expects me to define the schema explicitly with
>> HCatOutputFormat.setSchema(...), but this is exactly the call I would
>> like to omit so that the defaults apply.
>>
>> This is actually important because it seems that, to define the schema,
>> you have to know the order of the table's columns when you build your
>> List<HCatFieldSchema>, which may not always be obvious.
>>
>> Here is how I create my output table in Hive, which works fine when I'm
>> manipulating it while specifying the schema:
>> hive> create table inventory(word STRING, author STRING, frequency INT)
>> stored as RCFILE;
>>
>> I would like to know if I'm doing something wrong, or if this is simply
>> something not yet implemented in 0.2? Any thoughts would be useful.
>>
>> Thanks,
>>
>> Charles
>>
>>
>
