Re: HCatOutputFormat schema issues

Ashutosh Chauhan Tue, 01 Nov 2011 11:30:04 -0700

Sure. Try it out and let us know how it goes. In the meanwhile, we will get
docs fixed.
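
For reference, here is a rough sketch of the dynamic name-to-position mapping discussed further down the thread. It is not HCatalog code: the class ColumnOrderSketch and its methods are hypothetical, and a plain list of column names stands in for the schema you would get back from HCatOutputFormat.getTableSchema() after calling setOutput(). The column names are taken from the inventory table in the original question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: in a real job the column names would be read from the
// table schema obtained after HCatOutputFormat.setOutput() has been called.
class ColumnOrderSketch {

    // Map each column name to its position in the table schema.
    static Map<String, Integer> positionsByName(List<String> columnNames) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            positions.put(columnNames.get(i), i);
        }
        return positions;
    }

    // Arrange named values into table column order; columns without a value stay null.
    static List<Object> toRow(List<String> columnNames, Map<String, Object> valuesByName) {
        Map<String, Integer> positions = positionsByName(columnNames);
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < columnNames.size(); i++) {
            row.add(null);
        }
        for (Map.Entry<String, Object> entry : valuesByName.entrySet()) {
            Integer position = positions.get(entry.getKey());
            if (position == null) {
                throw new IllegalArgumentException("Unknown column: " + entry.getKey());
            }
            row.set(position, entry.getValue());
        }
        return row;
    }

    public static void main(String[] args) {
        // Column order as declared in: create table inventory(word, author, frequency)
        List<String> columns = Arrays.asList("word", "author", "frequency");
        Map<String, Object> values = new HashMap<>();
        values.put("frequency", 42);
        values.put("word", "whale");
        values.put("author", "melville");
        System.out.println(toRow(columns, values));
    }
}
```

The point of the sketch is that the job code only refers to columns by name, so nothing needs to change if the table's column order does; only the list of names read from the schema changes.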


Ashutosh
On Tue, Nov 1, 2011 at 10:59, Charles Menguy
<[email protected]> wrote:

> Thanks for the information, Ashutosh. I'll try what you're suggesting; it
> sounds like a good solution for now.
>
> And yes I agree with Thomas, it would be a good idea to fix the following
> line in the documentation as this is pretty confusing:
> The schema for the data being written out is specified by the setSchema 
> method.
> If this is not called on the HCatOutputFormat, then by default it is
> assumed that the partition has the same schema as the current table
> level schema.
>
> Thanks for the help !
>
> Charles
>
> On Tue, Nov 1, 2011 at 1:42 PM, Thomas Weise <[email protected]> wrote:
>
>>  We should fix the documentation then?
>>
>> http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html
>>
>>
>>
>> On 11/1/11 9:13 AM, "Ashutosh Chauhan" <[email protected]> wrote:
>>
>> Hey Charles,
>>
>> After you have called HCatOutputFormat.setOutput(), you can call
>> HCatOutputFormat.getTableSchema(), which returns the schema of the table.
>> You can then use that schema directly instead of constructing one
>> manually.
>>
>> Hope it helps,
>> Ashutosh
>>
>> On Mon, Oct 31, 2011 at 20:18, Charles Menguy <
>> [email protected]> wrote:
>>
>> Hi Ashutosh,
>>
>> Thank you very much for your answer.
>>
>> I can certainly understand your argument. Is there however a way to get
>> the schema from the output table, so we could potentially create a
>> dynamic mapping of fields you want to write to and the actual schema? If
>> not, is there any standard way to be able to accomplish what I described,
>> other than hardcoding the positions of the columns in the code (bad for
>> code reusability)? Any alternative would be helpful as well.
>>
>> Thanks in advance !
>>
>> Charles
>>
>> On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <[email protected]>
>> wrote:
>>
>> Hey Charles,
>>
>> Yeah, you need to call setSchema() on HCatOutputFormat explicitly.
>> Though we could assume defaults, we don't, for the following reason.
>> The rows being written may or may not contain partition columns.
>> HCatOutputFormat will transparently weed out partition columns if they
>> are present in the row. If we assumed defaults, we would have to assume
>> that the data does not contain partition columns (we don't store partition
>> columns in the data), which is a dangerous assumption that would break
>> things when the data is read back. So, instead, we ask the user to set the
>> schema. You are also correct that the order of columns should be the same
>> as the one you declared when creating the table.
>>
>> Hope it helps,
>> Ashutosh
>>
>>
>> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <
>> [email protected]> wrote:
>>
>> Hi,
>>
>> I've been playing with HCatalog for the past couple weeks now, and I have
>> a few questions regarding schemas in MR jobs.
>>
>> From what I read in the documentation, schemas are optional and, if not
>> specified, default to the table-level schema. Here are some extracts
>> from the documentation:
>> You can use the setOutputSchema method to include a projection schema,
>> to specify specific output fields. If a schema is not specified, this
>> defaults to the table level schema.
>> The schema for the data being written out is specified by the setSchema
>> method.
>> If this is not called on the HCatOutputFormat, then by default it is
>> assumed that the partition has the same schema as the current table
>> level schema.
>>
>> Now when I try to omit the schema for HCatInputFormat, it works fine and
>> assumes the default.
>> But when I try to omit the schema for HCatOutputFormat, I get the
>> following error: org.apache.hcatalog.common.HCatException : 9001 :
>> Exception occurred while processing HCat request : It seems that
>> setSchema() is not called on HCatOutputFormat. Please make sure that method
>> is called.
>> From what I read, it expects me to explicitly define the schema with
>> HCatOutputFormat.setSchema(...), but this is exactly the call I would
>> like to omit so that the defaults are assumed.
>>
>> This is actually important because it seems that to define the schema,
>> you have to know the order of your table columns when building your
>> List<HCatFieldSchema>, which may not always be obvious.
>>
>> Here is how I create my output table in Hive, which works fine when I'm
>> manipulating it while specifying the schema:
>> hive> create table inventory(word STRING, author STRING, frequency INT)
>> stored as RCFILE;
>>
>> I would like to know if I'm doing something wrong, or if this is simply
>> something not yet implemented in 0.2? Any thoughts would be useful.
>>
>> Thanks,
>>
>> Charles
>>
>>
>>
>>
>>
>>
>>
>
>
