Re: HCatOutputFormat schema issues

Ashutosh Chauhan Wed, 02 Nov 2011 13:52:35 -0700

You can get the partition columns as follows:

HCatOutputFormat.setOutput()
HCatOutputFormat.getTableSchema() // gets you the data columns
// this will get you the partition columns:
HCatOutputFormat.getJobInfo().getTableInfo().getPartitionColumns()
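
A minimal sketch of the merge step, in plain Java with column names as strings so that it runs without HCatalog on the classpath. The class FullSchemaSketch, the method withPartitionColumns, and the partition column "ds" are hypothetical stand-ins, not HCatalog API. In a real job the two lists would come from the calls above, and the result would be turned back into a schema for HCatOutputFormat.setSchema().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HCatalog: models a schema as an ordered
// list of column names.
class FullSchemaSketch {

    // Returns the data columns in table order, followed by the partition
    // columns. Rejects a partition column that already appears in the list.
    static List<String> withPartitionColumns(List<String> dataColumns,
                                             List<String> partitionColumns) {
        List<String> full = new ArrayList<>(dataColumns);
        for (String partitionColumn : partitionColumns) {
            if (full.contains(partitionColumn)) {
                throw new IllegalArgumentException(
                        "duplicate column: " + partitionColumn);
            }
            full.add(partitionColumn);
        }
        return full;
    }

    public static void main(String[] args) {
        // Data columns from the inventory table in this thread; "ds" is an
        // assumed partition column used only for illustration.
        List<String> data = Arrays.asList("word", "author", "frequency");
        List<String> partitions = Arrays.asList("ds");
        System.out.println(withPartitionColumns(data, partitions));
        // prints [word, author, frequency, ds]
    }
}
```

Keeping the data columns first preserves the declared table order, which is the order the rows have to follow.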


Hope it helps,
Ashutosh

On Wed, Nov 2, 2011 at 07:17, Charles Menguy
<[email protected]> wrote:

> This works fine with a non-partitioned table: I can just set the schema to
> the schema of the table using something like
> HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job));
>
> For a partitioned table, however, as you explained, the getTableSchema call
> returns only the non-partition columns, so this method fails as expected:
> you have to explicitly add the partition columns to the schema, and once you
> do, it works fine. I currently add the partition columns to the table schema
> by hand, which is a bit tedious. Is there by any chance a way to get the
> list of partition columns from HCatOutputFormat or anywhere else, so I can
> append them to the table schema, set the schema, and be done? Or will
> I still have to do it manually?
>
> I also noticed that there is no way to read back the schema that was
> actually set on HCatOutputFormat. You can get the table schema by calling
> getTableSchema, which is great, but I don't see a getter for the schema we
> pass to setSchema. This is not critical, but I just wanted to mention it.
>
> Thanks for the support on this particular issue, that was very helpful!
>
> Charles
>
> On Tue, Nov 1, 2011 at 2:29 PM, Ashutosh Chauhan <[email protected]> wrote:
>
>> Sure. Try it out and let us know how it goes. In the meanwhile, we will
>> get docs fixed.
>>
>> Ashutosh
>>
>> On Tue, Nov 1, 2011 at 10:59, Charles Menguy <
>> [email protected]> wrote:
>>
>>> Thanks for the information, Ashutosh. I'll try what you're suggesting; it
>>> sounds like a good solution for now.
>>>
>>> And yes, I agree with Thomas: it would be a good idea to fix the
>>> following lines in the documentation, as they are pretty confusing:
>>> "The schema for the data being written out is specified by the setSchema
>>> method. If this is not called on the HCatOutputFormat, then by default it
>>> is assumed that the partition has the same schema as the current table
>>> level schema."
>>>
>>> Thanks for the help!
>>>
>>> Charles
>>>
>>> On Tue, Nov 1, 2011 at 1:42 PM, Thomas Weise <[email protected]> wrote:
>>>
>>>>  We should fix the documentation then?
>>>>
>>>> http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html
>>>>
>>>>
>>>>
>>>> On 11/1/11 9:13 AM, "Ashutosh Chauhan" <[email protected]> wrote:
>>>>
>>>> Hey Charles,
>>>>
>>>> After you have called HCatOutputFormat.setOutput(), you can call
>>>> HCatOutputFormat.getTableSchema(), which returns the schema of the table.
>>>> You can then use that schema directly instead of constructing it
>>>> manually.
>>>>
>>>> Hope it helps,
>>>> Ashutosh
>>>>
>>>> On Mon, Oct 31, 2011 at 20:18, Charles Menguy <
>>>> [email protected]> wrote:
>>>>
>>>> Hi Ashutosh,
>>>>
>>>> Thank you very much for your answer.
>>>>
>>>> I can certainly understand your argument. Is there, however, a way to get
>>>> the schema from the output table, so we could create a dynamic mapping
>>>> between the fields we want to write and the actual schema? If not, is
>>>> there a standard way to accomplish what I described, other than
>>>> hardcoding the positions of the columns in the code (bad for code
>>>> reusability)? Any alternative would be helpful as well.
>>>>
>>>> Thanks in advance!
>>>>
>>>> Charles
>>>>
>>>> On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <[email protected]>
>>>> wrote:
>>>>
>>>> Hey Charles,
>>>>
>>>> Yeah, you need to call setSchema() on HCatOutputFormat explicitly. We
>>>> could assume defaults, but we don't, for the following reason. The rows
>>>> being written may or may not contain partition columns. HCatOutputFormat
>>>> transparently weeds out partition columns if they are present in the row.
>>>> If we assumed defaults, we would have to assume that the data does not
>>>> contain partition columns (we don't store partition columns in the data),
>>>> which is a dangerous assumption that would break things when we read the
>>>> data back. So instead we ask the user to set the schema. You are also
>>>> correct that the order of columns should be the same as the one you
>>>> declared when creating the table.
>>>>
>>>> Hope it helps,
>>>> Ashutosh
>>>>
>>>>
>>>> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <
>>>> [email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've been playing with HCatalog for the past couple of weeks, and I
>>>> have a few questions regarding schemas in MR jobs.
>>>>
>>>> From what I read in the documentation, schemas are optional, and if one
>>>> is not specified it defaults to the table-level schema. Here are some
>>>> extracts from the documentation:
>>>> "You can use the setOutputSchema method to include a projection schema,
>>>> to specify specific output fields. If a schema is not specified, this
>>>> defaults to the table level schema."
>>>> "The schema for the data being written out is specified by the
>>>> setSchema method. If this is not called on the HCatOutputFormat, then
>>>> by default it is assumed that the partition has the same schema as the
>>>> current table level schema."
>>>>
>>>> Now, when I omit the schema for HCatInputFormat, it works fine and
>>>> assumes the default. But when I omit the schema for HCatOutputFormat, I
>>>> get the following error:
>>>> org.apache.hcatalog.common.HCatException : 9001 : Exception occurred
>>>> while processing HCat request : It seems that setSchema() is not called on
>>>> HCatOutputFormat. Please make sure that method is called.
>>>> From what I read, it expects me to explicitly define the schema with
>>>> HCatOutputFormat.setSchema(...), but this is exactly the call I would like
>>>> to omit so that the defaults are assumed.
>>>>
>>>> This is actually important because, to define the schema, you have to
>>>> know the order of your table columns when you build your
>>>> List<HCatFieldSchema>, and that order may not always be obvious.
>>>>
>>>> Here is how I create my output table in Hive. Writing to it works fine
>>>> when I specify the schema:
>>>> hive> create table inventory(word STRING, author STRING, frequency INT)
>>>> stored as RCFILE;
>>>>
>>>> I would like to know if I'm doing something wrong, or if this is simply
>>>> something not yet implemented in 0.2? Any thoughts would be useful.
>>>>
>>>> Thanks,
>>>>
>>>> Charles
