You can get the partition columns as follows:

HCatOutputFormat.setOutput()
HCatOutputFormat.getTableSchema() // gets you the data columns
HCatOutputFormat.getJobInfo().getTableInfo().getPartitionColumns() // this will get you partition columns.

Hope it helps,
Ashutosh

On Wed, Nov 2, 2011 at 07:17, Charles Menguy <[email protected]> wrote:

> This works fine when using a non-partitioned table: I can just set the
> schema to the schema of the table using something like
> HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job));
>
> For a partitioned table however, as you explained, the getTableSchema call
> will only return the non-partition columns, and this method will fail as
> expected, because you have to explicitly add the partition columns to the
> schema; once I do that, it works fine. For this, I currently add the
> partition columns to the table schema manually, which is a bit tedious. Is
> there by any chance a way to get the list of partition columns from
> HCatOutputFormat or anywhere else, so I can just get them for the table,
> add them to the actual schema, set the schema, and be done? Or will I
> still have to do it manually?
>
> I also noticed that there is no way to get the actual schema from the
> HCatOutputFormat. You can get the table schema by calling getTableSchema,
> which is great, but I don't see a way to get the actual schema we are
> setting this way. This is not critical, but I just wanted to mention it.
>
> Thanks for the support on this particular issue, that was very helpful!
>
> Charles
>
> On Tue, Nov 1, 2011 at 2:29 PM, Ashutosh Chauhan <[email protected]> wrote:
>
>> Sure. Try it out and let us know how it goes. In the meanwhile, we will
>> get docs fixed.
>>
>> Ashutosh
>>
>> On Tue, Nov 1, 2011 at 10:59, Charles Menguy <[email protected]> wrote:
>>
>>> Thanks for the information Ashutosh, I'll try what you're suggesting but
>>> this sounds like a good solution for now.
>>>
>>> And yes I agree with Thomas, it would be a good idea to fix the
>>> following line in the documentation as this is pretty confusing:
>>> The schema for the data being written out is specified by the setSchema
>>> method. If this is not called on the HCatOutputFormat, then by default
>>> it is assumed that the partition has the same schema as the current
>>> table level schema.
>>>
>>> Thanks for the help!
>>>
>>> Charles
>>>
>>> On Tue, Nov 1, 2011 at 1:42 PM, Thomas Weise <[email protected]> wrote:
>>>
>>>> We should fix the documentation then?
>>>>
>>>> http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html
>>>>
>>>> On 11/1/11 9:13 AM, "Ashutosh Chauhan" <[email protected]> wrote:
>>>>
>>>> Hey Charles,
>>>>
>>>> After you have done HCatOutputFormat.setOutput(), you can do
>>>> HCatOutputFormat.getTableSchema(), which will return the schema of the
>>>> table. You can then use it without having to construct the schema
>>>> manually.
>>>>
>>>> Hope it helps,
>>>> Ashutosh
>>>>
>>>> On Mon, Oct 31, 2011 at 20:18, Charles Menguy <[email protected]> wrote:
>>>>
>>>> Hi Ashutosh,
>>>>
>>>> Thank you very much for your answer.
>>>>
>>>> I can certainly understand your argument. Is there however a way to get
>>>> the schema from the output table, so we could potentially create a
>>>> dynamic mapping between the fields you want to write and the actual
>>>> schema? If not, is there any standard way to accomplish what I
>>>> described, other than hardcoding the positions of the columns in the
>>>> code (bad for code reusability)? Any alternative would be helpful as
>>>> well.
>>>>
>>>> Thanks in advance!
>>>>
>>>> Charles
>>>>
>>>> On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <[email protected]> wrote:
>>>>
>>>> Hey Charles,
>>>>
>>>> Yeah, you need to call setOutputSchema() on HCatOutputFormat
>>>> explicitly. Though we could assume defaults, we don't, for the
>>>> following reason. Rows being written may or may not contain partition
>>>> columns. HCatOutputFormat will transparently weed out partition columns
>>>> if they are present in the row. If we assume defaults, then we have to
>>>> assume that the data does not contain partition columns (we don't store
>>>> partition columns in data), which is a dangerous assumption to make and
>>>> will screw things up when we read back. So, instead we ask the user to
>>>> set the schema. You are also correct that the order of columns should
>>>> be the same as the one you declared when creating the table.
>>>>
>>>> Hope it helps,
>>>> Ashutosh
>>>>
>>>> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've been playing with HCatalog for the past couple of weeks now, and I
>>>> have a few questions regarding schemas in MR jobs.
>>>>
>>>> From what I read in the documentation, schemas are optional, and if not
>>>> specified they default to the table level schemas. Here are some
>>>> extracts from the documentation:
>>>> You can use the setOutputSchema method to include a projection schema,
>>>> to specify specific output fields. If a schema is not specified, this
>>>> defaults to the table level schema.
>>>> The schema for the data being written out is specified by the
>>>> setSchema method. If this is not called on the HCatOutputFormat, then
>>>> by default it is assumed that the partition has the same schema as the
>>>> current table level schema
>>>>
>>>> Now when I try to omit the schema for HCatInputFormat, it works fine
>>>> and assumes the default.
>>>> But when I try to omit the schema for HCatOutputFormat, I get the
>>>> following error: org.apache.hcatalog.common.HCatException : 9001 :
>>>> Exception occurred while processing HCat request : It seems that
>>>> setSchema() is not called on HCatOutputFormat. Please make sure that
>>>> method is called.
>>>> From what I read, it expects that I explicitly define the schema with
>>>> HCatOutputFormat.setSchema(...), but this is exactly what I would like
>>>> to omit so that the defaults are assumed.
>>>>
>>>> This is actually important because it seems that to define the schema,
>>>> you have to know the order of your table columns when you build your
>>>> List<HCatFieldSchema>, which may not always be obvious.
>>>>
>>>> Here is how I create my output table in Hive, which works fine when I'm
>>>> manipulating it while specifying the schema:
>>>> hive> create table inventory(word STRING, author STRING, frequency INT)
>>>> stored as RCFILE;
>>>>
>>>> I would like to know if I'm doing something wrong, or if this is simply
>>>> something not yet implemented in 0.2? Any thoughts would be useful.
>>>>
>>>> Thanks,
>>>>
>>>> Charles
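
The merge step discussed in this thread (take the data columns returned by getTableSchema, then add the partition columns before calling setSchema) could be sketched roughly as below. This is only an illustration under assumptions: Column and withPartitionColumns are hypothetical stand-ins and not HCatalog types, the partition column ds is invented (the inventory table in the thread is not partitioned), and no HCatalog call is made. Placing partition columns after the data columns mirrors how Hive lists them; the thread only says the column order should match the table declaration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in sketch: none of these types come from HCatalog.
class PartitionSchemaSketch {

    /** Hypothetical column description: a name plus a Hive type string. */
    static final class Column {
        final String name;
        final String type;

        Column(String name, String type) {
            this.name = name;
            this.type = type;
        }

        @Override
        public String toString() {
            return name + ":" + type;
        }
    }

    /**
     * Returns the data columns in their declared order, followed by any
     * partition columns not already present. Names are compared without
     * regard to case, since Hive column names are case-insensitive.
     */
    static List<Column> withPartitionColumns(List<Column> dataColumns,
                                             List<Column> partitionColumns) {
        List<Column> merged = new ArrayList<>(dataColumns);
        Set<String> seen = new HashSet<>();
        for (Column c : dataColumns) {
            seen.add(c.name.toLowerCase());
        }
        for (Column p : partitionColumns) {
            // Set.add returns false when the name was already recorded.
            if (seen.add(p.name.toLowerCase())) {
                merged.add(p);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Data columns of the inventory table quoted in the thread.
        List<Column> data = Arrays.asList(
                new Column("word", "STRING"),
                new Column("author", "STRING"),
                new Column("frequency", "INT"));
        // Invented partition column, for illustration only.
        List<Column> partitions = Arrays.asList(new Column("ds", "STRING"));

        System.out.println(withPartitionColumns(data, partitions));
    }
}
```

In a real job, the two input lists would come from the calls Ashutosh names at the top of the thread, and the merged result would be converted to whatever schema type setSchema accepts; that conversion is not shown here because its exact form was not given in the thread.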
