Hi Ashutosh,

Thank you very much for your answer.

I can certainly understand your argument. Is there, however, a way to get the schema from the output table, so that we could build a dynamic mapping between the fields we want to write and the actual schema?

If not, is there a standard way to accomplish what I described, other than hardcoding the positions of the columns in the code (bad for code reusability)? Any alternative would be helpful as well.

Thanks in advance!

Charles

On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <[email protected]> wrote:

> Hey Charles,
>
> Yeah, you need to call setOutputSchema() on HCatOutputFormat explicitly.
> Though we could assume defaults we don't because of the following reason.
> While writing rows they may either contain partition columns or they may
> not. HCatOutputFormat will transparently weed out partition columns if they
> are present in the row. If we assume defaults then we have to assume that
> data does not contain partition columns (we don't store partition columns in
> data) which is a dangerous assumption to make which will screw things up
> when we read back. So, instead we ask user to set the schema. You are also
> correct order of columns should be same as the one you have declared while
> creating tables.
>
> Hope it helps,
> Ashutosh
>
>
> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <[email protected]> wrote:
>
>> Hi,
>>
>> I've been playing with HCatalog for the past couple weeks now, and I have
>> a few questions regarding schemas in MR jobs.
>>
>> From what I read in the documentation, schemas are optional, and if not
>> specified it defaults to the table level schemas. Here are some extracts
>> from the documentation:
>> You can use the setOutputSchema method to include a projection schema,
>> to specify specific output fields. If a schema is not specified, this
>> default to the table level schema.
>> The schema for the data being written out is specified by the setSchema
>> method.
>> If this is not called on the HCatOutputFormat, then by default it is
>> assumed that the partition has the same schema as the current table
>> level schema
>>
>> Now when I try to omit the schema for HCatInputFormat, it works fine and
>> assumes the default.
>> But when I try to omit the schema for HCatOutputFormat, I get the
>> following error: org.apache.hcatalog.common.HCatException : 9001 :
>> Exception occurred while processing HCat request : It seems that
>> setSchema() is not called on HCatOutputFormat. Please make sure that method
>> is called.
>> From what I read, it expects that I explicitly define the schema with
>> HCatOutputFormat.setSchema(...), but this is exactly what I would like to
>> omit to assume defaults.
>>
>> This is actually important because it seems that to define the schema,
>> you have to know the order of your table columns in which you specify your
>> List<HCatFieldSchema>, which may not always be obvious.
>>
>> Here is how I create my output table in Hive, which works fine when I'm
>> manipulating it while specifying the schema:
>> hive> create table inventory(word STRING, author STRING, frequency INT)
>> stored as RCFILE;
>>
>> I would like to know if I'm doing something wrong, or if this is simply
>> something not yet implemented in 0.2? Any thoughts would be useful.
>>
>> Thanks,
>>
>> Charles
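
For illustration only, the "dynamic mapping" asked about at the top of this message might look roughly like the sketch below. It does not use any HCatalog API: `SchemaMapping`, `positions` and `toRow` are hypothetical names, and the sketch simply assumes that the table's column names are available as an ordered list (hard-coded in the usage note from the `create table inventory` statement quoted above). It also assumes column names can be compared case-insensitively. How to obtain that ordered list from the output table is exactly the open question.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of HCatalog: arranges named values
// into the column order declared for a table.
final class SchemaMapping {

    // Build a column-name -> position index from the ordered column names.
    // Names are lower-cased (an assumption about how columns are matched).
    static Map<String, Integer> positions(List<String> tableColumns) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < tableColumns.size(); i++) {
            index.put(tableColumns.get(i).toLowerCase(Locale.ROOT), i);
        }
        return index;
    }

    // Place each named value at the position of its column.
    // Columns with no supplied value stay null; unknown names are rejected.
    static Object[] toRow(List<String> tableColumns, Map<String, Object> values) {
        Map<String, Integer> index = positions(tableColumns);
        Object[] row = new Object[tableColumns.size()];
        for (Map.Entry<String, Object> entry : values.entrySet()) {
            Integer pos = index.get(entry.getKey().toLowerCase(Locale.ROOT));
            if (pos == null) {
                throw new IllegalArgumentException("No such column: " + entry.getKey());
            }
            row[pos] = entry.getValue();
        }
        return row;
    }
}
```

With the column list `word, author, frequency`, calling `SchemaMapping.toRow` with values keyed by name in any order would yield a row ordered as word, author, frequency, so the job code never refers to a column by its numeric position.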
