Sure. Try it out and let us know how it goes. In the meantime, we will get the docs fixed.
Ashutosh

On Tue, Nov 1, 2011 at 10:59, Charles Menguy <[email protected]> wrote:

> Thanks for the information Ashutosh, I'll try what you're suggesting, and
> this sounds like a good solution for now.
>
> And yes, I agree with Thomas: it would be a good idea to fix the following
> lines in the documentation, as they are pretty confusing:
> The schema for the data being written out is specified by the setSchema
> method.
> If this is not called on the HCatOutputFormat, then by default it is
> assumed that the partition has the same schema as the current table
> level schema.
>
> Thanks for the help!
>
> Charles
>
> On Tue, Nov 1, 2011 at 1:42 PM, Thomas Weise <[email protected]> wrote:
>
>> We should fix the documentation then?
>>
>> http://incubator.apache.org/hcatalog/docs/r0.2.0/inputoutput.html
>>
>>
>> On 11/1/11 9:13 AM, "Ashutosh Chauhan" <[email protected]> wrote:
>>
>> Hey Charles,
>>
>> After you have called HCatOutputFormat.setOutput(), you can call
>> HCatOutputFormat.getTableSchema(), which returns the schema of the table.
>> You can then use that schema without having to construct it manually.
>>
>> Hope it helps,
>> Ashutosh
>>
>> On Mon, Oct 31, 2011 at 20:18, Charles Menguy <[email protected]> wrote:
>>
>> Hi Ashutosh,
>>
>> Thank you very much for your answer.
>>
>> I can certainly understand your argument. Is there, however, a way to get
>> the schema from the output table, so we could create a dynamic mapping
>> between the fields we want to write and the actual schema? If not, is
>> there any standard way to accomplish what I described, other than
>> hardcoding the positions of the columns in the code (bad for code
>> reusability)? Any alternative would be helpful as well.
>>
>> Thanks in advance!
>>
>> Charles
>>
>> On Mon, Oct 31, 2011 at 8:37 PM, Ashutosh Chauhan <[email protected]>
>> wrote:
>>
>> Hey Charles,
>>
>> Yeah, you need to call setSchema() on HCatOutputFormat explicitly.
>> Though we could assume defaults, we don't, for the following reason.
>> The rows being written may or may not contain partition columns.
>> HCatOutputFormat will transparently weed out partition columns if they
>> are present in the row. If we assumed defaults, we would have to assume
>> that the data does not contain partition columns (we don't store
>> partition columns in the data), which is a dangerous assumption that
>> will screw things up when we read the data back. So, instead, we ask the
>> user to set the schema. You are also correct that the order of columns
>> should be the same as the one you declared when creating the table.
>>
>> Hope it helps,
>> Ashutosh
>>
>>
>> On Mon, Oct 31, 2011 at 14:54, Charles Menguy <[email protected]> wrote:
>>
>> Hi,
>>
>> I've been playing with HCatalog for the past couple of weeks now, and I
>> have a few questions regarding schemas in MR jobs.
>>
>> From what I read in the documentation, schemas are optional, and if not
>> specified they default to the table level schemas. Here are some extracts
>> from the documentation:
>> You can use the setOutputSchema method to include a projection schema,
>> to specify specific output fields. If a schema is not specified, this
>> defaults to the table level schema.
>> The schema for the data being written out is specified by the setSchema
>> method.
>> If this is not called on the HCatOutputFormat, then by default it is
>> assumed that the partition has the same schema as the current table
>> level schema
>>
>> Now when I try to omit the schema for HCatInputFormat, it works fine and
>> assumes the default.
>> But when I try to omit the schema for HCatOutputFormat, I get the
>> following error: org.apache.hcatalog.common.HCatException : 9001 :
>> Exception occurred while processing HCat request : It seems that
>> setSchema() is not called on HCatOutputFormat. Please make sure that method
>> is called.
>>
>> From what I read, it expects that I explicitly define the schema with
>> HCatOutputFormat.setSchema(...), but this is exactly what I would like to
>> omit in order to assume defaults.
>>
>> This is actually important because it seems that to define the schema,
>> you have to know the order of your table columns when you specify your
>> List<HCatFieldSchema>, which may not always be obvious.
>>
>> Here is how I create my output table in Hive, which works fine when I'm
>> manipulating it while specifying the schema:
>> hive> create table inventory(word STRING, author STRING, frequency INT)
>> stored as RCFILE;
>>
>> I would like to know if I'm doing something wrong, or if this is simply
>> something not yet implemented in 0.2? Any thoughts would be useful.
>>
>> Thanks,
>>
>> Charles
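
For reference, the "dynamic mapping" asked about in the thread could look roughly like the sketch below. It deliberately avoids any HCatalog calls: ColumnMapping, positions() and toRow() are hypothetical helpers, and the column list is a hardcoded stand-in for the names one would read from the schema returned by HCatOutputFormat.getTableSchema() after HCatOutputFormat.setOutput(), as suggested above. The column names come from the inventory table in the original message; the values are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of HCatalog: places values into a row by
// column name instead of by hardcoded position.
final class ColumnMapping {

    // Maps each column name to its position in the table's declared order.
    static Map<String, Integer> positions(List<String> columnNames) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            index.put(columnNames.get(i), i);
        }
        return index;
    }

    // Builds a row in table order from values keyed by column name.
    // Columns without a value stay null; unknown names are rejected.
    static List<Object> toRow(List<String> columnNames, Map<String, Object> values) {
        Map<String, Integer> index = positions(columnNames);
        Object[] row = new Object[columnNames.size()];
        for (Map.Entry<String, Object> entry : values.entrySet()) {
            Integer pos = index.get(entry.getKey());
            if (pos == null) {
                throw new IllegalArgumentException("Unknown column: " + entry.getKey());
            }
            row[pos] = entry.getValue();
        }
        return Arrays.asList(row);
    }

    public static void main(String[] args) {
        // Assumption: in a real job these names would come from the table
        // schema rather than being listed here.
        List<String> columns = Arrays.asList("word", "author", "frequency");

        Map<String, Object> values = new HashMap<>();
        values.put("frequency", 42);
        values.put("word", "whale");
        values.put("author", "melville");

        // Values land in table order regardless of insertion order.
        System.out.println(toRow(columns, values)); // prints [whale, melville, 42]
    }
}
```

With something along these lines, only the column names appear in the job code, so a change in the table's column order would not require touching the code that fills the rows.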
