Thanks John and Arvind. Your explanations make sense. These are early days of getting used to the "hive way" of doing things in the world of serdes, storage handlers, meta-hooks, object inspectors, etc. :)
Ashutosh

On Thu, May 27, 2010 at 10:56, Arvind Prabhakar <arv...@cloudera.com> wrote:
> John, Ashutosh,
>
> I agree with John's evaluation on this. Consider the case of writing to a partition of a table. Clearly, the columns being written to will not be the same as those defined in the metadata for the entire table. Moreover, there are cases where intermediate tables (files) may be produced during a particular operation which are not defined by the user. In such cases you are dealing with either a subset of a table's columns or the columns of an intermediate, transient table. And since struct OIs insist on having names for fields, it follows that to cover the general case we can use any unique names where necessary.
>
> The actual data pipeline underneath the Hive query is already semantically verified to fit the appropriate type definitions, so adding the column names would not add any value at runtime; it would only add to the overall processing overhead.
>
> Arvind
>
> On Wed, May 26, 2010 at 6:29 PM, John Sichi <jsi...@facebook.com> wrote:
>
>> Hey Ashutosh,
>>
>> You're right: currently the target table column names come in via initialize in the Properties parameter, e.g. props.getProperty(Constants.LIST_COLUMNS), whereas the object inspector gets _col1, _col2, _col3. (And of course, if you have a custom mapping string like HBase, then that comes in through the initialize Properties parameter via your own private property name.)
>>
>> I haven't looked into the details of why this is, but probably the object inspector references an internally produced row from whatever was upstream (rather than being derived from the target table itself, although the number of columns has to match). I'm not sure this is a bug per se, just something to be aware of.
>> In general, you should try to precompute any data structures needed during initialize so that serialize can be as lean as possible, meaning you probably don't want to be looking at the field names in there anyway.
>>
>> Opinions from other hive devs?
>>
>> JVS
>>
>> On May 21, 2010, at 12:22 PM, Ashutosh Chauhan wrote:
>>
>>> Hi,
>>>
>>> I am writing my own custom serde to write data to an external table. In the serialize() method of my serde I am handed an object and an ObjectInspector. Since this object represents a row, I assume the ObjectInspector is of type StructObjectInspector, and then I get the fields out of the struct using it. When I call field.getFieldName() I expect it to give me the real column name as contained in my table schema in the metastore. But instead I get names like _col1, _col2, _col3...
>>>
>>> The workaround is to store the column names in a list in the initialize() method and then use that list to get the names in serialize(). This is what I am doing now, and it works. It seems the hbase serde does something similar. But it was counterintuitive to me not to get the real column names from getFieldName() but rather some made-up names. If this is not the expected behavior, then I am probably doing something wrong in my serde; if so, I would appreciate it if someone could confirm that. But if this is how things are currently implemented, then I think it's a bug and I will open a jira for it.
>>>
>>> Thanks,
>>> Ashutosh
>>>
>>> PS: I am posting this on the dev list, but if folks think it's more appropriate for the user list, feel free to move it there while replying.
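For anyone landing on this thread later: the workaround discussed above (caching the real column names from the table Properties during initialize() instead of relying on the ObjectInspector's field names) can be sketched roughly as below. This is a minimal stand-alone sketch, not Hive code: the class name ColumnNameCache is made up for illustration, and it only models the relevant property parsing. It assumes the table properties carry the comma-separated column list under the "columns" key (the value behind Constants.LIST_COLUMNS); a real SerDe would do this inside its initialize(Configuration, Properties) override and use the cached list in serialize().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical helper illustrating the pattern: parse the real column
// names once at initialize time, because the StructObjectInspector handed
// to serialize() only exposes internal names like _col1, _col2, ...
public class ColumnNameCache {

    private List<String> columnNames = Collections.emptyList();

    // Mirrors the interesting part of SerDe.initialize(conf, tbl):
    // read the comma-separated column list from the table properties.
    // "columns" is the key behind Constants.LIST_COLUMNS in Hive.
    public void initialize(Properties tbl) {
        String columnNameProperty = tbl.getProperty("columns", "");
        columnNames = columnNameProperty.isEmpty()
                ? Collections.emptyList()
                : Arrays.asList(columnNameProperty.split(","));
    }

    // What serialize() would call instead of field.getFieldName():
    // map a struct field's position to the real metastore column name.
    public String columnName(int fieldIndex) {
        return columnNames.get(fieldIndex);
    }

    public List<String> getColumnNames() {
        return columnNames;
    }
}
```

The key design point, echoing John's advice, is that all of the parsing happens once in initialize(); serialize() only does a cheap positional lookup per row.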