Thanks John and Arvind. Your explanations make sense. These are early days of getting used to the "hive way" of doing things in the world of serdes, storage handlers, meta-hooks, object inspectors, etc. :)
Ashutosh

On Thu, May 27, 2010 at 10:56, Arvind Prabhakar <arv...@cloudera.com> wrote:
> John, Ashutosh,
>
> I agree with John's evaluation on this. Consider the case of writing to a partition of a table. Clearly, the columns being written to will not be the same as those defined in the metadata for the entire table. Moreover, there are cases where intermediate tables (files) may be produced during a particular operation which are not defined by the user. In such cases you are dealing with either a subset of a table's columns or the columns of an intermediate, transient table. And since struct OIs insist on having names for fields, it follows that to cover the general case we can use any unique names where necessary.
>
> The actual data pipeline underneath the Hive query is already semantically verified to fit the appropriate type definitions, so adding the column names would not add any value at runtime; it would only add to the overall processing overhead.
>
> Arvind
>
> On Wed, May 26, 2010 at 6:29 PM, John Sichi <jsi...@facebook.com> wrote:
>
>> Hey Ashutosh,
>>
>> You're right: currently the target table column names come in via initialize in the Properties parameter, e.g. props.getProperty(Constants.LIST_COLUMNS), whereas the object inspector gets _col1, _col2, _col3. (And of course, if you have a custom mapping string like HBase, then that comes in through the initialize Properties parameter via your own private property name.)
>>
>> I haven't looked into the details of why this is, but probably the object inspector references an internally produced row from whatever was upstream (rather than being derived from the target table itself, although the number of columns has to match). I'm not sure this is a bug per se, just something to be aware of.
>> In general, you should try to precompute any data structures needed during initialize so that serialize can be as lean as possible, meaning you probably don't want to be looking at the field names in there anyway.
>>
>> Opinions from other hive devs?
>>
>> JVS
>>
>> On May 21, 2010, at 12:22 PM, Ashutosh Chauhan wrote:
>>
>>> Hi,
>>>
>>> I am writing my own custom serde to write data to an external table. In the serialize() method of my serde I am handed an object and an ObjectInspector. Since this object represents a row, I assume the ObjectInspector is of type StructObjectInspector, and then I get the fields out of the struct using it. When I call field.getFieldName() I expect it to give me the real column name as contained in my table schema in the metastore. But instead I get names like _col1, _col2, _col3...
>>>
>>> The workaround is to store the column names in a list in the initialize() method and then use that list to get the names in serialize(). This is what I am doing now, and it works. It seems the hbase serde does something similar. But it was counterintuitive to me not to get the real column names from getFieldName() but rather some made-up names. If this is not the expected behavior, then I am probably doing something wrong in my serde; if so, I would appreciate it if someone could confirm that. But if this is how things are currently implemented, then I think it's a bug and I will open a jira for it.
>>>
>>> Thanks,
>>> Ashutosh
>>>
>>> PS: I am posting this on the dev list, but if folks think it's more appropriate for the user list, feel free to move it there while replying.
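For anyone landing on this thread later: the workaround discussed above (caching the real column names from the table Properties during initialize() instead of relying on the ObjectInspector's field names) can be sketched roughly as below. This is a minimal stand-alone sketch, not Hive code: the class name ColumnNameCache is made up for illustration, and it only models the relevant property parsing. It assumes the table properties carry the comma-separated column list under the "columns" key (the value behind Constants.LIST_COLUMNS); a real SerDe would do this inside its initialize(Configuration, Properties) override and use the cached list in serialize().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical helper illustrating the pattern: parse the real column
// names once at initialize time, because the StructObjectInspector handed
// to serialize() only exposes internal names like _col1, _col2, ...
public class ColumnNameCache {

    private List<String> columnNames = Collections.emptyList();

    // Mirrors the interesting part of SerDe.initialize(conf, tbl):
    // read the comma-separated column list from the table properties.
    // "columns" is the key behind Constants.LIST_COLUMNS in Hive.
    public void initialize(Properties tbl) {
        String columnNameProperty = tbl.getProperty("columns", "");
        columnNames = columnNameProperty.isEmpty()
                ? Collections.emptyList()
                : Arrays.asList(columnNameProperty.split(","));
    }

    // What serialize() would call instead of field.getFieldName():
    // map a struct field's position to the real metastore column name.
    public String columnName(int fieldIndex) {
        return columnNames.get(fieldIndex);
    }

    public List<String> getColumnNames() {
        return columnNames;
    }
}
```

The key design point, echoing John's advice, is that all of the parsing happens once in initialize(); serialize() only does a cheap positional lookup per row.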