[ https://issues.apache.org/jira/browse/HIVE-17394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163704#comment-16163704 ]
Carl Steinbach commented on HIVE-17394:
---------------------------------------
The four test failures were already present in previous builds, so this looks
like a clean run.
> AvroSerde is regenerating TypeInfo objects for each nullable Avro field for
> every row
> -------------------------------------------------------------------------------------
>
> Key: HIVE-17394
> URL: https://issues.apache.org/jira/browse/HIVE-17394
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Affects Versions: 1.1.0, 3.0.0
> Reporter: Ratandeep Ratti
> Assignee: Anthony Hsu
> Attachments: AvroSerDe.nps, AvroSerDeUnionTypeInfo.png,
> HIVE-17394.1.patch
>
>
> The following methods in {{AvroDeserializer}} keep regenerating {{TypeInfo}}
> objects for every nullable field in a row:
> {code}
> private Object deserializeNullableUnion(Object datum, Schema fileSchema,
> Schema recordSchema) throws AvroSerdeException {
> // elided
> line 312: return worker(datum, fileSchema, newRecordSchema,
> SchemaToTypeInfo.generateTypeInfo(newRecordSchema, null));
> }
> ..
> private Object deserializeSingleItemNullableUnion(Object datum, Schema fileSchema,
> Schema recordSchema)
> // elided
> line 357: return worker(datum, currentFileSchema, schema,
> SchemaToTypeInfo.generateTypeInfo(schema, null));
> {code}
> This is really bad in terms of performance. I'm not sure why we didn't use
> the TypeInfo we already have instead of generating it again for each nullable
> field. If you look at the {{worker}} method, which calls
> {{deserializeNullableUnion}}, the TypeInfo corresponding to the nullable
> field's column is already determined.
> Moreover, the cache in the {{SchemaToTypeInfo}} class does not help in the
> nullable Avro record case, because checking whether an Avro record schema
> object already exists in the cache requires traversing all the fields in the
> record schema.
> I've attached a profiling snapshot, which shows that most of the time is
> being spent in the cache.
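
To illustrate the cache cost described above, here is a minimal, self-contained sketch. It does not use Avro or Hive classes: {{RecordKey}} is a hypothetical stand-in for a record schema, and the sketch assumes the cache is a hash map keyed by an object with structural {{hashCode}}/{{equals}}. Under that assumption, each lookup with an equal but distinct key object visits every field at least once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StructuralKeyCacheSketch {
    /** Counts how many field names hashCode/equals have visited. */
    static long fieldVisits = 0;

    /** Hypothetical stand-in for a record schema with structural equality. */
    static final class RecordKey {
        final List<String> fields;

        RecordKey(List<String> fields) {
            this.fields = fields;
        }

        @Override
        public int hashCode() {
            int h = 1;
            for (String f : fields) {
                fieldVisits++;
                h = 31 * h + f.hashCode();
            }
            return h;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RecordKey)) {
                return false;
            }
            List<String> other = ((RecordKey) o).fields;
            if (other.size() != fields.size()) {
                return false;
            }
            for (int i = 0; i < fields.size(); i++) {
                fieldVisits++;
                if (!fields.get(i).equals(other.get(i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns the number of field visits caused by the given number of cache hits. */
    static long visitsForLookups(int fieldCount, int lookups) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fieldCount; i++) {
            names.add("f" + i);
        }
        Map<RecordKey, String> cache = new HashMap<>();
        cache.put(new RecordKey(names), "typeInfo");

        fieldVisits = 0; // measure lookups only, not the initial put
        for (int i = 0; i < lookups; i++) {
            // An equal but distinct key object, so the hit cannot be resolved by identity.
            cache.get(new RecordKey(new ArrayList<>(names)));
        }
        return fieldVisits;
    }

    public static void main(String[] args) {
        System.out.println("field visits: " + visitsForLookups(50, 1000));
    }
}
```

Every lookup here is a cache hit, yet the work still grows with the number of fields in the key, which is consistent with the profile showing time spent inside the cache.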
> One way of fixing this, IMO, might be to make use of the column TypeInfo that
> is already passed to the {{worker}} method.
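
The suggestion above could look roughly like the following sketch. It is not the attached patch and does not use Hive's classes: {{TypeInfo}}, {{generateTypeInfo}}, {{worker}}, and both {{deserializeNullable...}} methods here are simplified, hypothetical stand-ins, and the schema is reduced to a plain string. The only point is the shape of the change: the caller hands down the column type it already has, so nothing is regenerated per row.

```java
class ReuseTypeInfoSketch {
    /** Counts calls to the (stand-in) expensive type generation. */
    static int generateCalls = 0;

    /** Hypothetical stand-in for Hive's TypeInfo. */
    static final class TypeInfo {
        final String name;

        TypeInfo(String name) {
            this.name = name;
        }
    }

    /** Stand-in for SchemaToTypeInfo.generateTypeInfo; assumed to be costly. */
    static TypeInfo generateTypeInfo(String schema) {
        generateCalls++;
        return new TypeInfo(schema);
    }

    /** Stand-in for the worker method: deserializes one value given its type. */
    static Object worker(Object datum, TypeInfo type) {
        return type.name + ":" + datum;
    }

    /** Current behaviour as described in the issue: regenerate the type per value. */
    static Object deserializeNullableRegenerating(Object datum, String schema) {
        if (datum == null) {
            return null;
        }
        return worker(datum, generateTypeInfo(schema));
    }

    /** Proposed shape: reuse the column type the caller already determined. */
    static Object deserializeNullableReusing(Object datum, TypeInfo columnType) {
        if (datum == null) {
            return null;
        }
        return worker(datum, columnType);
    }

    /** Deserializes the given number of rows and reports how often types were generated. */
    static int countGenerateCalls(boolean reuse, int rows) {
        generateCalls = 0;
        TypeInfo columnType = new TypeInfo("string"); // resolved once, outside the row loop
        for (int row = 0; row < rows; row++) {
            if (reuse) {
                deserializeNullableReusing("v" + row, columnType);
            } else {
                deserializeNullableRegenerating("v" + row, "string");
            }
        }
        return generateCalls;
    }

    public static void main(String[] args) {
        System.out.println("regenerating: " + countGenerateCalls(false, 1000));
        System.out.println("reusing: " + countGenerateCalls(true, 1000));
    }
}
```

In this toy version the regenerating path performs one generation per row while the reusing path performs none. Whether the real column TypeInfo can be passed through unchanged for every union shape is something the actual patch would need to confirm.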