[
https://issues.apache.org/jira/browse/NIFI-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630956#comment-16630956
]
ASF GitHub Bot commented on NIFI-5640:
--------------------------------------
Github user mattyb149 commented on the issue:
https://github.com/apache/nifi/pull/3036
+1 LGTM, ran a full build and tried various Avro conversions,
reading/writing. Thanks for the improvement! Merging to master.
> Improve efficiency of Avro Record Reader
> ----------------------------------------
>
> Key: NIFI-5640
> URL: https://issues.apache.org/jira/browse/NIFI-5640
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 1.8.0
>
>
> There are a few things that we are doing in the Avro Reader that cause subpar
> performance. Firstly, in the AvroTypeUtil, when converting an Avro
> GenericRecord to our Record, the building of the RecordSchema is slow because
> we call toString() (which is quite expensive) on the Avro schema in order to
> provide a textual version to RecordSchema. However, the text is typically not
> used and it is optional to provide the schema text, so we should avoid
> calling Schema#toString() whenever possible.
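A minimal sketch of the "render the text only when asked" idea described above. It uses only the JDK; `LazySchemaText` and its `Supplier` argument are hypothetical stand-ins, not NiFi's RecordSchema API, and the supplier merely imitates an expensive Schema#toString() call.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for a record schema whose textual form is optional. */
final class LazySchemaText {
    private final Supplier<String> textSupplier; // e.g. avroSchema::toString (assumption)
    private String cachedText;                   // filled in on first request only

    LazySchemaText(Supplier<String> textSupplier) {
        this.textSupplier = textSupplier;
    }

    /** Renders the text on first use; later calls reuse the cached value. */
    synchronized Optional<String> getSchemaText() {
        if (cachedText == null && textSupplier != null) {
            cachedText = textSupplier.get();
        }
        return Optional.ofNullable(cachedText);
    }

    public static void main(String[] args) {
        AtomicInteger renderCount = new AtomicInteger();
        // Stand-in for an expensive toString() on a large schema.
        Supplier<String> expensive = () -> {
            renderCount.incrementAndGet();
            return "{\"type\":\"string\"}";
        };
        LazySchemaText schema = new LazySchemaText(expensive);
        System.out.println("renders before any read: " + renderCount.get());
        schema.getSchemaText();
        schema.getSchemaText();
        System.out.println("renders after two reads: " + renderCount.get());
    }
}
```

If no caller ever asks for the text, the supplier is never invoked, which is the case the description says is typical.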
> The AvroTypeUtil class also calls #getNonNullSubSchemas() a lot. In some
> cases we don't really need to do this and can avoid creating the sublist. In
> other cases, we do need to call it. However, the method uses the stream()
> method on an existing List just to filter out 0 or 1 elements. While use of
> the stream() method makes the code very readable, it is quite a bit more
> expensive than just iterating over the existing list and adding to an
> ArrayList. We should avoid use of the stream() method for trivial pieces
> of code in time-critical parts of the codebase.
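To illustrate the stream-versus-loop point, here is a small JDK-only sketch. Plain strings stand in for the Avro sub-schemas of a union, and `NonNullFilter` with its two methods is a hypothetical name, not the actual AvroTypeUtil code; the real method is assumed to test the schema type rather than compare strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration: drop the "null" branch from a union's type list. */
final class NonNullFilter {
    /** Readable version: builds a stream pipeline for a tiny list. */
    static List<String> withStream(List<String> types) {
        return types.stream()
                .filter(t -> !"null".equals(t))
                .collect(Collectors.toList());
    }

    /** Plain-loop version: same result without the stream machinery. */
    static List<String> withLoop(List<String> types) {
        List<String> nonNull = new ArrayList<>(types.size());
        for (String t : types) {
            if (!"null".equals(t)) {
                nonNull.add(t);
            }
        }
        return nonNull;
    }

    public static void main(String[] args) {
        List<String> union = Arrays.asList("null", "string");
        System.out.println(withStream(union));
        System.out.println(withLoop(union));
    }
}
```

Both methods return the same list; the loop only skips the per-call stream setup, which matters when the method is invoked once per field per record.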
> Additionally, I've found that Avro's GenericDatumReader is extremely
> inefficient, at least in some cases, when reading Strings because it uses an
> IdentityHashMap to cache details about the schema. IdentityHashMap is far
> slower here than a plain HashMap would be, so we could subclass the reader in
> order to avoid the slow caching.
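The subclassing idea can be sketched in isolation as follows. This is JDK-only and does not use Avro: `CachingReader`, `NonCachingReader`, `usesJavaString`, and `computeUsesJavaString` are all hypothetical names chosen for illustration, and the sketch assumes the cached decision is cheap enough to recompute on every call.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical reader that caches a per-schema decision in an IdentityHashMap. */
class CachingReader {
    private final Map<Object, Boolean> cache = new IdentityHashMap<>();

    /** Looks up, or computes and caches, whether strings become java.lang.String. */
    protected boolean usesJavaString(Object schema) {
        return cache.computeIfAbsent(schema, this::computeUsesJavaString);
    }

    /** Stand-in for inspecting a schema property; assumed cheap in this sketch. */
    protected boolean computeUsesJavaString(Object schema) {
        return "String".equals(schema.toString());
    }

    /** Returns a String or the raw CharSequence, depending on the schema. */
    Object readString(Object schema, CharSequence raw) {
        return usesJavaString(schema) ? raw.toString() : raw;
    }

    public static void main(String[] args) {
        Object schema = "String";
        System.out.println(new CachingReader().readString(schema, "abc"));
        System.out.println(new NonCachingReader().readString(schema, "abc"));
    }
}

/** Subclass that bypasses the cache and recomputes the decision directly. */
class NonCachingReader extends CachingReader {
    @Override
    protected boolean usesJavaString(Object schema) {
        return computeUsesJavaString(schema);
    }
}
```

The subclass changes only the lookup path, so callers see the same results; whether skipping the cache is actually faster depends on the cost of the recomputed check and would need to be measured against the real reader.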