[
https://issues.apache.org/jira/browse/NIFI-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630503#comment-16630503
]
Mark Payne commented on NIFI-5640:
----------------------------------
re: Performance. I didn't set up any performance tests specifically for
AvroRecordReader alone. However, I did test ConsumeKafkaRecord_1_0 using the
AvroReader. Before the PR, I was able to pull 60,000 msgs/sec (49 MB/sec) from
Kafka in my particular setup from a single node. After the PR, I was able to
pull a sustained 130,000 msgs/sec (106 MB/sec). At that point, my disks were
fully utilized, so I was not able to pull any faster. Had I used a box with faster
or more disks, I likely would have achieved even better performance. The
throughput of the entire pipeline (pulling messages from Kafka, parsing the
Avro, turning it into an intermediate record, then writing it out as a JSON
FlowFile) more than doubled. Since the Avro Reader is only one stage of that
pipeline, its own performance must have improved by far more than 100%.
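
To make the arithmetic behind that last claim concrete, here is a rough sketch
using the throughput numbers above. The fraction of per-message time spent in
the reader was not measured, so the 0.7 below is purely a hypothetical value,
and ThroughputMath is an illustrative name rather than anything in NiFi.

```java
final class ThroughputMath {

    /** Overall speedup is simply throughput after divided by throughput before. */
    static double overallSpeedup(double before, double after) {
        return after / before;
    }

    /**
     * Amdahl's law solved for the speedup s of a single stage:
     *   overall = 1 / ((1 - f) + f / s)   =>   s = f / (1 / overall - (1 - f))
     * where f is the fraction of per-message time the stage used before the change.
     */
    static double stageSpeedup(double overall, double fraction) {
        double remaining = 1.0 / overall - (1.0 - fraction);
        if (remaining <= 0) {
            throw new IllegalArgumentException(
                "stage fraction is too small to explain the overall speedup");
        }
        return fraction / remaining;
    }

    public static void main(String[] args) {
        // 60,000 msgs/sec before the PR, 130,000 msgs/sec after.
        double overall = overallSpeedup(60_000, 130_000);
        System.out.printf("overall speedup: %.2fx%n", overall);

        // Hypothetical: assume the reader took 70% of the time before the PR.
        double readerFraction = 0.7;
        System.out.printf("implied reader speedup: %.2fx%n",
                stageSpeedup(overall, readerFraction));
    }
}
```

Whatever the real fraction is, any value below 1.0 gives a stage speedup larger
than the overall one, which is the point being made above.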
> Improve efficiency of Avro Record Reader
> ----------------------------------------
>
> Key: NIFI-5640
> URL: https://issues.apache.org/jira/browse/NIFI-5640
> Project: Apache NiFi
> Issue Type: Improvement
> Reporter: Mark Payne
> Assignee: Mark Payne
> Priority: Major
> Fix For: 1.8.0
>
>
> There are a few things that we are doing in the Avro Reader that cause subpar
> performance. Firstly, in the AvroTypeUtil, when converting an Avro
> GenericRecord to our Record, the building of the RecordSchema is slow because
> we call toString() (which is quite expensive) on the Avro schema in order to
> provide a textual version to RecordSchema. However, the text is typically not
> used and it is optional to provide the schema text, so we should avoid
> calling Schema#toString() whenever possible.
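
One way to avoid the eager toString() call is to defer it until the text is
actually requested. The following is only a minimal sketch of that idea:
LazySchemaText is a hypothetical helper, not a NiFi or Avro class, and the
supplier stands in for something like a call to Schema#toString().

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical holder that computes the schema text at most once, on demand. */
final class LazySchemaText {
    private final Supplier<String> textSupplier;
    private String cached;
    private boolean computed;

    LazySchemaText(Supplier<String> textSupplier) {
        this.textSupplier = Objects.requireNonNull(textSupplier);
    }

    /** Runs the expensive supplier on first use only, then returns the cached value. */
    synchronized String get() {
        if (!computed) {
            cached = textSupplier.get();
            computed = true;
        }
        return cached;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // The lambda stands in for an expensive call such as schema.toString().
        LazySchemaText text = new LazySchemaText(() -> {
            calls.incrementAndGet();
            return "{\"type\":\"string\"}";
        });

        System.out.println("calls before first get(): " + calls.get());
        text.get();
        text.get();
        System.out.println("calls after two get()s: " + calls.get());
    }
}
```

If nothing ever asks for the text, the supplier never runs, so the common path
pays nothing for it.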
> The AvroTypeUtil class also calls #getNonNullSubSchemas() a lot. In some
> cases we don't really need to do this and can avoid creating the sublist. In
> other cases, we do need to call it. However, the method uses the stream()
> method on an existing List just to filter out 0 or 1 elements. While use of
> the stream() method makes the code very readable, it is quite a bit more
> expensive than just iterating over the existing list and adding to an
> ArrayList. We should avoid use of the stream() method for trivial pieces
> of code in time-critical parts of the codebase.
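
As a rough illustration of the two styles being compared, the sketch below
filters the null branch out of a union's types both ways. To stay
self-contained it uses a stand-in Kind enum instead of Avro's Schema, and the
class and method names are hypothetical, not the actual AvroTypeUtil code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class NonNullFilter {

    /** Stand-in for the type of each branch in an Avro union. */
    enum Kind { NULL, STRING, INT, LONG }

    /** Readable, but builds a stream pipeline and collector on every call. */
    static List<Kind> viaStream(List<Kind> unionTypes) {
        return unionTypes.stream()
                .filter(kind -> kind != Kind.NULL)
                .collect(Collectors.toList());
    }

    /** Same result using a plain loop and a pre-sized ArrayList. */
    static List<Kind> viaLoop(List<Kind> unionTypes) {
        List<Kind> nonNull = new ArrayList<>(unionTypes.size());
        for (Kind kind : unionTypes) {
            if (kind != Kind.NULL) {
                nonNull.add(kind);
            }
        }
        return nonNull;
    }

    public static void main(String[] args) {
        List<Kind> union = Arrays.asList(Kind.NULL, Kind.STRING);
        System.out.println(viaStream(union));
        System.out.println(viaLoop(union));
    }
}
```

Both methods return the same list. The sketch does not measure the cost
difference; it only shows what the loop-based replacement looks like.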
> Additionally, I've found that Avro's GenericDatumReader is extremely
> inefficient, at least in some cases, when reading Strings because it uses an
> IdentityHashMap to cache details about the schema. IdentityHashMap is far
> slower than a plain HashMap would be, so we could subclass the reader in
> order to avoid the slow caching.
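
For readers less familiar with the two map types mentioned above, this small
sketch shows how they differ in key semantics: IdentityHashMap compares keys
by reference (==), while HashMap uses equals() and hashCode(). It says nothing
about their relative speed and is not the GenericDatumReader code;
MapKeySemantics is a hypothetical name.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

final class MapKeySemantics {

    /** Inserts two equal but distinct String instances and reports the map size. */
    static int entriesAfterTwoEqualKeys(Map<String, Integer> map) {
        // new String(...) forces two separate objects with identical contents.
        String first = new String("schema");
        String second = new String("schema");
        map.put(first, 1);
        map.put(second, 2);
        return map.size();
    }

    public static void main(String[] args) {
        // Reference equality: the two keys are different objects, so 2 entries.
        System.out.println("IdentityHashMap: "
                + entriesAfterTwoEqualKeys(new IdentityHashMap<>()));
        // equals()/hashCode(): the keys are equal, so the second put overwrites, 1 entry.
        System.out.println("HashMap: "
                + entriesAfterTwoEqualKeys(new HashMap<>()));
    }
}
```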
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)