Mark Payne created NIFI-5640:
--------------------------------
Summary: Improve efficiency of Avro Record Reader
Key: NIFI-5640
URL: https://issues.apache.org/jira/browse/NIFI-5640
Project: Apache NiFi
Issue Type: Improvement
Reporter: Mark Payne
Assignee: Mark Payne

There are a few things that we are doing in the Avro Reader that cause subpar
performance. First, in AvroTypeUtil, when converting an Avro GenericRecord to
our Record, building the RecordSchema is slow because we call toString()
(which is quite expensive) on the Avro schema in order to provide a textual
version to RecordSchema. However, the text is typically not used, and
providing the schema text is optional, so we should avoid calling
Schema#toString() whenever possible.
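
One possible way to defer the expensive call is to hand the schema a supplier
of the text rather than the text itself, so the string is only built if
someone asks for it. The sketch below is an illustration only: LazySchemaText
is a hypothetical stand-in and not the actual RecordSchema API, and the
supplier would be something like avroSchema::toString.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical stand-in for a schema holder whose textual form is optional.
// This is NOT NiFi's RecordSchema; it only illustrates deferring toString().
final class LazySchemaText {
    private final Supplier<String> textSupplier; // e.g. avroSchema::toString (assumed)
    private volatile String cachedText;          // filled in on first request

    LazySchemaText(Supplier<String> textSupplier) {
        this.textSupplier = textSupplier;
    }

    // Returns the schema text, computing it only when first requested.
    // Under contention the supplier may run more than once, which is
    // harmless as long as it is idempotent.
    Optional<String> getText() {
        if (textSupplier == null) {
            return Optional.empty();
        }
        String text = cachedText;
        if (text == null) {
            text = textSupplier.get();
            cachedText = text;
        }
        return Optional.ofNullable(text);
    }
}
```

With this shape, records that never have their schema text inspected never
pay for building it.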

AvroTypeUtil also calls getNonNullSubSchemas() frequently. In some cases we
don't need to do this and can avoid creating the sublist. In other cases the
call is needed, but the method uses stream() on an existing List just to
filter out 0 or 1 elements. While stream() makes the code very readable, it is
quite a bit more expensive than iterating over the existing list and adding to
an ArrayList. We should avoid stream() for trivial pieces of code in
time-critical parts of the codebase.
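
To make the comparison concrete, here is a minimal sketch of the two styles
side by side. The Type enum is a stand-in for the Avro schema type (only the
NULL versus non-NULL distinction matters), and the class and method names are
made up for illustration; this is not the AvroTypeUtil code itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: stand-in types, not the real AvroTypeUtil.
final class NonNullSubSchemas {
    // Stand-in for an Avro schema type.
    enum Type { NULL, STRING, INT, LONG }

    // Stream-based version: readable, but builds a pipeline and a
    // collector on every call.
    static List<Type> withStream(List<Type> unionTypes) {
        return unionTypes.stream()
                .filter(t -> t != Type.NULL)
                .collect(Collectors.toList());
    }

    // Loop-based version: a single pre-sized ArrayList and a plain
    // iteration over the existing list.
    static List<Type> withLoop(List<Type> unionTypes) {
        final List<Type> nonNull = new ArrayList<>(unionTypes.size());
        for (final Type t : unionTypes) {
            if (t != Type.NULL) {
                nonNull.add(t);
            }
        }
        return nonNull;
    }
}
```

Both methods return the same elements in the same order; the difference is
only in per-call overhead, which matters when the method runs once per field
per record.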

Additionally, I've found that Avro's GenericDatumReader is extremely
inefficient, at least in some cases, when reading Strings because it uses an
IdentityHashMap to cache details about the schema. IdentityHashMap is far
slower here than a plain HashMap would be, so we could subclass the reader in
order to avoid the slow caching.
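
The subclassing idea can be sketched with stand-in classes. Everything below
is hypothetical: CachingStringReader only mimics the general shape of a reader
that consults an identity-keyed cache on each string read, and it is not
Avro's GenericDatumReader API. The point is just the pattern of overriding the
read path so the lookup is resolved directly.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-in for a reader that caches a per-schema lookup.
// This is NOT Avro's GenericDatumReader.
class CachingStringReader {
    private final Map<Object, Class<?>> stringClassCache = new IdentityHashMap<>();

    // Resolves which Java class should represent strings for a schema.
    protected Class<?> findStringClass(Object schema) {
        return String.class;
    }

    // Default path: consult the identity-keyed cache on every string read.
    protected Object readString(Object schema, String raw) {
        final Class<?> stringClass =
                stringClassCache.computeIfAbsent(schema, this::findStringClass);
        return convert(stringClass, raw);
    }

    // Converts the raw text into the requested representation.
    protected Object convert(Class<?> stringClass, String raw) {
        return stringClass == String.class ? raw : new StringBuilder(raw);
    }
}

// Subclass that bypasses the cache and resolves the class directly.
final class NonCachingStringReader extends CachingStringReader {
    @Override
    protected Object readString(Object schema, String raw) {
        return convert(findStringClass(schema), raw);
    }
}
```

Whether skipping the cache is a net win depends on how cheap the direct
lookup is compared with the map access, so this would need to be measured
against the real reader.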
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)