Mark Payne created NIFI-15745:
---------------------------------
Summary: Schema Inference is very inefficient when complex inner
fields have many nullable values
Key: NIFI-15745
URL: https://issues.apache.org/jira/browse/NIFI-15745
Project: Apache NiFi
Issue Type: Improvement
Components: Extensions
Reporter: Mark Payne
Assignee: Mark Payne
When records contain inner "records" / "objects" and we infer a schema over
many records, any inner fields that are nullable and therefore absent from some
records (especially common in JSON) cause our inference to create a {{UNION}}
of record types. For example, given:
{code:java}
[{
  "name": "Mark",
  "project": {
    "name": "nifi",
    "org": "The Apache Software Foundation",
    "yearEstablished": 2014
  }
},
{
  "name": "John",
  "project": {
    "name": "nifi",
    "language": "Java",
    "jiraProject": "NIFI"
  },
  "language": {
    "name": "Java"
  }
}]
{code}
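Each of these records has an inner record with nullable fields: the first
{{project}} carries {{name}}, {{org}} and {{yearEstablished}}, while the second
carries {{name}}, {{language}} and {{jiraProject}}. Because the two field sets
differ, the schema would define {{project}} as a {{UNION}} of two Record types.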
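To make the two-member union concrete, here is a minimal Python sketch run against the example above. It is not NiFi's inference code: it assumes a simplified model in which a record type is identified purely by its set of (field name, value type) pairs, and the {{shape}} helper is hypothetical.

```python
import json

# The two example records from the issue description.
records = json.loads("""
[{"name": "Mark",
  "project": {"name": "nifi",
              "org": "The Apache Software Foundation",
              "yearEstablished": 2014}},
 {"name": "John",
  "project": {"name": "nifi", "language": "Java", "jiraProject": "NIFI"},
  "language": {"name": "Java"}}]
""")


def shape(obj):
    """Hypothetical helper: model a record type as its (field, type name) pairs."""
    return frozenset((key, type(value).__name__) for key, value in obj.items())


# Assumption: inference keeps one union member per distinct inner shape it sees.
project_union = {shape(record["project"]) for record in records}

for member in project_union:
    print(sorted(member))
```

Under this model {{project_union}} holds two members, one per distinct field set, which mirrors the two Record types described above.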
This works okay for a simple example like this. But consider a FlowFile with
thousands or tens of thousands of Records, where inner objects can be very
complex. The UNION becomes massive, and it takes an inordinate amount of time
to infer the schema.
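The growth can be illustrated with a rough Python sketch. It rests on assumptions that are not in the issue: each distinct set of present fields is treated as one union member, the ten {{opt*}} field names are hypothetical, and the "merged" view (a single record whose fields are all treated as nullable) is shown only as a point of comparison, not as the fix adopted for this ticket.

```python
# Ten hypothetical optional fields on the inner object (names are illustrative only).
OPTIONAL_FIELDS = [f"opt{i}" for i in range(10)]


def make_project(i):
    """Build an inner object whose optional fields follow the bits of i."""
    obj = {"name": "nifi"}
    for bit, field in enumerate(OPTIONAL_FIELDS):
        if (i >> bit) & 1:
            obj[field] = "x"
    return obj


def union_size(objects):
    """Count distinct field sets, i.e. union members under this simplified model."""
    return len({frozenset(obj) for obj in objects})


def merged_fields(objects):
    """Collect every field name seen, as a single merged record would."""
    fields = set()
    for obj in objects:
        fields.update(obj)
    return fields


projects = [make_project(i) for i in range(1000)]
members = union_size(projects)
merged = merged_fields(projects)
print(members, len(merged))
```

With these inputs, 1,000 records produce 1,000 distinct union members, whereas the merged view contains only 11 field names. In general, k independently optional fields allow up to 2^k distinct shapes, which is why the union can grow so quickly on large FlowFiles.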