Mark Payne created NIFI-15745:
---------------------------------

             Summary: Schema Inference is very inefficient when complex inner 
fields have many nullable values
                 Key: NIFI-15745
                 URL: https://issues.apache.org/jira/browse/NIFI-15745
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Extensions
            Reporter: Mark Payne
            Assignee: Mark Payne


When records contain inner "records" / "objects" and we infer a schema over 
many of them, any inner field that is nullable, and therefore sometimes absent 
(especially common in JSON), causes our inference to create a {{UNION}} of 
record types. For example, if we had:
{code:java}
[{
  "name": "Mark",
  "project": {
    "name": "nifi",
    "org": "The Apache Software Foundation",
    "yearEstablished": 2014
  }
},
{
  "name": "John",
  "project": {
    "name": "nifi",
    "language": "Java",
    "jiraProject": "NIFI"
  },
  "language": {
    "name": "Java"
  }
}]
{code}
Each of these records has an inner record with a different set of fields 
present, so the inferred schema would define {{project}} as a {{UNION}} of two 
Record types.
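
As a rough illustration of why two union members result, the sketch below 
models an inner record type purely as the set of field names that were 
present. This is a simplified stand-in and not NiFi's inference code; the class 
and method names ({{UnionMembersSketch}}, {{unionMembers}}) are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified model, NOT NiFi's inference code: an inner record "type" is
// represented only by the set of field names that were present.
class UnionMembersSketch {

    // Collects the distinct field-name sets observed for one inner field.
    // Each distinct set stands for one member of the inferred UNION.
    static Set<Set<String>> unionMembers(List<Set<String>> observedFieldSets) {
        return new LinkedHashSet<>(observedFieldSets);
    }

    public static void main(String[] args) {
        List<Set<String>> project = List.of(
            Set.of("name", "org", "yearEstablished"),    // from the "Mark" record
            Set.of("name", "language", "jiraProject"));  // from the "John" record

        // The two field sets differ, so both are kept as separate members.
        System.out.println(unionMembers(project).size() + " union members");
    }
}
```

Under this model, two inner records add a single member only when exactly the 
same fields are present in both; any difference in which nullable fields appear 
adds another member.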

This works okay for a simple example like this. But consider a FlowFile with 
thousands or tens of thousands of Records, where inner objects can be very 
complex. Every distinct combination of present fields can add another member 
to the {{UNION}}, so it becomes massive, and it takes an inordinate amount of 
time to infer the schema.
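
To give a feel for the growth, the sketch below counts how many distinct inner 
record "shapes" show up across 10,000 records as the number of nullable fields 
increases. It is a toy model under stated assumptions, not a measurement of 
NiFi: the presence pattern is synthetic (field j is present in record i when 
bit j of i is set), and {{UnionGrowthSketch}} / {{distinctShapes}} are 
hypothetical names.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Toy model, NOT NiFi code: counts distinct combinations of present fields.
class UnionGrowthSketch {

    // Record i includes optional field j when bit j of i is set. This is a
    // deterministic stand-in for real-world sparsity.
    // Assumes optionalFields <= 31 so the int shift below is well defined.
    static int distinctShapes(int recordCount, int optionalFields) {
        Set<BitSet> shapes = new HashSet<>();
        for (int i = 0; i < recordCount; i++) {
            BitSet present = new BitSet(optionalFields);
            for (int j = 0; j < optionalFields; j++) {
                if (((i >> j) & 1) == 1) {
                    present.set(j);
                }
            }
            shapes.add(present); // duplicates collapse; new shapes accumulate
        }
        return shapes.size();
    }

    public static void main(String[] args) {
        int records = 10_000;
        for (int fields : new int[] {2, 5, 10, 20}) {
            System.out.println(fields + " nullable fields -> "
                + distinctShapes(records, fields) + " distinct shapes");
        }
    }
}
```

In this model the number of shapes is bounded by the smaller of the record 
count and 2^k for k nullable fields, so with enough nullable fields nearly 
every record can contribute its own union member.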



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
