[jira] [Commented] (NIFI-9462) JsonTreeReader schema inference only examines first record for parts of structure, causing data loss for subsequent records

Otto Fowler (Jira) Wed, 29 Dec 2021 08:13:07 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466504#comment-17466504
 ]


Otto Fowler commented on NIFI-9462:
-----------------------------------

I believe this is working correctly. When you pass multiple JSON 
structures to schema inference, it assumes that they are homogeneous 
and uses the first one.
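
To make the two behaviours under discussion concrete, here is a minimal Python sketch. It is not NiFi's implementation: the function names (`infer_first_only`, `infer_merged`, `project`) are hypothetical, and a "schema" is simplified to a nested dict of field names with no types. It only contrasts deriving the structure from the first record with merging the structure of every record, using the three input records from the report.

```python
import json

# The three input records from the report, one JSON document per line.
LINES = [
    '{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}',
    '{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}',
    '{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", '
    '"metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}',
]
RECORDS = [json.loads(line) for line in LINES]


def _merge(schema, record):
    """Add every field name found in `record` to `schema`, recursing into objects."""
    for key, value in record.items():
        if isinstance(value, dict):
            child = schema.get(key)
            if not isinstance(child, dict):
                child = schema[key] = {}
            _merge(child, value)
        else:
            schema.setdefault(key, None)  # None marks a leaf field


def infer_first_only(records):
    """Hypothetical strategy: assume homogeneous input, look at record 1 only."""
    schema = {}
    _merge(schema, records[0])
    return schema


def infer_merged(records):
    """Hypothetical strategy: union of the fields seen in all records."""
    schema = {}
    for record in records:
        _merge(schema, record)
    return schema


def project(record, schema):
    """Keep only the fields named in `schema`; missing fields become None."""
    out = {}
    for key, sub in schema.items():
        value = record.get(key)
        if isinstance(sub, dict):
            out[key] = project(value, sub) if isinstance(value, dict) else None
        else:
            out[key] = value
    return out


if __name__ == "__main__":
    for name, infer in (("first-only", infer_first_only), ("merged", infer_merged)):
        schema = infer(RECORDS)
        print(name)
        for record in RECORDS:
            print("  ", json.dumps(project(record, schema)))
```

Under the first-only strategy, `temperature` and `other_L3_keys` never enter the schema and are dropped on projection. Under the merged strategy they survive. The output in the report sits between the two: keys at the first and second levels appear merged, while `_source.metadata` follows the first record.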

> JsonTreeReader schema inference only examines first record for parts of 
> structure, causing data loss for subsequent records
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-9462
>                 URL: https://issues.apache.org/jira/browse/NIFI-9462
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Josiah Johnston
>            Priority: Major
>         Attachments: JSON record set writer.png, JSON tree reader.png, 
> NIFI-9462_example.xml, flow.png, updateRecord.png
>
>
> I use GenerateFlowFile with the following JSON Lines content, and send it 
> through an UpdateRecord processor that uses a JsonTreeReader with schema 
> inference. The UpdateRecord processor adds a top-level `s3_key` element with 
> the filename.
> {"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
> {"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
> {"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}
> In the output, the structure of `_source.metadata.*` is always strictly based 
> on the first record, causing data loss for subsequent records that have 
> different fields.
> {"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> In general, it drops all 3rd-level keys that weren't seen in the first record 
> (_source.metadata.temperature in record 2, _source.metadata.other_L3_keys in 
> record 3). This behavior only applies to keys in the 3rd level; schema 
> inference works as documented (scanning through all records) for alternative 
> keys in the 1st & 2nd levels.
> This behavior persists whether I specify the input as JSON lines (shown in 
> this example), or if I rearrange it to be a JSON array.
> I've attached screenshots of a minimal example and settings of JSON reader & 
> writer. I've also attached a template of a minimal example. If you import it, 
> you'll need to create & enable controller services of JsonTreeReader & 
> JsonRecordSetWriter (default values for each).
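
One way to check a flow for this kind of loss, independent of NiFi, is to compare the key paths present in each input record with those in the corresponding output record. The sketch below is only an illustration: the helper names `key_paths` and `lost_paths` are hypothetical, and the sample lines are records 2 and 3 of the input and output quoted above.

```python
import json


def key_paths(value, prefix=""):
    """Return the set of dotted key paths found in a parsed JSON object."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(child, path)
    return paths


def lost_paths(input_line, output_line):
    """Key paths present in the input record but absent from the output record."""
    before = key_paths(json.loads(input_line))
    after = key_paths(json.loads(output_line))
    return sorted(before - after)


# Records 2 and 3 from the report (input), and the output reported for them.
IN_2 = '{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}'
OUT_2 = ('{"_source":{"name":"temperature-changed","metadata":{"voltage":null},'
         '"other_L2_keys":null},"other_L1_keys":null,'
         '"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}')
IN_3 = ('{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", '
        '"metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}')
OUT_3 = ('{"_source":{"name":null,"metadata":{"voltage":6.3},'
         '"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved",'
         '"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}')

if __name__ == "__main__":
    print(lost_paths(IN_2, OUT_2))  # ['_source.metadata.temperature']
    print(lost_paths(IN_3, OUT_3))  # ['_source.metadata.other_L3_keys']
```

Keys that the writer adds (such as `s3_key`, or fields filled with null) are ignored here; only paths that disappear between input and output are reported.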



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
