[ https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466504#comment-17466504 ]
Otto Fowler commented on NIFI-9462:
-----------------------------------

I believe this is working correctly. When you are passing multiple json structures to the schema inference it is going to assume that they are homogeneous and use the first one.

> JsonTreeReader schema inference only examines first record for parts of
> structure, causing data loss for subsequent records
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-9462
>                 URL: https://issues.apache.org/jira/browse/NIFI-9462
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Josiah Johnston
>            Priority: Major
>         Attachments: JSON record set writer.png, JSON tree reader.png,
> NIFI-9462_example.xml, flow.png, updateRecord.png
>
>
> I use GenerateFlowFile with this JSON line content, and send it through an
> UpdateRecord processor that uses a JsonTreeReader with schema inference. The
> UpdateRecord processor adds a top level `s3_key` element with the filename.
>
> {"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
> {"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
> {"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}
>
> In the output, the structure of `_source.metadata.*` is always strictly based
> on the first record, causing data loss for subsequent records that have
> different fields.
> {"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
>
> In general it drops all 3rd level keys that weren't seen in the first record
> (_source.metadata.temperature in record 2, _source.metadata.other_L3_keys in
> record 3). This behavior only applies to keys in the 3rd level; schema
> inference works as documented (scanning through all records) for alternative
> keys in the 1st & 2nd level.
>
> This behavior persists whether I specify the input as JSON lines (shown in
> this example), or if I rearrange it to be a JSON array.
>
> I've attached screenshots of a minimal example and settings of JSON reader &
> writer. I've also attached a template of a minimal example. If you import it,
> you'll need to create & enable controller services of JsonTreeReader &
> JsonRecordSetWriter (default values for each).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
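
To make the reporter's expectation concrete, here is a minimal Python sketch of field-name inference that merges nested keys across every record, not only the first. It is not NiFi's implementation; `merge_shape` and `infer_shape` are hypothetical helpers written for this illustration, and the sketch tracks field names only, ignoring types. The three input records are the ones from the report.

```python
import json

# The three JSON-lines records from the report.
LINES = [
    '{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}',
    '{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}',
    '{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", '
    '"metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}',
]
RECORDS = [json.loads(line) for line in LINES]


def merge_shape(shape, value):
    """Fold one JSON value into an accumulated shape.

    A shape is a nested dict of field names; leaves are None because this
    sketch records which fields exist, not their types.
    """
    if not isinstance(value, dict):
        return shape  # scalar leaf: nothing to descend into
    if shape is None:
        shape = {}
    for key, child in value.items():
        shape[key] = merge_shape(shape.get(key), child)
    return shape


def infer_shape(records):
    """Union of field names at every nesting depth over all records."""
    shape = {}
    for record in records:
        shape = merge_shape(shape, record)
    return shape


# Inferring from the first record alone versus from all three records.
first_only = infer_shape(RECORDS[:1])
merged = infer_shape(RECORDS)
print(sorted(first_only["_source"]["metadata"]))
print(sorted(merged["_source"]["metadata"]))
```

With the merged shape, `_source.metadata` carries `voltage`, `temperature`, and `other_L3_keys`, which is the behavior the report describes as already happening for 1st and 2nd level keys.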