[ 
https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josiah Johnston updated NIFI-9462:
----------------------------------
    Description: 
I use GenerateFlowFile with the JSON Lines content below and send it through an 
UpdateRecord processor that uses a JsonTreeReader with schema inference. The 
UpdateRecord processor adds a top-level `s3_key` element containing the filename.

{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}

In the output, the structure of `_source.metadata.*` is always strictly based 
on the first record, causing data loss for subsequent records that have 
different fields.

{"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
{"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
{"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
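
The loss can be checked mechanically. The following is a minimal Python sketch, not part of the original flow: it compares the input and output records quoted in this report and lists every non-null input value that does not survive. The helper name `leaves` is made up for illustration.

```python
import json

# Input and output records copied from this report.
INPUT = [
    '{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}',
    '{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}',
    '{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", '
    '"metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}',
]
OUTPUT = [
    '{"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},'
    '"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}',
    '{"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},'
    '"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}',
    '{"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},'
    '"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}',
]


def leaves(value, prefix=""):
    """Map dotted path -> value for every non-null leaf of a nested dict."""
    result = {}
    if isinstance(value, dict):
        for key, child in value.items():
            result.update(leaves(child, f"{prefix}.{key}" if prefix else key))
    elif value is not None:
        result[prefix] = value
    return result


# Collect (record number, path) for each input value missing from the output.
lost = []
for number, (src, out) in enumerate(zip(INPUT, OUTPUT), start=1):
    before, after = leaves(json.loads(src)), leaves(json.loads(out))
    lost += [(number, path) for path in sorted(before) if after.get(path) != before[path]]

for number, path in lost:
    print(f"record {number}: lost {path}")
```

For the records above, only the two `_source.metadata.*` values that are absent from the first record are reported as lost.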

 

In general, it drops all 3rd-level keys that weren't seen in the first record 
(_source.metadata.temperature in record 2, _source.metadata.other_L3_keys in 
record 3). This behavior only applies to keys in the 3rd level; schema 
inference works as documented (scanning through all records) for alternative 
keys in the 1st & 2nd levels.
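
As an illustration of the behavior I expected, here is a minimal Python sketch that unions the field paths of all records at every nesting depth. It is not NiFi's implementation, and the helper name `field_paths` is hypothetical; it only shows which paths a schema built from all three input records would contain beyond those in the first record.

```python
import json

# Input records copied from this report.
INPUT_LINES = [
    '{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}',
    '{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}',
    '{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", '
    '"metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}',
]


def field_paths(value, prefix=""):
    """Return the dotted path of every key in a nested dict, at any depth."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(child, path)
    return paths


records = [json.loads(line) for line in INPUT_LINES]

# Paths visible when only the first record is examined.
first_only = field_paths(records[0])
# Paths visible when every record is examined.
all_records = set().union(*(field_paths(record) for record in records))

# Paths that appear only in later records.
beyond_first = sorted(all_records - first_only)
print(beyond_first)
```

Of the four paths that appear only in later records, the two at the 1st and 2nd levels (`other_L1_keys`, `_source.other_L2_keys`) show up in the output above, while the two under `_source.metadata` do not.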

This behavior persists whether I supply the input as JSON Lines (as shown in this 
example) or rearrange it into a JSON array.

I've attached screenshots of a minimal example and settings of JSON reader & 
writer. 

  was:
I generate a flow file with this JSON line content

{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
{"_source": {"name": "temperature-changed", "foo": 3, "metadata": {"temperature": 19.54}}}

 

Note that the data fields are nested under metadata (3 levels deep), and differ 
between records. 

I send it through an UpdateRecord processor that adds the file name as a 
top-level key of each record. 

The output appears to build a schema from the first record and applies it to 
all subsequent records (ex: temperature in 2nd record is clobbered).

 

{"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"foo":null},"s3file":"2ec085a0-bbdf-4bc3-88da-de9cd85e44e9"}
{"_source":{"name":"temperature-changed","metadata":{"voltage":null},"foo":3},"s3file":"2ec085a0-bbdf-4bc3-88da-de9cd85e44e9"}

 

Note that the `foo` field which appears as a 2nd level key in the 2nd record is 
preserved, but the `temperature` field in the 3rd level is clobbered.

This is a simplified version of a real data source, and I have no control over 
the schema. I've validated this happens to all records in each file in my real 
use case. 

I've attached screenshots of a minimal example and settings of JSON reader & 
writer. 

        Summary: JsonTreeReader schema inference only examines first record for 
parts of structure, causing data loss for subsequent records  (was: JSON Tree 
Reader mangles nested portions of schema)

> JsonTreeReader schema inference only examines first record for parts of 
> structure, causing data loss for subsequent records
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-9462
>                 URL: https://issues.apache.org/jira/browse/NIFI-9462
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Josiah Johnston
>            Priority: Major
>         Attachments: JSON record set writer.png, JSON tree reader.png, 
> flow.png, updateRecord.png
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
