[ 
https://issues.apache.org/jira/browse/NIFI-16067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sönke Liebau updated NIFI-16067:
--------------------------------
    Description: 
h3. Description

PutIcebergRecord (added in NIFI-15062) doesn't match columns by name. It writes 
each
record's values into the Iceberg table columns by position, and the position it 
uses is
the one from the *incoming* record schema, not the table schema. So the mapping 
is wrong
whenever the input field order doesn't line up with the table's column order.

If the record has fewer fields than the table, the missing columns aren't set 
to NULL
either - the values just shift left into whatever table column happens to sit 
at that
position. Sometimes that lands on an incompatible type and you get a 
ClassCastException,
which is the good outcome because at least it fails loudly. When the types 
happen to line
up, the row is written with values in the wrong columns and nothing complains.

This gets worse with schema evolution: as soon as you add a column, the 
producer and the
table no longer agree on the column set, and existing flows start silently 
writing data
into the wrong place.
h3. Steps to reproduce

Create an Iceberg table whose column order differs from the incoming record (or 
add a
new column so the record has fewer fields than the table), then push records 
through
PutIcebergRecord into it. Depending on the type alignment you'll either see a
ClassCastException or rows written into the wrong columns.
h3. Root cause

{{DelegatedRecord.get(int position)}} looks the position up against the input 
record
schema instead of the Iceberg struct the wrapper already holds:

nifi-iceberg-processors/.../record/DelegatedRecord.java, lines 78-82:
{code:java}
@Override
public Object get(final int position) {
    final RecordField recordField = record.getSchema().getField(position); // 
input schema
    return record.getValue(recordField);
}
{code}
The Iceberg write path walks the table struct position by position and calls 
`get(i)`
for column i, but '{{{}get(i)'{}}} hands back input field i, so column i ends 
up with input
column i's value no matter what it's named. The struct is only exposed through
{{struct()}} and never used for the actual lookup. `{{{}get(int, Class)`{}}} 
just delegates to `{{{}get(int)`{}}}, so it's wrong too.
h3. Fix

Look the field up by name instead of id, something along the lines below:
{code:java}
@Override
public Object get(final int position) {
    final Types.NestedField field = struct.fields().get(position);
    return record.getValue(field.name());
}
{code}
We are happy to open a PR for this if the issue is confirmed.
h3. Related

NIFI-14162 and NIFI-14614 are the same ordinal-vs-name problem in QueryRecord 
(both
fixed for 2.7.0). This is the Iceberg-bundle version of it.

  was:
h3. Description

PutIcebergRecord (added in NIFI-15062) doesn't match columns by name. It writes 
each
record's values into the Iceberg table columns by position, and the position it 
uses is
the one from the *incoming* record schema, not the table schema. So the mapping 
is wrong
whenever the input field order doesn't line up with the table's column order.

If the record has fewer fields than the table, the missing columns aren't set 
to NULL
either - the values just shift left into whatever table column happens to sit 
at that
position. Sometimes that lands on an incompatible type and you get a 
ClassCastException,
which is the good outcome because at least it fails loudly. When the types 
happen to line
up, the row is written with values in the wrong columns and nothing complains.

This gets worse with schema evolution: as soon as you add a column, the 
producer and the
table no longer agree on the column set, and existing flows start silently 
writing data
into the wrong place.
h3. Steps to reproduce

Create an Iceberg table whose column order differs from the incoming record (or 
add a
new column so the record has fewer fields than the table), then push records 
through
PutIcebergRecord into it. Depending on the type alignment you'll either see a
ClassCastException or rows written into the wrong columns.
h3. Root cause

{{DelegatedRecord.get(int position)}} looks the position up against the input 
record
schema instead of the Iceberg struct the wrapper already holds:

nifi-iceberg-processors/.../record/DelegatedRecord.java, lines 78-82:
{code:java}
@Override
public Object get(final int position) {
    final RecordField recordField = record.getSchema().getField(position); // 
input schema
    return record.getValue(recordField);
}
{code}
The Iceberg write path walks the table struct position by position and calls 
{{`get(i)`}}
for column i, but {{`get(i)`}} hands back input field i, so column i ends up 
with input
column i's value no matter what it's named. The struct is only exposed through
{{struct()}} and never used for the actual lookup. `{{{}get(int, Class)`{}}} 
just delegates to `{{{}get(int)`{}}}, so it's wrong too.
h3. Fix

Look the field up by name instead of id, something along the lines below:
{code:java}
@Override
public Object get(final int position) {
    final Types.NestedField field = struct.fields().get(position);
    return record.getValue(field.name());
}
{code}
We are happy to open a PR for this if the issue is confirmed.
h3. Related

NIFI-14162 and NIFI-14614 are the same ordinal-vs-name problem in QueryRecord 
(both
fixed for 2.7.0). This is the Iceberg-bundle version of it.


> PutIcebergRecord maps fields by ordinal position instead of column name
> -----------------------------------------------------------------------
>
>                 Key: NIFI-16067
>                 URL: https://issues.apache.org/jira/browse/NIFI-16067
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 2.7.0, 2.8.0, 2.7.1, 2.7.2, 2.9.0, 2.10.0
>            Reporter: Sönke Liebau
>            Priority: Major
>
> h3. Description
> PutIcebergRecord (added in NIFI-15062) doesn't match columns by name. It 
> writes each
> record's values into the Iceberg table columns by position, and the position 
> it uses is
> the one from the *incoming* record schema, not the table schema. So the 
> mapping is wrong
> whenever the input field order doesn't line up with the table's column order.
> If the record has fewer fields than the table, the missing columns aren't set 
> to NULL
> either - the values just shift left into whatever table column happens to sit 
> at that
> position. Sometimes that lands on an incompatible type and you get a 
> ClassCastException,
> which is the good outcome because at least it fails loudly. When the types 
> happen to line
> up, the row is written with values in the wrong columns and nothing complains.
> This gets worse with schema evolution: as soon as you add a column, the 
> producer and the
> table no longer agree on the column set, and existing flows start silently 
> writing data
> into the wrong place.
> h3. Steps to reproduce
> Create an Iceberg table whose column order differs from the incoming record 
> (or add a
> new column so the record has fewer fields than the table), then push records 
> through
> PutIcebergRecord into it. Depending on the type alignment you'll either see a
> ClassCastException or rows written into the wrong columns.
> h3. Root cause
> {{DelegatedRecord.get(int position)}} looks the position up against the input 
> record
> schema instead of the Iceberg struct the wrapper already holds:
> nifi-iceberg-processors/.../record/DelegatedRecord.java, lines 78-82:
> {code:java}
> @Override
> public Object get(final int position) {
>     final RecordField recordField = record.getSchema().getField(position); // 
> input schema
>     return record.getValue(recordField);
> }
> {code}
> The Iceberg write path walks the table struct position by position and calls 
> `get(i)`
> for column i, but '{{{}get(i)'{}}} hands back input field i, so column i ends 
> up with input
> column i's value no matter what it's named. The struct is only exposed through
> {{struct()}} and never used for the actual lookup. `{{{}get(int, Class)`{}}} 
> just delegates to `{{{}get(int)`{}}}, so it's wrong too.
> h3. Fix
> Look the field up by name instead of id, something along the lines below:
> {code:java}
> @Override
> public Object get(final int position) {
>     final Types.NestedField field = struct.fields().get(position);
>     return record.getValue(field.name());
> }
> {code}
> We are happy to open a PR for this if the issue is confirmed.
> h3. Related
> NIFI-14162 and NIFI-14614 are the same ordinal-vs-name problem in QueryRecord 
> (both
> fixed for 2.7.0). This is the Iceberg-bundle version of it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to