[ https://issues.apache.org/jira/browse/NIFI-16067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sönke Liebau updated NIFI-16067:
--------------------------------
Description:
h3. Description
PutIcebergRecord (added in NIFI-15062) doesn't match columns by name. It
writes each record's values into the Iceberg table columns by position, and
the position it uses is the one from the *incoming* record schema, not the
table schema. So the mapping is wrong whenever the input field order doesn't
line up with the table's column order.

If the record has fewer fields than the table, the missing columns aren't set
to NULL either - the values just shift left into whatever table column happens
to sit at that position. Sometimes that lands on an incompatible type and you
get a ClassCastException, which is the good outcome because at least it fails
loudly. When the types happen to line up, the row is written with values in
the wrong columns and nothing complains.

This gets worse with schema evolution: as soon as you add a column, the
producer and the table no longer agree on the column set, and existing flows
start silently writing data into the wrong place.
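
The two situations described above can be tried out without NiFi or Iceberg. The following is a minimal sketch, not the processor's code: {{PositionalMappingSketch}} and {{mapByPosition}} are hypothetical names, the input record is modelled as an insertion-ordered map and the table as a list of column names. The loop stops at the last input field; what the real processor does for any remaining table positions is not modelled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of NiFi or Iceberg.
final class PositionalMappingSketch {

    // Mimics the reported behaviour: table column i receives input field i,
    // whatever either of them is called.
    static Map<String, Object> mapByPosition(final List<String> tableColumns,
                                             final LinkedHashMap<String, Object> inputRecord) {
        final List<Object> inputValues = new ArrayList<Object>(inputRecord.values());
        final Map<String, Object> row = new LinkedHashMap<>();
        final int limit = Math.min(tableColumns.size(), inputValues.size());
        for (int i = 0; i < limit; i++) {
            row.put(tableColumns.get(i), inputValues.get(i));
        }
        return row;
    }

    public static void main(final String[] args) {
        // Same fields, different order: the two names are swapped silently.
        final LinkedHashMap<String, Object> reordered = new LinkedHashMap<>();
        reordered.put("id", 1L);
        reordered.put("last_name", "Lovelace");
        reordered.put("first_name", "Ada");
        System.out.println(mapByPosition(List.of("id", "first_name", "last_name"), reordered));
        // prints {id=1, first_name=Lovelace, last_name=Ada}

        // The table has an "email" column the producer doesn't send:
        // "name" shifts left into "email" instead of "email" being NULL.
        final LinkedHashMap<String, Object> fewerFields = new LinkedHashMap<>();
        fewerFields.put("id", 2L);
        fewerFields.put("name", "Grace");
        System.out.println(mapByPosition(List.of("id", "email", "name"), fewerFields));
        // prints {id=2, email=Grace}
    }
}
```

In both runs every value has a type the target column would accept, which is why nothing fails in this variant of the problem.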
h3. Steps to reproduce
Create an Iceberg table whose column order differs from the incoming record
(or add a new column so the record has fewer fields than the table), then push
records through PutIcebergRecord into it. Depending on the type alignment
you'll either see a ClassCastException or rows written into the wrong columns.
h3. Root cause
{{DelegatedRecord.get(int position)}} looks the position up against the input
record schema instead of the Iceberg struct the wrapper already holds:
nifi-iceberg-processors/.../record/DelegatedRecord.java, lines 78-82:
{code:java}
@Override
public Object get(final int position) {
    final RecordField recordField = record.getSchema().getField(position); // input schema
    return record.getValue(recordField);
}
{code}
The Iceberg write path walks the table struct position by position and calls
{{get(i)}} for column i, but {{get(i)}} hands back input field i, so column i
ends up with input field i's value no matter what it's named. The struct is
only exposed through {{struct()}} and never used for the actual lookup.
{{get(int, Class)}} just delegates to {{get(int)}}, so it's wrong too.
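
To make the delegation concrete, here is a simplified model of the wrapper's shape. It is a sketch under assumptions: {{BuggyWrapperSketch}} is a hypothetical class, the struct is reduced to a list of column names and the record to an insertion-ordered map, and {{Class.cast}} is assumed as the conversion in the typed accessor (the actual DelegatedRecord may do this differently). It shows why the typed accessor inherits the wrong lookup, and how the same lookup surfaces either as a ClassCastException or as a silently misplaced value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

// Hypothetical stand-in for illustration; not the NiFi DelegatedRecord class.
final class BuggyWrapperSketch {
    private final List<String> struct;                   // table columns, in table order
    private final LinkedHashMap<String, Object> record;  // input fields, in input order

    BuggyWrapperSketch(final List<String> struct, final LinkedHashMap<String, Object> record) {
        this.struct = struct;
        this.record = record;
    }

    // Exposed to callers, but never consulted by get().
    List<String> struct() {
        return struct;
    }

    // The position is resolved against the input field order.
    Object get(final int position) {
        return new ArrayList<Object>(record.values()).get(position);
    }

    // Delegates to get(int), so it returns the same wrong value.
    <T> T get(final int position, final Class<T> javaClass) {
        return javaClass.cast(get(position));
    }

    public static void main(final String[] args) {
        final LinkedHashMap<String, Object> input = new LinkedHashMap<>();
        input.put("id", 1L);
        input.put("name", "Ada");
        input.put("age", 36);

        // Table order is (id, age, name); the writer asks for column 1 as an Integer.
        final BuggyWrapperSketch loud = new BuggyWrapperSketch(List.of("id", "age", "name"), input);
        try {
            loud.get(1, Integer.class);
        } catch (final ClassCastException e) {
            System.out.println("loud failure: column 'age' received a String");
        }

        // Table order is (id, nickname, name); column 1 is a string on both
        // sides, so this call does not fail.
        final BuggyWrapperSketch silent = new BuggyWrapperSketch(List.of("id", "nickname", "name"), input);
        System.out.println(silent.struct().get(1) + " = " + silent.get(1, String.class));
        // prints nickname = Ada
    }
}
```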
h3. Fix
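Look the field up by name instead of by position in the input schema: take the
column name from the Iceberg struct at the requested position, then read that
name from the record. Something along these lines: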
{code:java}
@Override
public Object get(final int position) {
    final Types.NestedField field = struct.fields().get(position);
    return record.getValue(field.name());
}
{code}
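
The same idea in a self-contained form, for anyone who wants to try it without the NiFi and Iceberg dependencies. This is a sketch, not the proposed patch: {{NameBasedWrapperSketch}} is a hypothetical class, the struct is reduced to a list of column names and the record to a plain map. It assumes a column that is missing from the input should come back as null; whether the NiFi record's {{getValue(String)}} behaves that way for absent fields should be checked against the actual implementation. Nested structs and name case sensitivity are not handled here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not the NiFi DelegatedRecord class.
final class NameBasedWrapperSketch {
    private final List<String> struct;         // table columns, in table order
    private final Map<String, Object> record;  // input fields, keyed by name

    NameBasedWrapperSketch(final List<String> struct, final Map<String, Object> record) {
        this.struct = struct;
        this.record = record;
    }

    // The position is resolved against the table struct, the value by name.
    Object get(final int position) {
        final String columnName = struct.get(position);
        return record.get(columnName); // null when the input has no such field
    }

    // Still delegates to get(int), which now returns the right value.
    <T> T get(final int position, final Class<T> javaClass) {
        return javaClass.cast(get(position));
    }

    public static void main(final String[] args) {
        final List<String> columns = List.of("id", "email", "name");

        // The input is in a different order and has no "email" field.
        final Map<String, Object> input = new LinkedHashMap<>();
        input.put("name", "Grace");
        input.put("id", 2L);

        final NameBasedWrapperSketch wrapper = new NameBasedWrapperSketch(columns, input);
        for (int i = 0; i < columns.size(); i++) {
            System.out.println(columns.get(i) + " = " + wrapper.get(i));
        }
        // prints:
        // id = 2
        // email = null
        // name = Grace
    }
}
```

With this lookup the input field order no longer matters, and a column the producer doesn't send stays empty instead of receiving a neighbouring value.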
We are happy to open a PR for this if the issue is confirmed.
h3. Related
NIFI-14162 and NIFI-14614 are the same ordinal-vs-name problem in QueryRecord
(both fixed for 2.7.0). This is the Iceberg-bundle version of it.
> PutIcebergRecord maps fields by ordinal position instead of column name
> -----------------------------------------------------------------------
>
> Key: NIFI-16067
> URL: https://issues.apache.org/jira/browse/NIFI-16067
> Project: Apache NiFi
> Issue Type: Bug
> Components: Core Framework
> Affects Versions: 2.7.0, 2.8.0, 2.7.1, 2.7.2, 2.9.0, 2.10.0
> Reporter: Sönke Liebau
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)