rdblue commented on pull request #1744:
URL: https://github.com/apache/iceberg/pull/1744#issuecomment-890564959
If I remember correctly, the main problem here was a mismatch between the
expected schema and the actual schema when reading from Avro files. That was
caused by some unexpected behaviors across a few helpers:
1. `TypeUtil.select` (that uses `PruneColumns`) has behavior like SQL
projection using IDs. Any field that is selected by ID is completely projected.
For example, selecting `{4}` from the schema `3: id bigint, 4: location
struct<5: lat double, 6: long double>` is like `SELECT location` and produces
`4: location struct<5: lat double, 6 long double>` as the projected schema. As
a result, empty structs can't be selected by this method (or in SQL).
2. `TypeUtil.getProjectedIds` (that uses `GetProjectedIds`) is written
expecting the current behavior from `TypeUtil.select`. When finding the IDs
needed to describe a projection, it will return all leaf field IDs. Because
empty structs contain no leaf fields, there are no IDs returned for an empty
struct. Reusing the example above, if the requested projection were `4:
location struct<>` (no fields) then the result is `{}` rather than `{4}`
because `select` would return all fields of `location`.
3. Record types are position based and readers assume that the expected
schema will match the read schema, even though the read schema is produced
using the file schema, conversions, and calls to the two methods above. An
empty struct in the requested schema will be silently removed, producing a
mismatch in the field positions. When reading with `id bigint, location
struct<>, data string`, the data produced will be `id bigint, data string`,
resulting in an `ArrayIndexOutOfBounds` exception or a `ClassCastException`
depending on the rest of the projection.
4. There is a bug that sometimes avoids this problem by accidentally
projecting empty structs, but this is not reliable because sometimes the empty
struct should not be produced or else it introduces the opposite problem:
projecting `id bigint, data string` actually produces `id bigint, location
struct<>, data string`.
Hopefully that captures the issues here.
First, I think that `select` behavior is correct. This matches the behavior
where a list of selected fields should be turned into a list of IDs and
completely projected. Selecting a struct should select all of its fields, as
expected by `SELECT location`. The problem is that this is that we assume that
the output of `getProjectedIds` will be used for `select` and so it can't
return any non-leaf field IDs.
To fix this, I propose the following:
* Update the behavior of `getProjectedIds` to return an ID for empty
structs. The ID should indicate that the struct must be returned without any
fields, unless there are sub-fields selected by a nested field ID.
* Add a `TypeUtil.project` method that is the opposite of `getProjectedIds`.
The combination of `getProjectedIds` and `project` should always be the same
schema, with empty structs.
* Update all calls to `select` and avoid passing the result of
`getProjectedIds` because of the behavior change. In some cases, this will
actually fix behavior: `StructProjection.create(Schema base, Schema
projection)` should use `project` to preserve empty structs, for example. But
other places, like `BaseTableScan.lazyColumnProjection` can avoid calling
`getProjectedIds` because it isn't needed (and would change behavior).
* Update most calls that need to produce a schema matching the expected
schema to use `project` instead of `select` so that empty structs are preserved.
* Fix the second bug, which is unrelated but also incorrect.
I think that should fix this. To me, that seems like a couple of PRs: one to
add `TypeUtil.project` that will produce empty structs instead of
fully-selected structs (maybe just a boolean option to configure
`PruneColumns`). Then one PR to update the behavior of `getProjectedIds` and
fix `select` calls that use the result to correctly call `project` or avoid
`getProjectedIds`. And finally, one PR to fix the second bug.
Does that sound reasonable?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]