rdblue commented on pull request #1744:
URL: https://github.com/apache/iceberg/pull/1744#issuecomment-890564959


   If I remember correctly, the main problem here was a mismatch between the 
expected schema and the actual schema when reading from Avro files. That was 
caused by some unexpected behaviors across a few helpers:
   1. `TypeUtil.select` (that uses `PruneColumns`) has behavior like SQL 
projection using IDs. Any field that is selected by ID is completely projected. 
For example, selecting `{4}` from the schema `3: id bigint, 4: location 
struct<5: lat double, 6: long double>` is like `SELECT location` and produces 
`4: location struct<5: lat double, 6 long double>` as the projected schema. As 
a result, empty structs can't be selected by this method (or in SQL).
   2. `TypeUtil.getProjectedIds` (that uses `GetProjectedIds`) is written 
expecting the current behavior from `TypeUtil.select`. When finding the IDs 
needed to describe a projection, it will return all leaf field IDs. Because 
empty structs contain no leaf fields, there are no IDs returned for an empty 
struct. Reusing the example above, if the requested projection were `4: 
location struct<>` (no fields) then the result is `{}` rather than `{4}` 
because `select` would return all fields of `location`.
   3. Record types are position based and readers assume that the expected 
schema will match the read schema, even though the read schema is produced 
using the file schema, conversions, and calls to the two methods above. An 
empty struct in the requested schema will be silently removed, producing a 
mismatch in the field positions. When reading with `id bigint, location 
struct<>, data string`, the data produced will be `id bigint, data string`, 
resulting in an `ArrayIndexOutOfBounds` exception or a `ClassCastException` 
depending on the rest of the projection.
   4. There is a bug that sometimes avoids this problem by accidentally 
projecting empty structs, but this is not reliable because sometimes the empty 
struct should not be produced or else it introduces the opposite problem: 
projecting `id bigint, data string` actually produces `id bigint, location 
struct<>, data string`.
   
   Hopefully that captures the issues here.
   
   First, I think that `select` behavior is correct. This matches the behavior 
where a list of selected fields should be turned into a list of IDs and 
completely projected. Selecting a struct should select all of its fields, as 
expected by `SELECT location`. The problem is that this is that we assume that 
the output of `getProjectedIds` will be used for `select` and so it can't 
return any non-leaf field IDs.
   
   To fix this, I propose the following:
   * Update the behavior of `getProjectedIds` to return an ID for empty 
structs. The ID should indicate that the struct must be returned without any 
fields, unless there are sub-fields selected by a nested field ID.
   * Add a `TypeUtil.project` method that is the opposite of `getProjectedIds`. 
The combination of `getProjectedIds` and `project` should always be the same 
schema, with empty structs.
   * Update all calls to `select` and avoid passing the result of 
`getProjectedIds` because of the behavior change. In some cases, this will 
actually fix behavior: `StructProjection.create(Schema base, Schema 
projection)` should use `project` to preserve empty structs, for example. But 
other places, like `BaseTableScan.lazyColumnProjection` can avoid calling 
`getProjectedIds` because it isn't needed (and would change behavior).
   * Update most calls that need to produce a schema matching the expected 
schema to use `project` instead of `select` so that empty structs are preserved.
   * Fix the second bug, which is unrelated but also incorrect.
   
   I think that should fix this. To me, that seems like a couple of PRs: one to 
add `TypeUtil.project` that will produce empty structs instead of 
fully-selected structs (maybe just a boolean option to configure 
`PruneColumns`). Then one PR to update the behavior of `getProjectedIds` and 
fix `select` calls that use the result to correctly call `project` or avoid 
`getProjectedIds`. And finally, one PR to fix the second bug.
   
   Does that sound reasonable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to