ZiyaZa opened a new pull request, #52557:
URL: https://github.com/apache/spark/pull/52557
### What changes were proposed in this pull request?
Currently, if all fields of a struct mentioned in the read schema are
missing from a Parquet file, the reader populates the struct with nulls.
This PR changes the scan behavior: if the struct exists in the Parquet
schema but none of the fields from the read schema are present in the file, we
instead pick an arbitrary field from the Parquet file to read and use it to
populate NULLs (as well as outer NULLs and array sizes when the struct is
nested inside another nested type).
This is done by changing the schema requested by the readers: when clipping
the Parquet file schema against the Spark schema, we add one extra field to
the requested schema. As a result, the readers actually read and return more
data than requested, which can cause problems. This only affects the
`VectorizedParquetRecordReader`; on the other read path via parquet-mr,
`ParquetFileFormat` already applies an `UnsafeProjection` that outputs only
the requested schema fields.
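The clipping step can be sketched in plain Python (an illustration of the idea only, not Spark's actual Java clipping code; the function names here are hypothetical):

```python
# Hypothetical sketch: when clipping a struct in the Parquet file schema
# to the Spark read schema, keep the requested fields that exist in the
# file; if none of them exist, request one arbitrary file field instead,
# so the reader still sees the struct's definition levels.
def clip_struct_fields(file_fields, requested_fields, pick_arbitrary):
    present = [f for f in requested_fields if f in file_fields]
    if present:
        return present                       # normal case: unchanged
    return [pick_arbitrary(file_fields)]     # new behavior in this PR

# Requested field "b" is missing from the file, so we fall back to an
# arbitrary file field purely for its definition levels.
print(clip_struct_fields(["x", "y"], ["b"], lambda fs: fs[0]))
```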
To ensure `VectorizedParquetRecordReader` returns only the fields Spark
requested, we create the `ColumnarBatch` with vectors that match the
requested schema (the additional fields are dropped by recursively matching
`sparkSchema` against `sparkRequestedSchema` and ensuring structs have the
same length in both). The `ParquetColumnVector`s are then responsible for
allocating dummy vectors that hold the extra data temporarily while reading,
without exposing them to the outside.
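The trimming of the extra fields can be sketched as follows (plain Python over a toy schema representation, not the actual implementation; structs are modeled as dicts and scalar types as strings):

```python
# Hypothetical sketch: the extra field is appended during clipping, so
# recursively keeping only the fields named in the requested schema
# recovers vectors that match what Spark asked for.
def trim_to_requested(clipped, requested):
    """Drop fields present in `clipped` but absent from `requested`."""
    trimmed = {}
    for name, req_type in requested.items():  # iterate requested fields only
        clipped_type = clipped[name]
        if isinstance(req_type, dict):        # nested struct: recurse
            trimmed[name] = trim_to_requested(clipped_type, req_type)
        else:                                 # scalar: keep as-is
            trimmed[name] = clipped_type
    return trimmed

# "extra" was added while clipping and must not reach the ColumnarBatch.
clipped = {"id": "int", "s": {"b": "int", "extra": "int"}}
requested = {"id": "int", "s": {"b": "int"}}
print(trim_to_requested(clipped, requested))
```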
The heuristic for picking the arbitrary field is as follows: we pick a field
at the lowest array nesting level (i.e., any scalar field is preferred to
`array`, which is preferred to `array<array>`), and among scalars we prefer
narrower fields over wider ones, and any fixed-width scalar over strings.
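A minimal sketch of that ranking (the type names and byte widths below are illustrative assumptions, not Spark's exact ordering):

```python
# Hypothetical cost table: fixed-width scalars ranked by byte width,
# strings ranked after every fixed-width scalar.
SCALAR_COST = {"boolean": 1, "byte": 1, "short": 2, "int": 4, "long": 8,
               "float": 4, "double": 8, "string": 1000}

def field_rank(field):
    """field = (name, type_name, array_nesting_depth); lower rank wins.
    Array nesting depth dominates, then the scalar type cost."""
    name, type_name, depth = field
    return (depth, SCALAR_COST.get(type_name, 1000))

def pick_field(fields):
    """Pick the cheapest field to read for definition levels."""
    return min(fields, key=field_rank)

# "c" wins: depth 0 beats depth 1, and int beats string at equal depth.
print(pick_field([("a", "string", 0), ("b", "long", 1), ("c", "int", 0)]))
```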
### Why are the changes needed?
This is a bug fix: depending on the requested fields, we incorrectly assumed
non-null struct values to be missing from the file and returned null values
for them.
### Does this PR introduce _any_ user-facing change?
Yes. We previously assumed a struct to be null whenever all the fields we
were trying to read from a Parquet file were missing from that file, even if
the file contained other fields whose definition levels could be used.
See the example from the Jira ticket below:
```python
df_a = spark.sql('SELECT 1 AS id, named_struct("a", 1) AS s')
path = "/tmp/missing_col_test"
df_a.write.format("parquet").mode("overwrite").save(path)
df_b = spark.sql('SELECT 2 AS id, named_struct("b", 3) AS s')
spark.read.format("parquet").schema(df_b.schema).load(path).show()
```
This used to return:
```
+---+----+
| id| s|
+---+----+
| 1|NULL|
+---+----+
```
It now returns:
```
+---+------+
| id| s|
+---+------+
| 1|{NULL}|
+---+------+
```
### How was this patch tested?
Added new unit tests, also fixed an old test to expect this new behavior.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]