ZiyaZa opened a new pull request, #52557:
URL: https://github.com/apache/spark/pull/52557

   ### What changes were proposed in this pull request?
   
   Currently, if all fields of a struct mentioned in the read schema are 
missing from a Parquet file, the reader populates the struct with nulls.
   
   This PR modifies the scan behavior so that if the struct exists in the 
Parquet schema but none of the fields from the read schema are present, we 
instead pick an arbitrary field from the Parquet file to read and use its 
definition levels to populate struct nulls (as well as outer nulls and array 
sizes when the struct is nested inside another nested type).
   
   This is done by changing the schema requested by the readers. We add an 
additional field to the requested schema when clipping the Parquet file schema 
according to the Spark schema. This means that the readers actually read and 
return more data than requested. That is only a problem for the 
`VectorizedParquetRecordReader`: for the other read code path via parquet-mr, 
`ParquetFileFormat` already applies an `UnsafeProjection` that outputs only the 
requested schema fields.
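
   Conceptually, the clipping change amounts to something like the following 
Python sketch. This is an illustration only, not the actual implementation 
(which lives in Spark's Parquet schema clipping code); `clip_struct` and 
`pick_fallback` are hypothetical names:
   
   ```python
   def clip_struct(file_fields, requested_fields, pick_fallback):
       """Keep the requested struct fields; if none of them exist in the
       file, also request one arbitrary file field so the reader can still
       recover the struct's definition levels."""
       file_names = {f.name for f in file_fields}
       clipped = list(requested_fields)
       if file_fields and not any(f.name in file_names for f in requested_fields):
           # No requested field is present in the file: read a fallback field.
           clipped.append(pick_fallback(file_fields))
       return clipped
   ```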
   
   To ensure `VectorizedParquetRecordReader` returns only the fields Spark 
requested, we create the `ColumnarBatch` with vectors that match the requested 
schema (we drop the additional fields by recursively matching `sparkSchema` 
with `sparkRequestedSchema` and ensuring structs have the same length in both; 
see the sketch below). The `ParquetColumnVector`s are then responsible for 
allocating dummy vectors that hold the extra data temporarily while reading, 
but these are never exposed to the outside.
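
   A rough sketch of the recursive matching in Python, phrased as schema-level 
trimming (the actual implementation operates on column vectors in Java; 
`trim_to_requested` is a hypothetical name):
   
   ```python
   from pyspark.sql.types import ArrayType, DataType, StructField, StructType
   
   def trim_to_requested(full: DataType, requested: DataType) -> DataType:
       """Recursively drop fields that are in `full` but not in `requested`,
       so that every struct has the same length in both schemas."""
       if isinstance(full, StructType) and isinstance(requested, StructType):
           by_name = {f.name: f for f in full.fields}
           kept = []
           for req in requested.fields:
               src = by_name[req.name]  # assumed present in this sketch
               kept.append(StructField(src.name,
                                       trim_to_requested(src.dataType, req.dataType),
                                       src.nullable))
           return StructType(kept)
       if isinstance(full, ArrayType) and isinstance(requested, ArrayType):
           return ArrayType(trim_to_requested(full.elementType, requested.elementType),
                            full.containsNull)
       return full
   ```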
   
   The heuristic for picking the arbitrary field is as follows: we pick a 
field at the lowest array nesting level (i.e., any scalar field is preferred 
to an `array`, which is preferred to an `array<array>`), and among scalar 
fields we prefer narrower types over wider ones, with strings ranked last.
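
   As an illustration, the preference order could be encoded like the Python 
sketch below. The ranking table is an assumption made for the example; the 
actual heuristic is implemented in the reader's schema clipping code:
   
   ```python
   from pyspark.sql.types import (ArrayType, BooleanType, ByteType, DataType,
                                  DoubleType, FloatType, IntegerType, LongType,
                                  ShortType, StringType)
   
   # Assumed width ranking: narrower scalars first, strings last.
   _WIDTH_RANK = {BooleanType: 0, ByteType: 0, ShortType: 1, IntegerType: 2,
                  FloatType: 2, LongType: 3, DoubleType: 3, StringType: 9}
   
   def _array_depth(dt: DataType) -> int:
       return 1 + _array_depth(dt.elementType) if isinstance(dt, ArrayType) else 0
   
   def _leaf_type(dt: DataType) -> DataType:
       return _leaf_type(dt.elementType) if isinstance(dt, ArrayType) else dt
   
   def pick_fallback(candidates):
       # Lowest array nesting level wins; ties are broken by scalar width.
       return min(candidates,
                  key=lambda f: (_array_depth(f.dataType),
                                 _WIDTH_RANK.get(type(_leaf_type(f.dataType)), 9)))
   ```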
   
   ### Why are the changes needed?
   
   This is a bug fix: depending on which fields were requested, we incorrectly 
treated non-null struct values as missing from the file and returned null for 
them.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. We previously returned a null struct whenever all the fields we were 
trying to read from a Parquet file were missing from that file, even if the 
file contained other fields whose definition levels could be used to determine 
that the struct value is present. See the example from the Jira ticket below:
   
   ```python
   df_a = spark.sql('SELECT 1 as id, named_struct("a", 1) AS s')
   path = "/tmp/missing_col_test"
   df_a.write.format("parquet").save(path)
   
   df_b = spark.sql('SELECT 2 as id, named_struct("b", 3) AS s')
   spark.read.format("parquet").schema(df_b.schema).load(path).show()
   ```
   
   This used to return:
   
   ```
   +---+----+
   | id|   s|
   +---+----+
   |  1|NULL|
   +---+----+
   ```
   
   It now returns:
   
   ```
   +---+------+
   | id|     s|
   +---+------+
   |  1|{NULL}|
   +---+------+
   ```
   
   ### How was this patch tested?
   
   Added new unit tests and updated an existing test to expect the new 
behavior.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.

