cdelmonte-zg opened a new issue, #9844:
URL: https://github.com/apache/arrow-rs/issues/9844
This does not appear to be a duplicate of #8776 / #8777. Those addressed
`LogicalType::_Unknown { field_id }`, i.e. the forward-compatibility variant
used when the reader encounters an unknown logical-type union value.
This issue is about `LogicalType::Unknown`, the canonical Parquet `UNKNOWN`
logical type. In arrow-rs these are separate enum variants in
`parquet/src/basic.rs`.
## To Reproduce
The Delta Acceptance Tests contain expected-output Parquet files for `void`
/ `NullType` workloads. These files appear to contain columns encoded as
`BOOLEAN` physical type annotated with `LogicalType::Unknown`.
Reading such a file with arrow-rs fails during schema validation:
```rust
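use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;
let f = File::open("spark_void.parquet").unwrap();
let reader = ParquetRecordBatchReaderBuilder::try_new(f)?; // fails here, during schema validation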
```
This returns:
```text
General("Cannot annotate Unknown from BOOLEAN for field 'void_col'")
```
A fully programmatic in-memory repro is awkward because the type builder
rejects this physical/logical type combination at write time. The issue is
therefore only visible when reading a file produced by an external writer.
Example files are available in the [Delta Acceptance Tests
repository](https://github.com/delta-io/dat) under `void_NNN_*` workloads, in
`expected_data/`.
## Expected behavior
My understanding is that `LogicalType::Unknown` represents a column whose
values are always null. If so, the Parquet reader should probably accept this
annotation independently of the physical type and expose the column as
`DataType::Null`.
Please correct me if this interpretation of `UNKNOWN` is wrong.
## Possible fix direction
A fix may need to:
1. Broaden the validator in `parquet/src/schema/types.rs` so that
`LogicalType::Unknown` is not restricted to `INT32`.
2. Ensure `parquet/src/arrow/schema/primitive.rs` maps
`LogicalType::Unknown` to `DataType::Null` for all primitive physical types.
`from_int32` already appears to handle this; the other primitive type mappers
do not.
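
To make the two steps above concrete, here is a rough, self-contained sketch. The enums and the functions `validate_current`, `validate_proposed`, and `to_arrow` are simplified stand-ins made up for illustration. They are not the real types or code in `parquet/src/schema/types.rs` or `parquet/src/arrow/schema/primitive.rs`, and the actual fix may look quite different.

```rust
// Simplified stand-ins for illustration only; these are NOT the parquet crate's real types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PhysicalType {
    Boolean,
    Int32,
    ByteArray,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    Unknown,
    String,
}

#[derive(Debug, PartialEq)]
enum DataType {
    Null,
    Boolean,
    Int32,
    Binary,
    Utf8,
}

/// Models the behavior reported in this issue: UNKNOWN is accepted only on INT32.
fn validate_current(logical: LogicalType, physical: PhysicalType) -> Result<(), String> {
    match (logical, physical) {
        (LogicalType::Unknown, PhysicalType::Int32) => Ok(()),
        (LogicalType::String, PhysicalType::ByteArray) => Ok(()),
        _ => Err(format!("cannot annotate {:?} from {:?}", logical, physical)),
    }
}

/// Step 1 (hypothetical): accept UNKNOWN on any physical type,
/// and fall back to the existing rules for every other annotation.
fn validate_proposed(logical: LogicalType, physical: PhysicalType) -> Result<(), String> {
    match logical {
        LogicalType::Unknown => Ok(()),
        _ => validate_current(logical, physical),
    }
}

/// Step 2 (hypothetical): check for UNKNOWN before dispatching on the physical type,
/// so every primitive mapper yields a null column for it.
fn to_arrow(physical: PhysicalType, logical: Option<LogicalType>) -> DataType {
    match (logical, physical) {
        (Some(LogicalType::Unknown), _) => DataType::Null,
        (Some(LogicalType::String), PhysicalType::ByteArray) => DataType::Utf8,
        (_, PhysicalType::Boolean) => DataType::Boolean,
        (_, PhysicalType::Int32) => DataType::Int32,
        (_, PhysicalType::ByteArray) => DataType::Binary,
    }
}

fn main() {
    // The BOOLEAN + UNKNOWN combination found in the Spark-written files.
    let (logical, physical) = (LogicalType::Unknown, PhysicalType::Boolean);
    println!("current:  {:?}", validate_current(logical, physical));
    println!("proposed: {:?}", validate_proposed(logical, physical));
    println!("arrow:    {:?}", to_arrow(physical, Some(logical)));
}
```

In this toy model, handling `Unknown` in one place ahead of the per-physical-type logic avoids repeating the same arm in each mapper. Whether that fits the structure of the real code is for the maintainers to judge.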
## Versions
- `parquet = "58.1.0"`
- Also reproduced with `parquet = "57.3.0"`
- Also reproduced on current `main`
## Component(s)
Parquet
## Context
This blocks `delta-io/delta-kernel-rs#1858` from running the Delta
Acceptance Tests for `void` / `NullType` columns end-to-end.
The Delta tables themselves load fine because Delta does not materialize
`void` columns in data Parquet files. However, the Delta Acceptance Tests
harness reads the expected-output Parquet files written by Spark, and those
files can contain `BOOLEAN + UNKNOWN` columns.