theelderbeever opened a new issue, #7686:
URL: https://github.com/apache/arrow-datafusion/issues/7686
### Describe the bug
Running a SQL query against an NDJSON table with partition columns fails
when filtering on any of the partition columns, with the error below. In
this case my partition column is a timestamp, but the same failure occurs with
other types as well.
> ArrowError(JsonError("Encountered unmasked nulls in non-nullable
StructArray child: Field { name: \"hourly_timestamp\", data_type:
Timestamp(Second, None), nullable: false, dict_id: 0, dict_is_ordered: false,
metadata: {} }"))
It prunes the files correctly; however, it doesn't populate the partition
predicate. This is in contrast to `ParquetExec`, which adds an extra
predicate to populate the partition column.
> JsonExec: file_groups={1 group:
[[Users/taylorbeever/git/theelderbeever/df-test/data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson]]},
projection=[id, timestamp, value, hourly_timestamp]
> ParquetExec: file_groups={1 group:
[[Users/taylorbeever/git/theelderbeever/df-test/data/parquet/hourly_timestamp=2023-09-25T20:00:00/data.zstd.parquet]]},
projection=[id, timestamp, value, hourly_timestamp],
predicate=hourly_timestamp@3 = 1695672000
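The value in the Parquet predicate can be cross-checked against the partition directory name. Below is a minimal sketch using only the Rust standard library. It assumes the directory value is interpreted as UTC. The helpers `partition_values`, `days_from_civil`, and `epoch_seconds` are hypothetical names for illustration, not DataFusion APIs, and the path is a shortened version of the one in the plans above.

```rust
/// Extract Hive-style `key=value` pairs from the components of a path.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Days since 1970-01-01 for a Gregorian date (the well-known days-from-civil algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y }; // treat Jan/Feb as part of the previous year
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse `YYYY-MM-DDTHH:MM:SS` (assumed UTC) into seconds since the Unix epoch.
fn epoch_seconds(ts: &str) -> Option<i64> {
    let (date, time) = ts.split_once('T')?;
    let d: Vec<i64> = date.split('-').map(|p| p.parse::<i64>().ok()).collect::<Option<_>>()?;
    let t: Vec<i64> = time.split(':').map(|p| p.parse::<i64>().ok()).collect::<Option<_>>()?;
    if d.len() != 3 || t.len() != 3 {
        return None;
    }
    Some(days_from_civil(d[0], d[1], d[2]) * 86_400 + t[0] * 3_600 + t[1] * 60 + t[2])
}

fn main() {
    let path = "data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson";
    for (key, value) in partition_values(path) {
        // prints: hourly_timestamp = 2023-09-25T20:00:00 -> Some(1695672000)
        println!("{key} = {value} -> {:?}", epoch_seconds(&value));
    }
}
```

The result matches the `1695672000` literal in the `ParquetExec` predicate, so the partition value itself seems to be derived from the path as expected. The difference between the two plans is that `JsonExec` has no such predicate.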
Attempted solutions - all fail:
- Add partition columns to each json file.
- Define the Schema for the table
- Other datatypes for partition
### To Reproduce
I created an example repo
[here](https://github.com/theelderbeever/datafusion-ndjson-issue).
Example data is included in the repo. All code is contained in
`src/main.rs`. The Parquet files contain the same data as the NDJSON files. As
written, they do not contain a column for the partition column.
To run:
First, the Parquet one, which succeeds:
```console
RUST_LOG=debug cargo run -- parquet
```
Then the NDJSON one, which fails:
```console
RUST_LOG=debug cargo run -- ndjson
```
### Expected behavior
Partitioned table reads shouldn't fail when filtering on a partition column.
Additionally, the default `file_extension` for `NDJsonReadOptions` is `.json`,
which is a little misleading. It should be one of `.ndjson` or `.jsonl`.
### Additional context
_No response_