[GitHub] [arrow-datafusion] theelderbeever opened a new issue, #7686: NDJsonExec doesn't properly apply predicates on partitioned tables.

via GitHub Thu, 28 Sep 2023 12:59:01 -0700


theelderbeever opened a new issue, #7686:
URL: https://github.com/apache/arrow-datafusion/issues/7686


   ### Describe the bug
   
   A SQL query against an NDJSON table with partition columns fails with the 
following error when the filter references any partition column. In this case 
my partition column is a timestamp, but the same failure occurs with other 
types as well.
   
   > ArrowError(JsonError("Encountered unmasked nulls in non-nullable 
StructArray child: Field { name: \"hourly_timestamp\", data_type: 
Timestamp(Second, None), nullable: false, dict_id: 0, dict_is_ordered: false, 
metadata: {} }"))
   
   The files are pruned correctly; however, the partition predicate is not 
populated in the plan. This is in contrast to ParquetExec, which adds an extra 
predicate for the partition column, as the two plans below show.
   
   > JsonExec: file_groups={1 group: 
[[Users/taylorbeever/git/theelderbeever/df-test/data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson]]},
 projection=[id, timestamp, value, hourly_timestamp]
   
   > ParquetExec: file_groups={1 group: 
[[Users/taylorbeever/git/theelderbeever/df-test/data/parquet/hourly_timestamp=2023-09-25T20:00:00/data.zstd.parquet]]},
 projection=[id, timestamp, value, hourly_timestamp], 
predicate=hourly_timestamp@3 = 1695672000
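
   For reference, the literal in the ParquetExec predicate is the partition 
value from the directory name expressed as seconds since the Unix epoch. The 
sketch below is standard-library Rust only and is not DataFusion's 
implementation: the helper names (`partition_value`, 
`timestamp_to_epoch_seconds`, `days_from_civil`) are hypothetical, the path is 
shortened from the plan above, and the timestamp is assumed to be UTC because 
the column type is `Timestamp(Second, None)`.

   ```rust
   /// Days since 1970-01-01 for a proleptic Gregorian date
   /// (the well-known "days from civil" algorithm; no range validation).
   fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
       // Treat January and February as the end of the previous year.
       let y = if month <= 2 { year - 1 } else { year };
       let era = y.div_euclid(400);
       let year_of_era = y.rem_euclid(400);
       let shifted_month = (month + 9) % 12; // March = 0
       let day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
       let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
       era * 146_097 + day_of_era - 719_468
   }

   /// Parses `YYYY-MM-DDTHH:MM:SS` (assumed UTC) into seconds since the epoch.
   fn timestamp_to_epoch_seconds(ts: &str) -> Option<i64> {
       let (date, time) = ts.split_once('T')?;
       let mut d = date.split('-').map(|p| p.parse::<i64>());
       let (year, month, day) = (d.next()?.ok()?, d.next()?.ok()?, d.next()?.ok()?);
       let mut t = time.split(':').map(|p| p.parse::<i64>());
       let (hour, minute, second) = (t.next()?.ok()?, t.next()?.ok()?, t.next()?.ok()?);
       Some(days_from_civil(year, month, day) * 86_400 + hour * 3_600 + minute * 60 + second)
   }

   /// Returns the value of the Hive-style `key=value` segment for `key`, if any.
   fn partition_value<'a>(path: &'a str, key: &str) -> Option<&'a str> {
       path.split('/')
           .filter_map(|segment| segment.split_once('='))
           .find(|(k, _)| *k == key)
           .map(|(_, v)| v)
   }

   fn main() {
       // Shortened, illustrative version of the path shown in the plan.
       let path = "data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson";
       let value = partition_value(path, "hourly_timestamp").expect("partition segment");
       let epoch = timestamp_to_epoch_seconds(value).expect("valid timestamp");
       println!("{value} -> {epoch}"); // prints "2023-09-25T20:00:00 -> 1695672000"
   }
   ```

   The result matches the `1695672000` literal in the ParquetExec predicate, 
so the partition value itself is being parsed as expected from the path.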
   
   Attempted workarounds, all of which fail:
   - Adding the partition columns to each JSON file.
   - Defining the schema for the table explicitly.
   - Using other data types for the partition column.
   
   ### To Reproduce
   
   I created an example repo 
[here](https://github.com/theelderbeever/datafusion-ndjson-issue).
   
   Example data is included in the repo. All code is contained in 
`src/main.rs`. The Parquet files contain the same data as the NDJSON files. 
The files as written do not contain a column for the partition column.
   
   To run:
   
   First, the Parquet version, which succeeds:
   ```console
   RUST_LOG=debug cargo run -- parquet
   ```
   
   Then the NDJSON version, which fails:
   ```console
   RUST_LOG=debug cargo run -- ndjson
   ```
   
   
   ### Expected behavior
   
   Partitioned table reads shouldn't fail when filtering on a partition column.
   
   Additionally, the default file_extension for `NDJsonReadOptions` is `.json`, 
which is a little misleading. It should be one of `.ndjson` or `.jsonl`.
   
   ### Additional context
   
   _No response_


