[I] Error When Querying Partitioned JSON Table [arrow-datafusion]

via GitHub Thu, 12 Oct 2023 14:50:22 -0700


devinjdangelo opened a new issue, #7816:
URL: https://github.com/apache/arrow-datafusion/issues/7816


   ### Describe the bug
   
   I am seeing the following error when attempting to query a table of hive 
style partitioned JSON files via `datafusion-cli`:
   
   `Arrow error: Json error: Encountered unmasked nulls in non-nullable 
StructArray child: Field { name: "a", data_type: Utf8, nullable: false, 
dict_id: 0, dict_is_ordered: false, metadata: {} }`
   
   ### To Reproduce
   
   I have a table with the following directory structure:
   
   ```bash
   dev@dev:~/arrow-datafusion/test_table$ ls -lhR
   .:
   total 12K
   drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=2'
   drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=4'
   drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=6'
   
   './a=2':
   total 4.0K
   -rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json
   
   './a=4':
   total 4.0K
   -rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json
   
   './a=6':
   total 4.0K
   -rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json
   ```
   
   And the JSON files look like:
   ```bash
   dev@dev:~/arrow-datafusion/test_table$ cat a\=2/tn0Sfag4abaDm6i2.json 
   {"b":"1"}
   {"b":"1"}
   ```
   
   Attempting to query like the following fails:
   ```bash
   dev@dev:~/arrow-datafusion$ datafusion-cli
   DataFusion CLI v32.0.0
   ❯ CREATE EXTERNAL TABLE
   json_test(a string, b string)
   STORED AS json
   LOCATION './test_table'
   PARTITIONED BY (a);
   0 rows in set. Query took 0.001 seconds.
   
   ❯ select * from json_test;
   Arrow error: Json error: Encountered unmasked nulls in non-nullable 
StructArray child: Field { name: "a", data_type: Utf8, nullable: false, 
dict_id: 0, dict_is_ordered: false, metadata: {} }
   ❯ 
   ```
   
   Querying without the partitions defined works as expected:
   ```bash
   dev@dev:~/arrow-datafusion$ datafusion-cli
   DataFusion CLI v32.0.0
   ❯ CREATE EXTERNAL TABLE
   json_test(b string)
   STORED AS json
   LOCATION './test_table';
   0 rows in set. Query took 0.001 seconds.
   
   ❯ select * from json_test;
   +---+
   | b |
   +---+
   | 3 |
   | 3 |
   | 1 |
   | 1 |
   | 5 |
   | 5 |
   +---+
   6 rows in set. Query took 0.002 seconds.
   ```
   
   The exact same table structure DDL works for CSV and parquet files, but not 
JSON.
   
   ### Expected behavior
   
   The above json query should work.
   
   ### Additional context
   
   I discovered this while working on 
https://github.com/apache/arrow-datafusion/pull/7801/files#diff-0580d65ff5db0c78c1fa4cf693f2567e7d2394923412687560614836401c223f,
 and there are additional relevant tests in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Error When Querying Partitioned JSON Table [arrow-datafusion]

Reply via email to