alamb opened a new issue, #7892:
URL: https://github.com/apache/arrow-datafusion/issues/7892
### Describe the bug
Writing to a partitioned table uses the wrong column as the partition key.
### To Reproduce
```sql
❯ create external table test(partition varchar, trace_id varchar) stored as
parquet partitioned by (partition_id) location '/tmp/test' options
(create_local_path 'true');
Arrow error: Schema error: Unable to get field named "partition_id". Valid
fields: ["partition", "trace_id"]
❯ create external table test(partition varchar, trace_id varchar) stored as
parquet partitioned by (partition) location '/tmp/test' options
(create_local_path 'true');
0 rows in set. Query took 0.001 seconds.
❯ insert into test select * from 'input.parquet';
Object Store error: Generic LocalFileSystem error: Unable to open file
/private/tmp/test/partition=00000000000000003088e9e74cf166bd/QZSOEmePoAsQkvzU.parquet#1:
Too many open files (os error 24)
❯
```
It looks like the write used the values of `trace_id` as the partition key rather than the values of `partition` (see the check sketched after the directory listing):
```shell
alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | wc -l
19992
alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | head
partition=0000000000000000000102576ce2faea/
partition=00000000000000000004d8eb49424f0d/
partition=000000000000000000051e89839e2bb0/
partition=00000000000000000005406b87ebcb41/
partition=000000000000000000065226d41eba99/
partition=000000000000000000066e556bea9c68/
partition=0000000000000000000688030531c6ff/
partition=000000000000000000068839bb143b45/
partition=00000000000000000006aaa220390696/
partition=00000000000000000006b0ebabd460a3/
```
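A quick way to sanity-check that theory (sketch only, the counts are not re-run for this report): the number of directories should track the distinct `trace_id` values in the input rather than its 3 distinct `partition` values.
```sql
-- compare distinct value counts per column of the input;
-- "partition" is quoted because it is used as a plain identifier here
select count(distinct "partition") as partitions,
       count(distinct trace_id)    as trace_ids
from 'input.parquet';
```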
Here is the input:
[`input.parquet`](https://drive.google.com/file/d/1Hfld5jbKsSfp7vVTSBNIcV_KuvWIxc5D/view?usp=sharing)
Here is how I made it:
```sql
❯ copy (select 'a' as "partition", trace_id from traces UNION ALL select 'b'
as "partition", trace_id from traces UNION ALL select 'c' as "partition",
trace_id from traces) to 'input.parquet';
+----------+
| count |
+----------+
| 15557151 |
+----------+
1 row in set. Query took 3.639 seconds.
❯ select * from 'input.parquet' limit 1;
+-----------+----------------------------------+
| partition | trace_id |
+-----------+----------------------------------+
| b | 000000000000000028bf4438cad62275 |
+-----------+----------------------------------+
1 row in set. Query took 0.009 seconds.
❯ describe 'input.parquet';
+-------------+-------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------+-------------------------+-------------+
| partition | Utf8 | NO |
| trace_id | Dictionary(Int32, Utf8) | YES |
+-------------+-------------------------+-------------+
2 rows in set. Query took 0.001 seconds.
```
### Expected behavior
I expect three output directories to be created, one for each of the distinct
values of `partition` (`a`, `b`, and `c`):
```
/tmp/test/partition=a/<uuid>.parquet
/tmp/test/partition=b/<uuid>.parquet
/tmp/test/partition=c/<uuid>.parquet
```
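For a reproduction that does not need the large `input.parquet`, something like the following should exercise the same write path (untested sketch; the table name `test_small` and location `/tmp/test_small` are placeholders):
```sql
create external table test_small(partition varchar, trace_id varchar)
  stored as parquet partitioned by (partition)
  location '/tmp/test_small' options (create_local_path 'true');

-- three rows, three distinct partition values
insert into test_small values
  ('a', '000000000000000028bf4438cad62275'),
  ('b', '000000000000000028bf4438cad62275'),
  ('c', '000000000000000028bf4438cad62275');

-- expected: partition=a/, partition=b/, partition=c/ under /tmp/test_small
```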
### Additional context
Found as part of
https://github.com/apache/arrow-datafusion/pull/7801#pullrequestreview-1687734166
@devinjdangelo did some initial investigation:
https://github.com/apache/arrow-datafusion/pull/7801#issuecomment-1771867480