alamb opened a new issue, #7892:
URL: https://github.com/apache/arrow-datafusion/issues/7892

   ### Describe the bug
   
   Writing to a partitioned table uses the wrong column as the partition key.
   
   ### To Reproduce
   
   
   ```sql
   ❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition_id) location '/tmp/test' options (create_local_path 'true');
   Arrow error: Schema error: Unable to get field named "partition_id". Valid fields: ["partition", "trace_id"]
   ❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition) location '/tmp/test' options (create_local_path 'true');
   0 rows in set. Query took 0.001 seconds.

   ❯ insert into test select * from 'input.parquet';
   Object Store error: Generic LocalFileSystem error: Unable to open file /private/tmp/test/partition=00000000000000003088e9e74cf166bd/QZSOEmePoAsQkvzU.parquet#1: Too many open files (os error 24)
   ❯
   ```
   
   It looks like it used the wrong column as the partition key:
   
   
   ```shell
   alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | wc -l
      19992
   
   alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | head
   partition=0000000000000000000102576ce2faea/
   partition=00000000000000000004d8eb49424f0d/
   partition=000000000000000000051e89839e2bb0/
   partition=00000000000000000005406b87ebcb41/
   partition=000000000000000000065226d41eba99/
   partition=000000000000000000066e556bea9c68/
   partition=0000000000000000000688030531c6ff/
   partition=000000000000000000068839bb143b45/
   partition=00000000000000000006aaa220390696/
   partition=00000000000000000006b0ebabd460a3/
   ```
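   The directory names look like `trace_id` values rather than `partition` values. As a spot check (a sketch, not verified against this exact dataset), one of the directory names listed above should show up as a `trace_id` in the input:
   
   ```sql
   -- Spot check (sketch): if the writer partitioned on trace_id by mistake,
   -- a directory name from the listing above should match a trace_id in the input
   ❯ select count(*) from 'input.parquet' where trace_id = '0000000000000000000102576ce2faea';
   ```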
   
   
   Here is the input: [`input.parquet`](https://drive.google.com/file/d/1Hfld5jbKsSfp7vVTSBNIcV_KuvWIxc5D/view?usp=sharing)
   
   Here is how I made it:
   
   ```sql
   
   ❯ copy (select 'a' as "partition", trace_id from traces UNION ALL select 'b' as "partition", trace_id from traces UNION ALL select 'c' as "partition", trace_id from traces) to 'input.parquet';
   +----------+
   | count    |
   +----------+
   | 15557151 |
   +----------+
   1 row in set. Query took 3.639 seconds.
   
   ❯ select * from 'input.parquet' limit 1;
   +-----------+----------------------------------+
   | partition | trace_id                         |
   +-----------+----------------------------------+
   | b         | 000000000000000028bf4438cad62275 |
   +-----------+----------------------------------+
   1 row in set. Query took 0.009 seconds.
   
   ❯ describe 'input.parquet';
   +-------------+-------------------------+-------------+
   | column_name | data_type               | is_nullable |
   +-------------+-------------------------+-------------+
   | partition   | Utf8                    | NO          |
   | trace_id    | Dictionary(Int32, Utf8) | YES         |
   +-------------+-------------------------+-------------+
   2 rows in set. Query took 0.001 seconds.
   ```
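   As a sanity check on the input (a sketch; this is what the `copy` above should produce, output not pasted here), there should be exactly three distinct values of `partition`:
   
   ```sql
   -- Sanity check (sketch): the input should contain only three partition values
   ❯ select "partition", count(*) from 'input.parquet' group by "partition" order by "partition";
   -- expected: one row each for 'a', 'b', and 'c'
   ```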
   
   
   ### Expected behavior
   
   I expect three output directories to be created, one for each of the distinct values of `partition_id` (`a`, `b`, and `c`):
   
   ```
   /tmp/test/partition_id=a/<uuid>.parquet
   /tmp/test/partition_id=b/<uuid>.parquet
   /tmp/test/partition_id=c/<uuid>.parquet
   ```
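   With that layout, a filter on the partition column should only need to read the files under a single directory (a sketch, using the column name `partition` from the working `create external table` above; exact plan output will vary):
   
   ```sql
   -- Sketch: a filter on the partition column should prune to one directory
   ❯ select count(*) from test where "partition" = 'a';
   ```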
   
   ### Additional context
   
   Found as part of https://github.com/apache/arrow-datafusion/pull/7801#pullrequestreview-1687734166
   
   @devinjdangelo did some initial investigation: https://github.com/apache/arrow-datafusion/pull/7801#issuecomment-1771867480

