devinjdangelo opened a new issue, #9290:
URL: https://github.com/apache/arrow-datafusion/issues/9290
### Describe the bug
All existing tests of CopyTo with partition_by partition on columns at
the end of the schema. In general, however, a user can name any column
in the partition_by option and expect it to work. Internally, the code
assumes that the partition columns fall at the end of the schema, so when they
do not, the wrong columns are removed from the data written to disk.
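
To make the failure mode concrete, here is a minimal sketch in Python. It is an illustration only, not DataFusion's actual code: the helper names `drop_trailing` and `drop_by_name` are hypothetical, and the sketch assumes only what is described above (the partition columns are presumed to be the trailing columns of the schema).

```python
# Hypothetical illustration of the bug; not DataFusion's implementation.
schema = ["column1", "column2", "column3"]
partition_by = ["column1", "column3"]


def drop_trailing(schema, partition_by):
    """Buggy approach: assume the partition columns are the last N columns."""
    return schema[: len(schema) - len(partition_by)]


def drop_by_name(schema, partition_by):
    """Position-independent approach: remove exactly the named columns."""
    excluded = set(partition_by)
    return [name for name in schema if name not in excluded]


buggy = drop_trailing(schema, partition_by)
fixed = drop_by_name(schema, partition_by)

print(buggy)  # ['column1']  (column2 was dropped by mistake)
print(fixed)  # ['column2']
```

With partition columns at the end of the schema both helpers agree, which is presumably why the existing tests did not catch the problem.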
### To Reproduce
```sql
# Copy to directory as partitioned files
COPY (values ('1', 'a', 'x'), ('2', 'b', 'y'), ('3', 'c', 'z')) TO
'test_files/scratch/copy/partitioned_table3/'
(format parquet, compression 'zstd(10)', partition_by 'column1, column3');
----
3
# validate multiple partitioned parquet file output
CREATE EXTERNAL TABLE validate_partitioned_parquet3 STORED AS PARQUET
LOCATION 'test_files/scratch/copy/partitioned_table3/' PARTITIONED BY
(column1, column3);
select column1, column2, column3 from validate_partitioned_parquet3 order by
column1,column2,column3;
----
1 1 x
2 2 y
3 3 z
```
Note that partitioning on the first and third columns causes the second
column to be incorrectly dropped from the written data.
### Expected behavior
The data should be written to disk with the partition column values encoded as
directory names and the remaining columns stored in the files, regardless of
where the partition columns appear in the original data.
```sql
# Copy to directory as partitioned files
query TTT
COPY (values ('1', 'a', 'x'), ('2', 'b', 'y'), ('3', 'c', 'z')) TO
'test_files/scratch/copy/partitioned_table3/'
(format parquet, compression 'zstd(10)', partition_by 'column1, column3');
----
3
# validate multiple partitioned parquet file output
statement ok
CREATE EXTERNAL TABLE validate_partitioned_parquet3 STORED AS PARQUET
LOCATION 'test_files/scratch/copy/partitioned_table3/' PARTITIONED BY
(column1, column3);
query ?T?
select column1, column2, column3 from validate_partitioned_parquet3 order by
column1,column2,column3;
----
1 a x
2 b y
3 c z
statement ok
CREATE EXTERNAL TABLE validate_partitioned_parquet_1_x STORED AS PARQUET
LOCATION 'test_files/scratch/copy/partitioned_table3/column1=1/column3=x';
query T
select * from validate_partitioned_parquet_1_x order by column2;
----
a
```
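
The expected on-disk layout can also be sketched outside of DataFusion. The following Python snippet is a rough model under stated assumptions: the function `partition_rows` is hypothetical, and the `name=value` directory naming is taken from the `column1=1/column3=x` path used in the test above. It groups rows by the partition column values and keeps only the remaining columns for each file, looking columns up by name rather than by position.

```python
from collections import defaultdict

# Same data as the COPY statement above.
columns = ["column1", "column2", "column3"]
rows = [("1", "a", "x"), ("2", "b", "y"), ("3", "c", "z")]
partition_by = ["column1", "column3"]


def partition_rows(columns, rows, partition_by):
    """Map each partition directory to the rows that belong in its files.

    Hypothetical helper: partition columns are resolved by name, so their
    position in the schema does not matter.
    """
    part_idx = [columns.index(name) for name in partition_by]
    keep_idx = [i for i, name in enumerate(columns) if name not in partition_by]
    groups = defaultdict(list)
    for row in rows:
        directory = "/".join(f"{columns[i]}={row[i]}" for i in part_idx)
        groups[directory].append(tuple(row[i] for i in keep_idx))
    return dict(groups)


layout = partition_rows(columns, rows, partition_by)
for directory, file_rows in layout.items():
    print(directory, file_rows)
```

Each directory then holds only `column2`, which matches the single value `a` expected from `column1=1/column3=x`.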