devinjdangelo opened a new issue, #9290:
URL: https://github.com/apache/arrow-datafusion/issues/9290

   ### Describe the bug
   
   All existing tests of CopyTo with partition_by partition on columns at 
the end of the schema. In general, however, a user can name any column in 
the partition_by option and expect it to work. Internally, the code 
assumes that the partition columns fall at the end of the schema, which causes 
the wrong columns to be removed from the data written to disk.
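
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not DataFusion's Rust code; the function names `drop_trailing` and `drop_by_name` are hypothetical, and the "buggy" variant is only an assumption about what "assumes the partition columns fall at the end of the schema" amounts to. It contrasts removing the last N columns with removing columns by name.

```python
# Hypothetical model of the problem; not DataFusion's actual implementation.
schema = ["column1", "column2", "column3"]
rows = [("1", "a", "x"), ("2", "b", "y"), ("3", "c", "z")]
partition_by = ["column1", "column3"]


def drop_trailing(schema, rows, partition_by):
    """Positional (assumed buggy) variant: treat the last N columns as the partition columns."""
    keep = len(schema) - len(partition_by)
    return schema[:keep], [row[:keep] for row in rows]


def drop_by_name(schema, rows, partition_by):
    """Position-independent variant: keep every column not named in partition_by."""
    keep = [i for i, name in enumerate(schema) if name not in partition_by]
    return [schema[i] for i in keep], [tuple(row[i] for i in keep) for row in rows]


buggy_schema, buggy_rows = drop_trailing(schema, rows, partition_by)
fixed_schema, fixed_rows = drop_by_name(schema, rows, partition_by)

print(buggy_schema, buggy_rows)  # column1 survives, column2 is lost
print(fixed_schema, fixed_rows)  # only column2 is written to the files
```

Under this model the positional variant writes column1's values into the files and discards column2, while the name-based variant keeps column2 no matter where the partition columns sit in the schema.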
   
   ### To Reproduce
   
   ```sql
   # Copy to directory as partitioned files
   COPY (values ('1', 'a', 'x'), ('2', 'b', 'y'), ('3', 'c', 'z')) TO 
'test_files/scratch/copy/partitioned_table3/' 
   (format parquet, compression 'zstd(10)', partition_by 'column1, column3');
   ----
   3
   
   # validate multiple partitioned parquet file output
   CREATE EXTERNAL TABLE validate_partitioned_parquet3 STORED AS PARQUET 
   LOCATION 'test_files/scratch/copy/partitioned_table3/' PARTITIONED BY 
(column1, column3);
   
   select column1, column2, column3 from validate_partitioned_parquet3 order by 
column1,column2,column3;
   ----
   1 1 x
   2 2 y
   3 3 z
   ```
   
   Note that partitioning on the first and third columns causes the second 
column to be dropped from the written data incorrectly: the query returns 
1, 2, 3 (the values of column1) for column2 instead of a, b, c.
   
   ### Expected behavior
   
   The data should be written to disk with the partition column values as 
directories and the remaining columns encoded in the files, regardless of where 
the partition columns appear in the original schema.
   
   ```sql
   # Copy to directory as partitioned files
   query TTT
   COPY (values ('1', 'a', 'x'), ('2', 'b', 'y'), ('3', 'c', 'z')) TO 
'test_files/scratch/copy/partitioned_table3/' 
   (format parquet, compression 'zstd(10)', partition_by 'column1, column3');
   ----
   3
   
   # validate multiple partitioned parquet file output
   statement ok
   CREATE EXTERNAL TABLE validate_partitioned_parquet3 STORED AS PARQUET 
   LOCATION 'test_files/scratch/copy/partitioned_table3/' PARTITIONED BY 
(column1, column3);
   
   query ?T?
   select column1, column2, column3 from validate_partitioned_parquet3 order by 
column1,column2,column3;
   ----
   1 a x
   2 b y
   3 c z
   
   statement ok
   CREATE EXTERNAL TABLE validate_partitioned_parquet_1_x STORED AS PARQUET 
   LOCATION 'test_files/scratch/copy/partitioned_table3/column1=1/column3=x';
   
   query T
   select * from validate_partitioned_parquet_1_x order by column2;
   ----
   a
   ```
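
The directory layout that the test above expects can be sketched as follows. This is an illustrative Python model under the assumption of Hive-style `name=value` directories, as implied by the `column1=1/column3=x` location in the test; `partition_path` and `layout` are hypothetical helpers, not DataFusion APIs.

```python
from collections import defaultdict

# Hypothetical helpers that model the expected on-disk layout.
schema = ["column1", "column2", "column3"]
rows = [("1", "a", "x"), ("2", "b", "y"), ("3", "c", "z")]
partition_by = ["column1", "column3"]
base = "test_files/scratch/copy/partitioned_table3/"


def partition_path(base, schema, row, partition_by):
    """Build the name=value directory path for one row, in partition_by order."""
    values = dict(zip(schema, row))
    parts = [f"{name}={values[name]}" for name in partition_by]
    return "/".join([base.rstrip("/")] + parts)


def layout(base, schema, rows, partition_by):
    """Map each partition directory to the non-partition values stored in its files."""
    keep = [i for i, name in enumerate(schema) if name not in partition_by]
    files = defaultdict(list)
    for row in rows:
        path = partition_path(base, schema, row, partition_by)
        files[path].append(tuple(row[i] for i in keep))
    return dict(files)


expected = layout(base, schema, rows, partition_by)
for path, contents in expected.items():
    print(path, contents)
```

In this model each of the three rows lands in its own directory, and the file under `column1=1/column3=x` holds only the column2 value `a`, which is what the final query in the test checks.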
   
   ### Additional context
   
   _No response_

