devinjdangelo commented on PR #9240: URL: https://github.com/apache/arrow-datafusion/pull/9240#issuecomment-1947588213
> Wondering as we write partitioned data then should also test the underlying file/folder structure, how the data was written to disk?

The copy.slt tests rely on the read path for partitioned tables to make sure the files were written out correctly:

```sql
# Copy to directory as partitioned files
query ITT
COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z'))
TO 'test_files/scratch/copy/partitioned_table2/'
(format parquet, compression 'zstd(10)', partition_by 'column2, column3');
----
3

# validate multiple partitioned parquet file output
statement ok
CREATE EXTERNAL TABLE validate_partitioned_parquet2
STORED AS PARQUET
LOCATION 'test_files/scratch/copy/partitioned_table2/'
PARTITIONED BY (column2, column3);

query I??
select * from validate_partitioned_parquet2 order by column1,column2,column3;
----
1 a x
2 b y
3 c z
```

The code currently doesn't do any checking or validation of existing directories or files before writing. DuckDB describes the options it provides for controlling this behavior here: https://duckdb.org/docs/data/partitioning/partitioned_writes.html
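For context, DuckDB gates writes into an existing directory behind an explicit `OVERWRITE_OR_IGNORE` flag rather than validating by default. Going from memory of the linked page, their syntax looks roughly like the sketch below (the `orders` table and `year`/`month` columns are illustrative, and the exact flag spelling should be checked against the DuckDB docs):

```sql
-- DuckDB-style partitioned write that permits writing into an
-- existing (possibly non-empty) target directory
COPY orders TO 'orders'
(FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE 1);
```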
