Re: [PR] Support Copy To Partitioned Files [arrow-datafusion]

via GitHub Thu, 15 Feb 2024 07:08:05 -0800


devinjdangelo commented on code in PR #9240:
URL: https://github.com/apache/arrow-datafusion/pull/9240#discussion_r1491159044



##########
datafusion/proto/src/logical_plan/mod.rs:
##########
@@ -1641,6 +1642,7 @@ impl AsLogicalPlan for LogicalPlanNode {
                 output_url,
                 file_format,
                 copy_options,
+                partition_by: _,

Review Comment:
   Note that I did not add support for partition_by in proto. We should add a 
follow up ticket for this. 
   
   I don't believe this PR will break downstream systems like Ballista's 
handling of COPY, but it will silently ignore partition_by options until prost 
is updated.



##########
datafusion/core/src/physical_planner.rs:
##########
@@ -585,13 +586,20 @@ impl DefaultPhysicalPlanner {
                         CopyOptions::WriterOptions(writer_options) => 
*writer_options.clone()
                     };
 
+                    // Note: the DataType passed here is ignored for the 
purposes of writing and inferred instead

Review Comment:
   The read path needs an explicit DataType defined for the partition cols so 
it knows what to cast to, but I realized that the write path can just infer the 
correct DataType from the RecordBatch schema.
   
   This allows COPY to only specify partition columns by name and not have to 
worry about specifying the correct data type.



##########
datafusion/sqllogictest/test_files/copy.slt:
##########
@@ -25,6 +25,44 @@ COPY source_table TO 'test_files/scratch/copy/table/' 
(format parquet, compressi
 ----
 2
 
+# Copy to directory as partitioned files
+query IT
+COPY source_table TO 'test_files/scratch/copy/partitioned_table1/' (format 
parquet, compression 'zstd(10)', partition_by 'col2');
+----
+2
+
+# validate multiple partitioned parquet file output
+statement ok
+CREATE EXTERNAL TABLE validate_partitioned_parquet STORED AS PARQUET 
+LOCATION 'test_files/scratch/copy/partitioned_table1/' PARTITIONED BY (col2);
+
+query I?
+select * from validate_partitioned_parquet;
+----
+2 Bar
+1 Foo
+
+# Copy to directory as partitioned files
+query ITT
+COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 
'test_files/scratch/copy/partitioned_table2/' 
+(format parquet, compression 'zstd(10)', partition_by 'column2, column3');

Review Comment:
   Anonymous columns get the name "columnX" based on their order in the VALUES 
clause. It would be nice to document this somewhere, though I did make sure it 
is relatively easy to discover this based on the error message if you get a 
column name wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Support Copy To Partitioned Files [arrow-datafusion]

Reply via email to