alamb commented on code in PR #9240:
URL: https://github.com/apache/arrow-datafusion/pull/9240#discussion_r1492528011
##########
datafusion/common/src/file_options/mod.rs:
##########
@@ -97,6 +97,20 @@ impl StatementOptions {
maybe_option.map(|(_, v)| v)
}
+ /// Finds the partition_by option if it exists and parses it into a `Vec<String>`.
+ /// If the option doesn't exist, returns an empty `vec![]`.
+ /// E.g. (partition_by 'colA, colB, colC') -> `vec!['colA','colB','colC']`
+ pub fn take_partition_by(&mut self) -> Vec<String> {
+ let partition_by = self.take_str_option("partition_by");
+ match partition_by {
+ Some(part_cols) => part_cols
+ .split(',')
+ .map(|s| s.trim().replace('\'', ""))
Review Comment:
Is it worth mentioning that this also removes all `'` from the column name? :thinking:
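For illustration, a minimal sketch of an alternative that strips only the surrounding quotes, so an interior `'` survives (a hypothetical free function, not the PR's code):
```rust
/// Hypothetical alternative to `replace('\'', "")`: trim only the quotes
/// that surround each column name, so a `'` embedded in the identifier
/// itself is preserved rather than silently dropped.
fn parse_partition_by(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(|s| s.trim().trim_matches('\'').to_string())
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    assert_eq!(
        parse_partition_by("colA, colB, colC"),
        vec!["colA", "colB", "colC"]
    );
    // Surrounding quotes are removed, but the interior quotes survive
    // (the PR's `replace` would turn this into "its").
    assert_eq!(parse_partition_by("'it''s'"), vec!["it''s"]);
}
```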
##########
datafusion/sqllogictest/test_files/copy.slt:
##########
@@ -25,6 +25,44 @@ COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, compressi
----
2
+# Copy to directory as partitioned files
+query IT
+COPY source_table TO 'test_files/scratch/copy/partitioned_table1/' (format parquet, compression 'zstd(10)', partition_by 'col2');
+----
+2
+
+# validate multiple partitioned parquet file output
+statement ok
+CREATE EXTERNAL TABLE validate_partitioned_parquet STORED AS PARQUET
+LOCATION 'test_files/scratch/copy/partitioned_table1/' PARTITIONED BY (col2);
+
+query I?
+select * from validate_partitioned_parquet order by col1, col2;
Review Comment:
Could you also add a test for reading out of one of the partitions? Something like
```sql
select * from 'test_files/scratch/copy/partitioned_table1/col2=Foo'
```
to demonstrate that the output was actually partitioned? I think the current test would pass even if the partition columns were ignored.
##########
datafusion/core/src/datasource/file_format/write/demux.rs:
##########
@@ -319,14 +319,20 @@ fn compute_partition_keys_by_row<'a>(
) -> Result<Vec<Vec<&'a str>>> {
let mut all_partition_values = vec![];
- for (col, dtype) in partition_by.iter() {
+ // For the purposes of writing partitioned data, we can rely on schema inference
Review Comment:
👍
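To illustrate the idea in the added comment, a simplified, self-contained sketch (not DataFusion's actual demux code) of deriving partition keys by casting a column to strings instead of matching on its declared DataType:
```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

/// Cast whatever type the batch contains to Utf8 and read the per-row
/// partition keys back as strings, rather than dispatching on DataType.
fn partition_keys(col: &ArrayRef) -> Result<Vec<String>> {
    let utf8 = cast(col.as_ref(), &DataType::Utf8)?;
    let utf8 = utf8
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("casting to Utf8 produces a StringArray");
    Ok((0..utf8.len()).map(|i| utf8.value(i).to_string()).collect())
}

fn main() -> Result<()> {
    let col: ArrayRef = Arc::new(Int32Array::from(vec![1, 1, 2]));
    // Each row's key becomes a hive-style directory value, e.g. `col=1/`.
    assert_eq!(partition_keys(&col)?, vec!["1", "1", "2"]);
    Ok(())
}
```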
##########
datafusion/proto/src/logical_plan/mod.rs:
##########
@@ -1641,6 +1642,7 @@ impl AsLogicalPlan for LogicalPlanNode {
output_url,
file_format,
copy_options,
+ partition_by: _,
Review Comment:
filed https://github.com/apache/arrow-datafusion/issues/9248
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -585,13 +586,20 @@ impl DefaultPhysicalPlanner {
CopyOptions::WriterOptions(writer_options) => *writer_options.clone()
};
+ // Note: the DataType passed here is ignored for the purposes of writing and inferred instead
Review Comment:
I agree this is a better UX -- thank you
##########
datafusion/core/src/dataframe/mod.rs:
##########
@@ -73,6 +73,9 @@ pub struct DataFrameWriteOptions {
/// Allows compression of CSV and JSON.
/// Not supported for parquet.
compression: CompressionTypeVariant,
+ /// Sets which columns should be used for hive-style partitioned writes by name.
+ /// Can be set to empty vec![] for non-partitioned writes.
+ partition_by: Vec<String>,
Review Comment:
I think we need a test for this new feature in `DataFrame::write_parquet`.
I took a look around and sadly didn't see any good existing tests. This is what I found:
https://github.com/apache/arrow-datafusion/blob/4d389c2590370d85bfe3af77f5243d5b40f5a222/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L2070
I'll make a short PR to move those tests into the dataframe tests so they are more discoverable.
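As a rough starting point, a sketch of what such a test might look like (it assumes the `with_partition_by` builder this PR adds, a tokio test runtime, and a placeholder output path; the third `write_parquet` argument should be adjusted to the final API):
```rust
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn write_parquet_with_partition_by() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .sql("SELECT * FROM (VALUES (1, 'Foo'), (2, 'Bar')) AS t(col1, col2)")
        .await?;

    // Hypothetical output path; a real test would use a tempdir.
    let out = "target/partitioned_write_test/";

    // Hive-style partitioning on col2, via the builder added in this PR.
    df.write_parquet(
        out,
        DataFrameWriteOptions::new().with_partition_by(vec!["col2".to_string()]),
        None, // default writer properties
    )
    .await?;

    // Reading one partition directory back shows the layout really is
    // partitioned, mirroring the sqllogictest suggestion above.
    let partition = ctx
        .read_parquet(format!("{out}col2=Foo/"), ParquetReadOptions::default())
        .await?;
    assert_eq!(partition.count().await?, 1);
    Ok(())
}
```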