alamb commented on code in PR #9240:
URL: https://github.com/apache/arrow-datafusion/pull/9240#discussion_r1492528011
##########
datafusion/common/src/file_options/mod.rs:
##########
@@ -97,6 +97,20 @@ impl StatementOptions {
maybe_option.map(|(_, v)| v)
}
+ /// Finds the partition_by option if it exists and parses it into a `Vec<String>`.
+ /// If the option doesn't exist, returns an empty `vec![]`.
+ /// E.g. (partition_by 'colA, colB, colC') -> `vec!['colA','colB','colC']`
+ pub fn take_partition_by(&mut self) -> Vec<String> {
+ let partition_by = self.take_str_option("partition_by");
+ match partition_by {
+ Some(part_cols) => part_cols
+ .split(',')
+ .map(|s| s.trim().replace('\'', ""))
Review Comment:
Is it worth mentioning that this also removes all `'` from the column name? :thinking:
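For illustration, a minimal sketch of an alternative that strips only the surrounding quotes, so an interior `'` survives (a hypothetical free function, not the PR's code):
```rust
/// Hypothetical alternative to `replace('\'', "")`: trim only the quotes
/// that surround each column name, so a `'` embedded in the identifier
/// itself is preserved rather than silently dropped.
fn parse_partition_by(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(|s| s.trim().trim_matches('\'').to_string())
        .filter(|s| !s.is_empty())
        .collect()
}

fn main() {
    assert_eq!(
        parse_partition_by("colA, colB, colC"),
        vec!["colA", "colB", "colC"]
    );
    // Surrounding quotes are removed, but the interior quotes survive
    // (the PR's `replace` would turn this into "its").
    assert_eq!(parse_partition_by("'it''s'"), vec!["it''s"]);
}
```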
##########
datafusion/sqllogictest/test_files/copy.slt:
##########
@@ -25,6 +25,44 @@ COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, compressi
----
2
+# Copy to directory as partitioned files
+query IT
+COPY source_table TO 'test_files/scratch/copy/partitioned_table1/' (format parquet, compression 'zstd(10)', partition_by 'col2');
+----
+2
+
+# validate multiple partitioned parquet file output
+statement ok
+CREATE EXTERNAL TABLE validate_partitioned_parquet STORED AS PARQUET
+LOCATION 'test_files/scratch/copy/partitioned_table1/' PARTITIONED BY (col2);
+
+query I?
+select * from validate_partitioned_parquet order by col1, col2;
Review Comment:
Could you also add a test for reading out of one of the partitions? Something like
```sql
select * from 'test_files/scratch/copy/partitioned_table1/col2=Foo'
```
to demonstrate that the output was actually partitioned? I think the current test would pass even if the partition columns were ignored.
##########
datafusion/core/src/datasource/file_format/write/demux.rs:
##########
@@ -319,14 +319,20 @@ fn compute_partition_keys_by_row<'a>(
) -> Result<Vec<Vec<&'a str>>> {
let mut all_partition_values = vec![];
- for (col, dtype) in partition_by.iter() {
+ // For the purposes of writing partitioned data, we can rely on schema inference
Review Comment:
👍
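To illustrate the idea in the added comment, a simplified, self-contained sketch (not DataFusion's actual demux code) of deriving partition keys by casting a column to strings instead of matching on its declared DataType:
```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

/// Cast whatever type the batch contains to Utf8 and read the per-row
/// partition keys back as strings, rather than dispatching on DataType.
fn partition_keys(col: &ArrayRef) -> Result<Vec<String>> {
    let utf8 = cast(col.as_ref(), &DataType::Utf8)?;
    let utf8 = utf8
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("casting to Utf8 produces a StringArray");
    Ok((0..utf8.len()).map(|i| utf8.value(i).to_string()).collect())
}

fn main() -> Result<()> {
    let col: ArrayRef = Arc::new(Int32Array::from(vec![1, 1, 2]));
    // Each row's key becomes a hive-style directory value, e.g. `col=1/`.
    assert_eq!(partition_keys(&col)?, vec!["1", "1", "2"]);
    Ok(())
}
```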
##########
datafusion/proto/src/logical_plan/mod.rs:
##########
@@ -1641,6 +1642,7 @@ impl AsLogicalPlan for LogicalPlanNode {
output_url,
file_format,
copy_options,
+ partition_by: _,
Review Comment:
filed https://github.com/apache/arrow-datafusion/issues/9248
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -585,13 +586,20 @@ impl DefaultPhysicalPlanner {
CopyOptions::WriterOptions(writer_options) => *writer_options.clone()
};
+ // Note: the DataType passed here is ignored for the purposes of writing and inferred instead
Review Comment:
I agree this is a better UX -- thank you
##########
datafusion/core/src/dataframe/mod.rs:
##########
@@ -73,6 +73,9 @@ pub struct DataFrameWriteOptions {
/// Allows compression of CSV and JSON.
/// Not supported for parquet.
compression: CompressionTypeVariant,
+ /// Sets which columns should be used for hive-style partitioned writes by name.
+ /// Can be set to empty vec![] for non-partitioned writes.
+ partition_by: Vec<String>,
Review Comment:
I think we need a test for this new feature in `DataFrame::write_parquet`.
I took a look around and sadly didn't see any good existing tests. This is what I found:
https://github.com/apache/arrow-datafusion/blob/4d389c2590370d85bfe3af77f5243d5b40f5a222/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L2070
I'll make a short PR to move those tests into the dataframe tests so they are more discoverable.
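As a rough starting point, a sketch of what such a test might look like (it assumes the `with_partition_by` builder this PR adds, a tokio test runtime, and a placeholder output path; the third `write_parquet` argument should be adjusted to the final API):
```rust
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::test]
async fn write_parquet_with_partition_by() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .sql("SELECT * FROM (VALUES (1, 'Foo'), (2, 'Bar')) AS t(col1, col2)")
        .await?;

    // Hypothetical output path; a real test would use a tempdir.
    let out = "target/partitioned_write_test/";

    // Hive-style partitioning on col2, via the builder added in this PR.
    df.write_parquet(
        out,
        DataFrameWriteOptions::new().with_partition_by(vec!["col2".to_string()]),
        None, // default writer properties
    )
    .await?;

    // Reading one partition directory back shows the layout really is
    // partitioned, mirroring the sqllogictest suggestion above.
    let partition = ctx
        .read_parquet(format!("{out}col2=Foo/"), ParquetReadOptions::default())
        .await?;
    assert_eq!(partition.count().await?, 1);
    Ok(())
}
```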