Jefffrey commented on code in PR #19931:
URL: https://github.com/apache/datafusion/pull/19931#discussion_r2726318905
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -549,8 +549,30 @@ impl DefaultPhysicalPlanner {
}
};
+ // Parse single_file_output option if explicitly set
+ let file_output_mode = match source_option_tuples
Review Comment:
I wonder if it would be better to push this enum into `single_file_output`
itself as well instead of having this parse from an `Option<bool>` logic 🤔
##########
datafusion/proto/proto/datafusion.proto:
##########
@@ -771,6 +771,11 @@ message FileSinkConfig {
bool keep_partition_by_columns = 9;
InsertOp insert_op = 10;
string file_extension = 11;
+ // Optional override for single file output behavior.
+ // When not set, uses extension heuristic (path with extension = single
file).
+ // When set to true, forces single file output at exact path.
+ // When set to false, forces directory output with generated filenames.
+ optional bool single_file_output = 12;
Review Comment:
Might be better to represent this as an actual enum to more align with the
source 🤔
##########
datafusion-examples/examples/data_io/parquet_encrypted.rs:
##########
@@ -55,7 +55,7 @@ pub async fn parquet_encrypted() ->
datafusion::common::Result<()> {
// Create a temporary file location for the encrypted parquet file
let tmp_source = TempDir::new()?;
- let tempfile = tmp_source.path().join("cars_encrypted");
+ let tempfile = tmp_source.path().join("cars_encrypted.parquet");
Review Comment:
Actually why is this change needed for the example? Is it fixing some
behaviour made by this change or just clarifying the example?
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -549,8 +549,30 @@ impl DefaultPhysicalPlanner {
}
};
+ // Parse single_file_output option if explicitly set
+ let file_output_mode = match source_option_tuples
+ .get("single_file_output")
+ .map(|v| v.trim())
+ {
+ None => FileOutputMode::Automatic,
+ Some("true") => FileOutputMode::SingleFile,
+ Some("false") => FileOutputMode::Directory,
+ Some(value) => {
+ return Err(DataFusionError::Configuration(format!(
+ "provided value for 'single_file_output' was not
recognized: \"{value}\""
+ )));
+ }
+ };
+
+ // Filter out sink-related options that are not format options
+ let format_options: HashMap<String, String> =
source_option_tuples
Review Comment:
What's the need for filtering the option here?
##########
datafusion/proto/tests/cases/roundtrip_physical_plan.rs:
##########
@@ -1475,6 +1475,7 @@ fn roundtrip_json_sink() -> Result<()> {
insert_op: InsertOp::Overwrite,
keep_partition_by_columns: true,
file_extension: "json".into(),
+ file_output_mode: FileOutputMode::Automatic,
Review Comment:
Can we use different modes for each of the tests for coverage
##########
docs/source/library-user-guide/upgrading.md:
##########
@@ -21,7 +21,37 @@
## DataFusion `53.0.0`
-**Note:** DataFusion `53.0.0` has not been released yet. The information
provided in this section pertains to features and changes that have already
been merged to the main branch and are awaiting release in this version.
+**Note:** DataFusion `53.0.0` has not been released yet. The information
provided
+*in this section pertains to features and changes that have already been merged
+*to the main branch and are awaiting release in this version. See [#19692] for
+\*more details.
+
+[#19692]: https://github.com/apache/datafusion/issues/19692
+
+### `FileSinkConfig` adds `single_file_output`
+
+`FileSinkConfig` now includes a `single_file_output: Option<bool>` field to
override the
Review Comment:
Reminder to update this
##########
datafusion/core/src/dataframe/mod.rs:
##########
@@ -125,6 +125,14 @@ impl DataFrameWriteOptions {
self.sort_by = sort_by;
self
}
+
+ fn copy_to_options(&self) -> HashMap<String, String> {
Review Comment:
This is a bit confusingly named for what it does
##########
datafusion/core/src/dataframe/parquet.rs:
##########
@@ -324,4 +326,52 @@ mod tests {
Ok(())
}
+
+ /// Test that single_file_output works for paths WITHOUT file extensions.
+ /// This verifies the fix for the regression where extension heuristics
+ /// ignored the explicit with_single_file_output(true) setting.
+ #[tokio::test]
+ async fn test_single_file_output_without_extension() -> Result<()> {
Review Comment:
Do we have tests for all the modes? automatic, single, directory
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]