Jefffrey commented on issue #13323: URL: https://github.com/apache/datafusion/issues/13323#issuecomment-3604980696
So my understanding of this is as follows:

- #9041 removed the `single_file_output` from `FileSinkConfig` in order to have a more intuitive SQL API; that is, for SQL `COPY TO` statements it would now infer from the path provided whether it should write to a single file or a directory, based on the trailing `/` if present
- #13079 extended this to include checking if it ends with a valid extension; that is, if the path ends with `.parquet` then it writes to a single file, otherwise if it ends with `/` (including `.parquet/`) then it writes into a directory

Now we are running into the problem of the current `DataFrameWriteOptions::with_single_file_output` not being respected because it clashes with the above heuristics.

**Way forward**

I think trying to fix this issue by plugging in another heuristic in addition to the above is a bit hacky and not a long term solution (#17009). It seems it would cause more confusion and make things harder to maintain.

I think the suggestion in the original post has merit:

> Considering the introduction of the extension-based heuristic I would suggest the following behavior:
>
> - `with_single_file_output` is not called (`single_file_output == None`) - apply the heuristic
> - `with_single_file_output(true)` - produce a single file at the exact path specified
> - `with_single_file_output(false)` - create directory under specified path if doesn't exist and write one or many files

Whether we use this existing config or name it something else, I like this way of specifying whether it should use the default heuristic behaviour (more intuitive) or respect the user's choice if they explicitly want a single file or a directory, especially in cases where they might want to write a single file that doesn't end in a valid extension (e.g. `parquet`) for whatever reason.
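To make the proposed semantics concrete, here is a minimal standalone sketch of the tri-state resolution. It is not DataFusion's actual code: `writes_single_file` and `infer_single_file` are hypothetical names, the heuristic is a simplified stand-in for what #9041 and #13079 do, and treating a path with neither a trailing `/` nor the extension as a directory is my assumption.

```rust
/// Hypothetical helper (not a DataFusion API): decide whether a write
/// should produce a single file.
///
/// `explicit` models the proposed tri-state option:
/// - `None`: `with_single_file_output` was never called, so use the heuristic
/// - `Some(true)`: always write a single file at the exact path
/// - `Some(false)`: always write into a directory
pub fn writes_single_file(path: &str, explicit: Option<bool>, extension: &str) -> bool {
    match explicit {
        // An explicit user choice always wins over the heuristic.
        Some(choice) => choice,
        // No explicit choice: fall back to inferring from the path.
        None => infer_single_file(path, extension),
    }
}

/// Simplified stand-in for the path heuristic: a trailing `/` means
/// directory; otherwise a path ending in `.<extension>` means single file.
/// Assumption: anything else is treated as a directory.
pub fn infer_single_file(path: &str, extension: &str) -> bool {
    if path.ends_with('/') {
        return false;
    }
    path.ends_with(&format!(".{extension}"))
}
```

With this shape, `writes_single_file("out/data", Some(true), "parquet")` yields a single file even though the path has no extension, while passing `None` for the same path falls back to the heuristic and yields a directory.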
I admit I don't know how this would work in terms of the DataFusion internals and plumbing required, I haven't looked that far, but I think we should agree on this external API so we can then work our way inwards to implement it correctly.
