Jefffrey commented on issue #13323:
URL: https://github.com/apache/datafusion/issues/13323#issuecomment-3604980696

   So my understanding of this is as follows:
   
   - #9041 removed `single_file_output` from `FileSinkConfig` in order to 
make the SQL API more intuitive; that is, for SQL `COPY TO` statements it now 
infers from the provided path whether to write to a single file or a 
directory, based on whether the path ends with a trailing `/`
   - #13079 extended this to also check whether the path ends with a valid 
extension; that is, if the path ends with `.parquet` then it writes to a single 
file, otherwise if it ends with `/` (including `.parquet/`) then it writes into 
a directory
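
To make the two rules above concrete, here is a minimal sketch of the path heuristic as I understand it. It is not DataFusion's actual code: `infer_single_file` is a hypothetical name, and treating a path that matches neither rule as a directory is an assumption of this sketch.

```rust
/// Simplified sketch of the path-based heuristic (hypothetical helper,
/// not a DataFusion API). Returns `true` for "write a single file".
fn infer_single_file(path: &str, extension: &str) -> bool {
    // A trailing `/` always means "directory", even for `out.parquet/`.
    if path.ends_with('/') {
        return false;
    }
    // Otherwise, a path ending in the expected extension means "single file".
    // Assumption: a path matching neither rule is treated as a directory.
    let suffix = format!(".{}", extension.trim_start_matches('.'));
    path.ends_with(suffix.as_str())
}
```

Under this sketch, `out/data.parquet` is inferred as a single file, while `out/data.parquet/`, `out/` and `out/data` are all inferred as directories.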
   
   Now we are running into the problem of the current 
`DataFrameWriteOptions::with_single_file_output` not being respected because it 
clashes with the above heuristics.
   
   **Way forward**
   
   I think trying to fix this issue by plugging in another heuristic on 
top of the above is a bit hacky and not a long-term solution (#17009).
   
   It seems it would cause more confusion and make things harder to maintain. I 
think the suggestion in the original post has merit:
   
   > Considering the introduction of the extension-based heuristic I would 
suggest the following behavior:
   > - `with_single_file_output` is not called (`single_file_output == None`) - 
apply the heuristic
   > - `with_single_file_output(true)` - produce a single file at the exact 
path specified
   > - `with_single_file_output(false)` - create directory under specified path 
if doesn't exist and write one or many files
   
   Whether we reuse this existing config or name it something else, I like this 
approach: by default the (more intuitive) heuristic applies, but the user's 
choice is respected if they explicitly ask for a single file or a directory. 
That matters especially when they want to write a single file whose name 
doesn't end in a valid extension (e.g. `parquet`) for whatever reason.
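
As a rough sketch of the proposed external behaviour (not an implementation against DataFusion internals), the three cases from the quoted suggestion map onto an `Option<bool>`. `OutputMode` and `resolve_output_mode` are hypothetical names used only for illustration, and `heuristic_single_file` stands in for whatever the existing path heuristic decides.

```rust
/// Hypothetical type for illustration; not part of DataFusion.
#[derive(Debug, PartialEq)]
enum OutputMode {
    SingleFile,
    Directory,
}

/// Resolve the output mode from the user's explicit choice, falling back
/// to the path heuristic only when no choice was made.
fn resolve_output_mode(
    single_file_output: Option<bool>,
    heuristic_single_file: bool,
) -> OutputMode {
    match single_file_output {
        // `with_single_file_output(true)`: single file at the exact path.
        Some(true) => OutputMode::SingleFile,
        // `with_single_file_output(false)`: directory with one or many files.
        Some(false) => OutputMode::Directory,
        // Option never set: defer to the heuristic.
        None if heuristic_single_file => OutputMode::SingleFile,
        None => OutputMode::Directory,
    }
}
```

The point of the `Option` is that an explicit `Some(..)` always wins, so a user can force a single file at a path like `out/data` even though the heuristic alone would pick a directory.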
   
   I admit I don't know how this would work in terms of the DataFusion 
internals and the plumbing required, as I haven't looked that far. Still, I 
think we should agree on this external API first so we can then work our way 
inwards to implement it correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

