[
https://issues.apache.org/jira/browse/SPARK-50616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-50616:
----------------------------------
Component/s: SQL
(was: Spark Core)
> Add File Extension Option to CSV DataSource Writer
> --------------------------------------------------
>
> Key: SPARK-50616
> URL: https://issues.apache.org/jira/browse/SPARK-50616
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.3
> Reporter: James Baugh
> Assignee: James Baugh
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> h3. What changes were proposed in this pull request?
> The existing CSV DataSource allows one to set the delimiter/separator but
> does not allow the changing of the file extension. This means that a file can
> have values separated by tabs but be marked as a ".csv" file. This change
> allows one to change the file extension to match the delimiter/separator
> (e.g. ".tsv" for a tab separated value file).
> PR: [https://github.com/apache/spark/pull/49233]
> h3. Why are the changes needed?
> This PR adds a "fileExtension" option. As a result, when the separator is
> not a comma, the output file can be given a file extension that matches the
> separator (e.g. file.tsv, file.psv, etc.).
> Notes on Previous Pull Request
> [#17973|https://github.com/apache/spark/pull/17973]
> A pull request adding this option was discussed 7 years ago. One reason it
> wasn't added was:
> "I would like to suggest to leave this out if there is no better reason for
> now. Downside of this is, it looks this allows arbitrary name and it does not
> guarantee the extension is, say, tsv when the delimiter is a tab. It is purely
> up to the user."
> I don't believe this is a good reason to not let the user set the extension.
> If we let them set the delimiter/separator to an arbitrary string/char, then
> why not also let the user set the file extension to indicate the separator
> that the file uses (e.g. tsv, psv, etc.)? This addition keeps the "csv"
> file extension as the default and has the benefit of allowing other
> separators to match the file extension.
> h3. Does this PR introduce _any_ user-facing change?
> Yes. This PR adds one row to the options table for the CSV DataSource
> documentation to include the "fileExtension" option.
> h3. How was this patch tested?
> One unit test was added to validate a file is written with the new extension.
> h3. Was this patch authored or co-authored using generative AI tooling?
> No
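
A minimal Python sketch of how a caller might keep the separator and the extension in step, which is the concern raised in the earlier review. The helper `writer_options` and its lookup table are hypothetical and not part of Spark. The option name "fileExtension" is taken from the issue text above; the name in the released API may differ, so check the CSV data source options table before relying on it.

```python
# Hypothetical helper, not part of Spark: derive the extension from the
# separator so the two cannot drift apart.
_EXTENSION_BY_SEPARATOR = {",": "csv", "\t": "tsv", "|": "psv"}


def writer_options(sep=","):
    """Return CSV writer options whose extension matches the separator."""
    # Separators without a conventional extension fall back to "csv",
    # which the issue keeps as the default.
    extension = _EXTENSION_BY_SEPARATOR.get(sep, "csv")
    # Assumption: "fileExtension" is the option name as written in this
    # issue; the released option name may differ.
    return {"sep": sep, "fileExtension": extension}


# Usage with an existing PySpark DataFrame `df` (needs PySpark, not run here):
#   df.write.options(**writer_options("\t")).csv("/tmp/out")
```

Deriving the extension from the separator in one place is only one possible convention; the option itself leaves the choice to the user.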
--
This message was sent by Atlassian Jira
(v8.20.10#820010)