Github user HyukjinKwon commented on the pull request:
https://github.com/apache/spark/pull/11016#issuecomment-199118697
I found a similar issue,
[SPARK-1849](https://issues.apache.org/jira/browse/SPARK-1849).
I think we might have to drop support for non-ASCII-compatible encodings.
This PR looks like it will support encodings in general, but I cannot
guarantee that it supports all of them: some encodings may write a
BOM-like header.
Since Spark CSV already supports the encoding option, I cannot come up
with more than the three options below:
- Only the CSV data source supports some encodings, for backward
compatibility, but it excludes non-ASCII-compatible encodings and throws an
exception when one is given.
- The CSV data source supports other encodings in this way, but the
documentation mentions that not all encodings are guaranteed to work.
- Support all encodings and add tests for all of them (maybe
with this [encoding
list](https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html)
in Java).
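
To make the first option concrete, here is a minimal sketch in Python (not Spark code) of how an encoding could be checked for ASCII compatibility and rejected with an exception otherwise. The helper names `is_ascii_compatible`, `writes_bom`, and `check_encoding` are hypothetical and only for illustration; the sketch assumes "ASCII-compatible" means every ASCII character encodes to its own single byte.

```python
import codecs

# All 128 ASCII characters as one string.
ASCII_TEXT = bytes(range(128)).decode("ascii")

# Byte-order marks defined by the standard library.
KNOWN_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def is_ascii_compatible(encoding: str) -> bool:
    """Return True if every ASCII character encodes to its own single byte."""
    try:
        return ASCII_TEXT.encode(encoding) == ASCII_TEXT.encode("ascii")
    except (UnicodeEncodeError, LookupError):
        # Unknown encoding, or one that cannot represent some ASCII character.
        return False


def writes_bom(encoding: str) -> bool:
    """Return True if encoding a string prepends one of the known BOMs."""
    return "a".encode(encoding).startswith(KNOWN_BOMS)


def check_encoding(encoding: str) -> str:
    """Hypothetical guard: reject encodings that are not ASCII-compatible."""
    if not is_ascii_compatible(encoding):
        raise ValueError(f"Unsupported non-ASCII-compatible encoding: {encoding}")
    return encoding
```

With this check, `utf-8` and `latin-1` pass, while `utf-16` is rejected both because it uses two bytes per ASCII character and because it writes a BOM.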
@srowen Could you give some feedback, please?