Adrian Bridgett created SPARK-18571:
---------------------------------------
Summary: pyspark: UTF-8 not written correctly (as CSV) when locale
is not UTF-8
Key: SPARK-18571
URL: https://issues.apache.org/jira/browse/SPARK-18571
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.0.2
Reporter: Adrian Bridgett
Sample code is attached; it was run with Hadoop 2.7.3.
If I run this with --master='local[*]' and LANG=en_US.UTF-8, then cat the file in
_another_ terminal (which also has LANG=en_US.UTF-8 set), I see the Pi character
I expect.
Back in the first terminal, if I set LANG=C (or unset it) and rerun, the output
viewed in the other terminal (still set to en_US.UTF-8) is corrupted.
I first noticed this because the data is corrupted when I run it under our normal
Mesos scheduler (those boxes do have LANG=en_US.UTF-8, but perhaps it's not
being picked up).
I don't remember needing to set LANG on Spark 1.6.1 (Hadoop 2.7.1).
Expected characters: 0x80cf
Received: 0xbfef efbd bdbf
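The two hex values above can be read as follows. This is a minimal sketch, not taken from the attached sample code, and it rests on two assumptions: that the values were printed by a hexdump-style tool showing 16-bit little-endian words, and that somewhere in the write path the UTF-8 bytes were decoded with an ASCII-like charset. Under those assumptions the received bytes are simply two U+FFFD replacement characters, one per byte of the original Pi. The helper name hexdump_words is hypothetical.

```python
import struct


def hexdump_words(data: bytes) -> str:
    """Render bytes as 16-bit little-endian words (hypothetical helper,
    mimicking the default output style of a hexdump-like tool)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input so it splits into whole words
    words = struct.unpack("<%dH" % (len(data) // 2), data)
    return " ".join("%04x" % w for w in words)


# U+03C0 (Greek small letter pi) is the two bytes CF 80 in UTF-8.
pi_utf8 = "\u03c0".encode("utf-8")

# Assumed failure mode: the bytes are decoded as ASCII, so each non-ASCII
# byte becomes U+FFFD, and the result is then written out again as UTF-8.
corrupted = pi_utf8.decode("ascii", errors="replace").encode("utf-8")

print(hexdump_words(pi_utf8))    # 80cf
print(hexdump_words(corrupted))  # bfef efbd bdbf
```

If this reading is right, the corruption happens before the file is written, so no choice of terminal locale when viewing the file can recover the original character.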