[
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982669#comment-15982669
]
Armin Braun commented on SPARK-20336:
-------------------------------------
[~priancho] my bad in the above, apparently. I can't retrace the exact version I
ran on (I may have mistakenly run an old revision, sorry about that).
But I see the same with `master` revision `31345fde82` from today:
{code}
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/04/25 12:14:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/25 12:14:57 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://192.168.178.57:4040
Spark context available as 'sc' (master = yarn, app id = application_1493115274587_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.option("wholeFile", true).option("header", true).csv("file:///tmp/sample.csv").show()
+----+----+-------------+
|col1|col2| col3|
+----+----+-------------+
| 1| a| text|
| 2| b| テキスト|
| 3| c| 텍스트|
| 4| d|text
テキスト
텍스트|
| 5| e| last|
+----+----+-------------+
{code}
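For reference, the PySpark equivalent of the Scala call above (a sketch with the same options as the reporter's; not something I re-ran in the session above) would be:
{code}
# PySpark equivalent of the Scala reproduction above (sketch, same options).
df = spark.read.option("wholeFile", True).option("header", True).csv("file:///tmp/sample.csv")
df.show()
{code}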
> spark.read.csv() with wholeFile=True option fails to read non-ASCII Unicode characters
> --------------------------------------------------------------------------------------
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
> Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load
> data that has multi-line records. However, non-ASCII characters are not
> loaded properly.
> The following is sample data for testing:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
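> To make the reproduction deterministic, a minimal helper to write the sample
> above as UTF-8 (a sketch; it assumes the file is read back from the working
> directory under the same name used in the reads below):
> {code:none}
> # Write the sample as UTF-8 so the on-disk encoding is unambiguous.
> import io
>
> sample = u'''col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> '''
> with io.open("test.encoding.csv", "w", encoding="utf-8") as f:
>     f.write(sample)
> {code}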
> When the file is loaded without the wholeFile=True option, non-ASCII
> characters are shown correctly, although multi-line records are parsed
> incorrectly, as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> | 1| a|text|
> | 2| b|テキスト|
> | 3| c| 텍스트|
> | 4| d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> | 5| e|last|
> +----+----+----+
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken, as
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
> testdf_wholefile.show()
> +----+----+--------------------+
> |col1|col2| col3|
> +----+----+--------------------+
> | 1| a| text|
> | 2| b| ������������|
> | 3| c| ���������|
> | 4| d|text
> ������������...|
> | 5| e| last|
> +----+----+--------------------+
> {code}
> The result is the same even if I use the encoding="UTF-8" option together with wholeFile=True.
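> A possible workaround (a sketch, not verified against this bug; it assumes
> the corruption happens inside the wholeFile CSV reader rather than in the
> data itself, and Python 3 on the workers): read the file as UTF-8 text with
> wholeTextFiles and parse it with Python's csv module, which handles quoted
> multi-line fields:
> {code:none}
> # Workaround sketch (not from the report): sc.wholeTextFiles decodes file
> # contents as UTF-8 itself, so the bytes never pass through the wholeFile
> # CSV reader.
> import csv
> import io
>
> raw = spark.sparkContext.wholeTextFiles("test.encoding.csv")
> rows = raw.flatMap(lambda pair: list(csv.reader(io.StringIO(pair[1]))))
> header = rows.first()
> df = rows.filter(lambda r: r != header).toDF(header)
> df.show()
> {code}
> This keeps each multi-line record intact because csv.reader honors the
> surrounding quotes, at the cost of loading each file as a single string.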