[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970259#comment-15970259 ]

HanCheol Cho edited comment on SPARK-20336 at 4/17/17 1:18 AM:
---------------------------------------------------------------

Hi, [~hyukjin.kwon] 

I found that this case only happens when I run it in YARN mode, not local 
mode, and the cluster used here was running different Python versions: 
Anaconda Python 2.7.11 on the client node and the system's Python 2.7.5 on 
the worker nodes.
Other system configurations, such as the locale (en_US.UTF-8), were the same.
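
As a quick check of this hypothesis, the interpreter versions on the driver 
and on the executors can be compared directly. A minimal sketch (it only 
assumes a running SparkSession):

{code:none}
from __future__ import print_function
import platform

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def worker_python_version(_):
    # Imported inside the function so the lookup happens on the worker side.
    import platform
    return platform.python_version()

# Interpreter used by the driver (client node).
print("driver: ", platform.python_version())

# Distinct interpreter versions used by the executors (worker nodes).
print("workers:", sc.parallelize(range(100), 10)
                    .map(worker_python_version)
                    .distinct()
                    .collect())
{code}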

However, I am not yet sure whether this is the root cause.
I will test it once again after updating the cluster's Python, but it will 
take some time since other team members also use the cluster.
I think I can provide additional reports next week. Would that be okay?
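
In the meantime, one way to rule the version mismatch in or out without 
touching the cluster's system Python might be to pin the worker-side 
interpreter before the SparkContext starts. A minimal sketch; the path is a 
placeholder and must exist on every worker node:

{code:none}
import os
from pyspark.sql import SparkSession

# PYSPARK_PYTHON is read when the SparkContext is created and is shipped to
# the workers, which use it to launch their Python worker processes.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"  # placeholder path

spark = SparkSession.builder.master("yarn").getOrCreate()
{code}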




> spark.read.csv() with wholeFile=True option fails to read non-ASCII Unicode 
> characters
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-20336
>                 URL: https://issues.apache.org/jira/browse/SPARK-20336
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>            Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load 
> data that has multi-line records.
> However, non-ASCII characters are not loaded properly.
> The following is sample data for the test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When the file is loaded without the wholeFile=True option, non-ASCII 
> characters are shown correctly, although multi-line records are parsed 
> incorrectly:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken, as 
> shown below:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+--------------------+
> |col1|col2|                col3|
> +----+----+--------------------+
> |   1|   a|                text|
> |   2|   b|        ������������|
> |   3|   c|           ���������|
> |   4|   d|text
> ������������...|
> |   5|   e|                last|
> +----+----+--------------------+
> {code}
> The result is the same even if I use the encoding="UTF-8" option together 
> with wholeFile=True.
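>
> A quick way to verify that the file bytes themselves are intact is to read 
> the file outside the CSV reader. wholeTextFiles() decodes file contents as 
> UTF-8 on a separate code path, so if the text prints correctly here, the 
> corruption happens inside the wholeFile CSV parsing (a minimal sketch):
> {code:none}
> raw = spark.sparkContext.wholeTextFiles("test.encoding.csv").values().first()
> print(raw)
> {code}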


