[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

HanCheol Cho updated SPARK-20336:
---------------------------------
    Description: 
I used the spark.read.csv() method with the wholeFile=True option to load data that has
multi-line records.
However, non-ASCII characters are not loaded properly.

The following is sample data for testing:
{code:none}
col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트"
5,e,last
{code}
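
To reproduce, the sample file can be written to disk as follows (a minimal sketch; that the file is UTF-8 on disk is an assumption, not something stated above):
{code:python}
# Sketch only: write the sample data as UTF-8 so the read calls
# below can reproduce the behavior. UTF-8 on disk is an assumption
# about the original file, not a confirmed detail.
import io

sample = u'''col1,col2,col3
1,a,text
2,b,テキスト
3,c,텍스트
4,d,"text
テキスト
텍스트"
5,e,last
'''

with io.open("test.encoding.csv", "w", encoding="utf-8") as f:
    f.write(sample)
{code}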

When the file is loaded without the wholeFile=True option, non-ASCII characters are shown
correctly, although multi-line records are parsed incorrectly, as follows:
{code:none}
testdf_default = spark.read.csv("test.encoding.csv", header=True)
testdf_default.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
| 텍스트"|null|null|
|   5|   e|last|
+----+----+----+
{code}

When the wholeFile=True option is used, non-ASCII characters are garbled, as follows:
{code:none}
testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
wholeFile=True)
testdf_wholefile.show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|        ������������|
|   3|   c|           ���������|
|   4|   d|text
������������...|
|   5|   e|                last|
+----+----+--------------------+
{code}
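
The replacement characters look like a charset-decoding problem. One hypothesis (not verified) is that the wholeFile code path decodes the stream with the JVM's platform-default charset instead of the requested encoding. That default can be inspected from PySpark through the py4j gateway:
{code:python}
# Hypothesis check only: print the JVM's platform-default charset,
# which the wholeFile path might be falling back to. The _jvm
# attribute is py4j plumbing on the SparkSession, not a public API.
print(spark._jvm.java.nio.charset.Charset.defaultCharset().toString())
# e.g. "US-ASCII" or "UTF-8", depending on the platform locale
{code}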

The result is the same even if I use the encoding="UTF-8" option together with wholeFile=True.
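
Until this is fixed, one possible workaround (a sketch, assuming Python 3 and a UTF-8 input file small enough to parse per-partition) is to decode the file on the Python side with sparkContext.wholeTextFiles(), which reads content as UTF-8, and let Python's csv module handle the quoted multi-line record:
{code:python}
# Workaround sketch: bypass the JVM CSV reader entirely.
# Assumes UTF-8 input and Python 3 (whose csv module accepts
# unicode text directly).
import csv
import io

def parse_csv(path_and_content):
    # wholeTextFiles yields (path, content) pairs; content is
    # already decoded as UTF-8.
    _, content = path_and_content
    return list(csv.reader(io.StringIO(content)))

rows = (spark.sparkContext
        .wholeTextFiles("test.encoding.csv")
        .flatMap(parse_csv)
        .collect())
header, data = rows[0], rows[1:]
testdf_workaround = spark.createDataFrame(data, schema=header)
testdf_workaround.show()
{code}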





> spark.read.csv() with wholeFile=True option fails to read non-ASCII Unicode
> characters
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-20336
>                 URL: https://issues.apache.org/jira/browse/SPARK-20336
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Spark 2.2.0 (master branch downloaded from GitHub)
> PySpark
>            Reporter: HanCheol Cho
>


