[ https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970944#comment-15970944 ]

HanCheol Cho edited comment on SPARK-20336 at 4/17/17 10:31 AM:
----------------------------------------------------------------

This time, I first checked whether PySpark uses the same version of Python on both
the client and the worker nodes, using the following code.
The result shows that all servers are using the same version.

{code}
$ pyspark --master yarn --num-executors 12

# client
import sys
print sys.version
Python 2.7.13 :: Anaconda custom (64-bit)

# workers
rdd = spark.sparkContext.parallelize(range(100), 16)
def get_hostname_and_python_version():
    import socket, sys
    return socket.gethostname() + " --- " + sys.version

res = rdd.map(lambda r: get_hostname_and_python_version()) 
for item in set(res.collect()):
    print item
slave01 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)  
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave05 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave06 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave02 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave03 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave04 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
{code}
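
Since this is an encoding issue, a variant of the same check could also report each
worker's default and preferred encodings. The following is only a sketch of mine
(not part of the original test) and assumes the same pyspark shell above, where
spark is already defined:

{code}
# My own sketch (assumes the pyspark shell above, where `spark` is defined):
# report each worker's default encoding and preferred locale encoding as well,
# since this issue is about character decoding.
def get_hostname_and_encodings():
    import socket, sys, locale
    return "%s --- default=%s, preferred=%s" % (
        socket.gethostname(),
        sys.getdefaultencoding(),
        locale.getpreferredencoding())

enc_res = spark.sparkContext.parallelize(range(100), 16) \
               .map(lambda r: get_hostname_and_encodings())
for item in sorted(set(enc_res.collect())):
    print item
{code}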


Unfortunately, the test result was the same: spark.read.csv() with the wholeFile=True
option still fails to read non-ASCII text correctly.
An interesting point is that the broken non-ASCII characters shown as � are
actually a series of \ufffd, the Unicode replacement character. As I understand it,
it appears when input bytes cannot be decoded with the given codec.
In addition, testing with spark-shell showed the same results (it works in
local mode but not in YARN mode).
Please see the following code snippets for the details.



{code}
#---------------------------------------------#
# Test spark.read.csv() with wholeFile option #
#---------------------------------------------#

$ pyspark --master yarn --num-executors 12

# what is test data?
csv_raw = spark.read.text("test.encoding.csv")
csv_raw.show()
+--------------+
|         value|
+--------------+
|col1,col2,col3|
|      1,a,text|
|      2,b,テキスト|
|       3,c,텍스트|
|     4,d,"text|
|          テキスト|
|          텍스트"|
|      5,e,last|
+--------------+

# loading csv in one-record-per-line fashion
csv_default = spark.read.csv("test.encoding.csv", header=True)
csv_default.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
|텍스트"|null|null|
|   5|   e|last|
+----+----+----+

# loading csv in wholeFile mode
csv_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
csv_wholefile.show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|        ������������|
|   3|   c|           ���������|
|   4|   d|text
������������...|
|   5|   e|                last|
+----+----+--------------------+

csv_wholefile.collect()[3]                                                      
                                  
Row(col1=u'4', col2=u'd', 
col3=u'text\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd')


#-----------------------#
# Test with spark-shell #
#-----------------------#

$ spark-shell --master local[4]
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
+----+----+-------------+
|col1|col2|         col3|
+----+----+-------------+
|   1|   a|         text|
|   2|   b|         テキスト|
|   3|   c|          텍스트|
|   4|   d|text
テキスト
텍스트|
|   5|   e|         last|
+----+----+-------------+

$ spark-shell --num-executors 12 --master yarn
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
+----+----+--------------------+
|col1|col2|                col3|
+----+----+--------------------+
|   1|   a|                text|
|   2|   b|        ������������|
|   3|   c|           ���������|
|   4|   d|text
������������...|
|   5|   e|                last|
+----+----+--------------------+

{code}
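
As a side note on the \ufffd pattern above, the following is a plain, local Python 2
sketch of mine (independent of Spark) that shows the mechanism: decoding UTF-8 bytes
with an incompatible codec and the 'replace' error handler yields one U+FFFD per
undecodable byte, which matches the 12 and 9 replacement characters observed above
for テキスト and 텍스트. It does not tell where in Spark the wrong decode happens, only
that the output looks like bytes decoded with the wrong codec.

{code}
# -*- coding: utf-8 -*-
# Local Python 2 illustration (my own, not part of the Spark test above).
jp = u"テキスト".encode("utf-8")   # 4 characters -> 12 bytes in UTF-8
ko = u"텍스트".encode("utf-8")     # 3 characters -> 9 bytes in UTF-8

# The ASCII codec cannot decode these bytes; with the 'replace' error handler
# every undecodable byte becomes U+FFFD, the replacement character.
print repr(jp.decode("ascii", "replace"))   # 12 x u'\ufffd'
print repr(ko.decode("ascii", "replace"))   # 9 x u'\ufffd'
{code}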


The following is the test for spark.read.json() with the wholeFile option. It works
without any problem.
However, I would like to point out that the semantics of the wholeFile option in
json() differ from those in spark.read.csv(): with json(), one file must
contain only one record. I think this can confuse users.

{code}
#---------------------------------------------#
# Test spark.read.json() with wholeFile option #
#---------------------------------------------#

# what is test data?
json_raw = spark.read.text("test.encoding.json")
json_raw.show(truncate=False)
+----------------------------+
|value                       |
+----------------------------+
|{"col1":4,                  |
| "col2":"d",                |
| "col3":"text\nテキスト\n텍스트"}|
|                            |
+----------------------------+

# loading json in one-record-per-line fashion
json_default = spark.read.json("test.encoding.json")
json_default.show(truncate=False)
+----------------------------+
|_corrupt_record             |
+----------------------------+
|{"col1":4,                  |
| "col2":"d",                |
| "col3":"text\nテキスト\n텍스트"}|
+----------------------------+

# loading json in wholeFile mode
json_wholefile = spark.read.json("test.encoding.json", wholeFile=True)
json_wholefile.show()
+----+----+-------------+
|col1|col2|         col3|
+----+----+-------------+
|   4|   d|text
テキスト
텍스트|
+----+----+-------------+
print json_wholefile.take(1)[0].col3
text
テキスト
텍스트
{code}


I think it would be very helpful if someone could test Spark 2.2.0 in YARN mode
to confirm whether this is a server configuration problem rather than a Spark problem.
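
In the meantime, one more check I could try on our cluster (again only a sketch of
mine from the pyspark shell, not something I have run yet) is to read the raw bytes
with binaryFiles() and decode them as UTF-8 on the executors. If this prints the
Japanese/Korean text correctly under --master yarn, the bytes reach the workers
intact, and the problem is more likely in the multi-line CSV parsing path than in
how the file is shipped or in the cluster configuration:

{code}
# My own diagnostic sketch (assumes the same pyspark --master yarn shell).
# binaryFiles() returns (path, raw bytes) pairs; decoding on the executors
# bypasses the CSV reader entirely.
raw = spark.sparkContext.binaryFiles("test.encoding.csv") \
          .map(lambda kv: kv[1].decode("utf-8"))
print raw.collect()[0]
{code}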




> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-20336
>                 URL: https://issues.apache.org/jira/browse/SPARK-20336
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>         Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>            Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that 
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+--------------------+
> |col1|col2|                col3|
> +----+----+--------------------+
> |   1|   a|                text|
> |   2|   b|        ������������|
> |   3|   c|           ���������|
> |   4|   d|text
> ������������...|
> |   5|   e|                last|
> +----+----+--------------------+
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.


