Michael Armbrust created SPARK-2117:
---------------------------------------
Summary: UTF8 Characters Break PySpark
Key: SPARK-2117
URL: https://issues.apache.org/jira/browse/SPARK-2117
Project: Spark
Issue Type: Bug
Components: PySpark
Reporter: Michael Armbrust
Assignee: Ahir Reddy
Here is a short reproduction:
{code}
import csv
sc.parallelize([u'\u5c06\u5185\u5bb9\u8bbe\u8ba1\u4e0e\u6f14\u7ece\u5f97\u66f4\u52a0\u5b9e\u6548\u751f\u52a8\uff0c\u6709\u6548\u7275\u52a8\u6bcf\u4f4d\u53c2\u52a0\u5b66\u5458\u7684\u5fc3\u5f26\uff1b']).mapPartitions(lambda iter: csv.reader(iter)).collect()
{code}
Here's the error:
{code}
Py4JJavaError: An error occurred while calling o620.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 67.0:71 failed 4 times, most recent failure: Exception failure in TID 2310 on host ip-10-0-184-51.ec2.internal: org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 123, in dump_stream
    for obj in iterator:
  File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 180, in _batched
    for item in iterator:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-29: ordinal not in range(128)
{code}
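For context: the traceback points at Python 2's csv module, which only accepts byte strings; fed a unicode line, it falls back to the implicit ASCII codec and raises the UnicodeEncodeError above. A minimal standalone sketch of the usual encode/decode workaround (no Spark required; the two-field row here is a made-up example, not the string from the reproduction):

{code}
# -*- coding: utf-8 -*-
import csv
import sys

# One CSV line with two non-ASCII fields.
lines = [u'\u5c06,\u5185\u5bb9']

if sys.version_info[0] >= 3:
    # Python 3's csv module handles unicode text natively.
    rows = list(csv.reader(lines))
else:
    # Python 2 workaround: encode to UTF-8 bytes before parsing,
    # then decode each parsed field back to unicode.
    encoded = (line.encode('utf-8') for line in lines)
    rows = [[field.decode('utf-8') for field in row]
            for row in csv.reader(encoded)]

print(rows)
{code}

Applied to the reproduction, the same encode-before/decode-after wrapping inside the mapPartitions function should keep csv.reader off the ASCII path on Python 2 workers.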
--
This message was sent by Atlassian JIRA
(v6.2#6252)