[ https://issues.apache.org/jira/browse/SPARK-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust resolved SPARK-2117.
-------------------------------------
    Resolution: Won't Fix

I realized this was actually an issue with the CSV reader.

> UTF8 Characters Break PySpark
> -----------------------------
>
>                 Key: SPARK-2117
>                 URL: https://issues.apache.org/jira/browse/SPARK-2117
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Michael Armbrust
>            Assignee: Ahir Reddy
>
> Here is a short reproduction:
> {code}
> import csv
> sc.parallelize([u'\u5c06\u5185\u5bb9\u8bbe\u8ba1\u4e0e\u6f14\u7ece\u5f97\u66f4\u52a0\u5b9e\u6548\u751f\u52a8\uff0c\u6709\u6548\u7275\u52a8\u6bcf\u4f4d\u53c2\u52a0\u5b66\u5458\u7684\u5fc3\u5f26\uff1b']).mapPartitions(lambda iter: csv.reader(iter)).collect()
> {code}
> Here's the error:
> {code}
> Py4JJavaError: An error occurred while calling o620.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 67.0:71 failed 4 times, most recent failure: Exception failure in TID 2310 on host ip-10-0-184-51.ec2.internal: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 77, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 191, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 123, in dump_stream
>     for obj in iterator:
>   File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 180, in _batched
>     for item in iterator:
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-29: ordinal not in range(128)
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
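For context, the UnicodeEncodeError above is characteristic of Python 2, whose csv module only accepts byte strings; a common workaround at the time was to encode each unicode line to UTF-8 before handing it to csv.reader and decode the fields afterwards. The minimal sketch below (an illustration, not the resolution adopted in this ticket) runs under Python 3, where csv handles unicode text natively; note that the reproduction string contains only fullwidth punctuation (U+FF0C, U+FF1B), so csv sees no ASCII delimiters and returns it as a single field:

```python
import csv

# The unicode line from the reproduction above. It contains fullwidth
# punctuation only (U+FF0C comma, U+FF1B semicolon), so the csv module
# finds no ASCII comma and yields the whole line as one field.
sample = ('\u5c06\u5185\u5bb9\u8bbe\u8ba1\u4e0e\u6f14\u7ece\u5f97\u66f4'
          '\u52a0\u5b9e\u6548\u751f\u52a8\uff0c\u6709\u6548\u7275\u52a8'
          '\u6bcf\u4f4d\u53c2\u52a0\u5b66\u5458\u7684\u5fc3\u5f26\uff1b')

# Under Python 3, csv.reader works on text directly - no encoding step.
rows = list(csv.reader([sample]))
assert rows == [[sample]]

# The Python 2 era workaround pattern: round-trip through UTF-8 bytes so
# the csv module (which only handled bytes there) never saw a unicode
# object. Shown here only as an encode/decode round-trip check.
assert sample.encode('utf-8').decode('utf-8') == sample
```

Applied to the reproduction, the workaround would amount to encoding inside the mapPartitions closure before calling csv.reader and decoding each field of each row afterwards.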