[ https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077193#comment-14077193 ]

Apache Spark commented on SPARK-2580:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1625

> Broken pipe collecting SchemaRDD results
> ----------------------------------------
>
>                 Key: SPARK-2580
>                 URL: https://issues.apache.org/jira/browse/SPARK-2580
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.0.0
>         Environment: fedora 21 local and rhel 7 clustered (standalone)
>            Reporter: Matthew Farrellee
>            Assignee: Davies Liu
>              Labels: py4j, pyspark
>
> {code}
> from pyspark.sql import SQLContext
> sqlCtx = SQLContext(sc)
> # cluster size affects the threshold at which this breaks (e.g. 2**15 vs 2**2)
> data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
> sdata = sqlCtx.inferSchema(data)
> sdata.first()
> {code}
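> The sensitivity to data size is consistent with the consumer hanging up once it has the single row first() needs, while the worker keeps writing the remaining batches; those writes fail once the read side is gone. A minimal sketch of that failure mode (plain Python pipes, not PySpark internals):
> {code}
> import errno, os
>
> r, w = os.pipe()
> os.close(r)                       # reader hangs up early, like first()
> try:
>     os.write(w, b'x' * 65536)     # worker keeps streaming result batches
> except OSError as e:
>     print(e)                      # [Errno 32] Broken pipe
>     assert e.errno == errno.EPIPE
> {code}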
> result (note: the result is returned along with the error):
> {code}
> >>> sdata.first()
> 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290
> 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true)
> 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290)
> 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
> 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
> 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
> 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2
> 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s
> {u'name': u'index', u'value': 0}
> >>> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", line 77, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 191, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 124, in dump_stream
>     self._write_with_length(obj, stream)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 139, in _write_with_length
>     stream.write(serialized)
> IOError: [Errno 32] Broken pipe
> Traceback (most recent call last):
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 130, in launch_worker
>     worker(listen_sock)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 119, in worker
>     outfile.flush()
> IOError: [Errno 32] Broken pipe
> {code}
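> If the early hang-up is expected for first()/take(n), one option is for the worker to treat EPIPE on its output stream as a normal end of output rather than a crash. A hypothetical sketch of such a guard (illustrative only; dump_stream_quietly is not a real PySpark function, and the actual change is in the pull request above):
> {code}
> import errno
>
> def dump_stream_quietly(serializer, iterator, stream):
>     # Hypothetical guard: stop writing once the JVM side has closed the
>     # connection, instead of killing the worker with a broken-pipe error.
>     try:
>         serializer.dump_stream(iterator, stream)
>     except IOError as e:
>         if e.errno != errno.EPIPE:
>             raise                 # genuine I/O errors still propagate
>         # EPIPE just means the consumer (e.g. first()) hung up early
> {code}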



--
This message was sent by Atlassian JIRA
(v6.2#6252)
