[ https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen resolved SPARK-2580.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 1.0.3
                   1.1.0
    Target Version/s: 1.0.2

> broken pipe collecting schemardd results
> ----------------------------------------
>
>                 Key: SPARK-2580
>                 URL: https://issues.apache.org/jira/browse/SPARK-2580
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.0.0
>         Environment: fedora 21 local and rhel 7 clustered (standalone)
>            Reporter: Matthew Farrellee
>            Assignee: Davies Liu
>              Labels: py4j, pyspark
>             Fix For: 1.1.0, 1.0.3
>
> {code}
> from pyspark.sql import SQLContext
> sqlCtx = SQLContext(sc)
> # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2)
> data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
> sdata = sqlCtx.inferSchema(data)
> sdata.first()
> {code}
> result: note - the result is returned as well as the error
> {code}
> >>> sdata.first()
> 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290
> 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true)
> 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290)
> 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
> 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
> 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
> 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2
> 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s
> {u'name': u'index', u'value': 0}
> >>> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", line 77, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 191, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 124, in dump_stream
>     self._write_with_length(obj, stream)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 139, in _write_with_length
>     stream.write(serialized)
> IOError: [Errno 32] Broken pipe
> Traceback (most recent call last):
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 130, in launch_worker
>     worker(listen_sock)
>   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 119, in worker
>     outfile.flush()
> IOError: [Errno 32] Broken pipe
> {code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
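The traceback above is a classic EPIPE pattern: the reader (here, the JVM side of the worker socket, which only needs the first row for `first()`) closes its end while the Python worker is still writing serialized results, so the worker's `stream.write` fails with `[Errno 32] Broken pipe`. The mechanism can be reproduced outside Spark with a plain socket pair; this is a hedged, self-contained sketch of the failure mode (the function name and sizes are illustrative, not from Spark's code):

```python
import socket

def write_to_closed_peer(num_chunks=100, chunk=b"x" * 4096):
    """Simulate a producer that keeps writing after the consumer closed
    its end; returns the OSError raised, or None if every write succeeded."""
    reader, writer = socket.socketpair()
    reader.close()  # consumer stops reading (as the JVM does after one row)
    try:
        for _ in range(num_chunks):
            writer.sendall(chunk)  # eventually hits the closed pipe
    except OSError as exc:
        return exc
    finally:
        writer.close()
    return None

err = write_to_closed_peer()
print(type(err).__name__ if err else "no error")
```

On Linux, writing to an AF_UNIX socket whose peer has closed raises `BrokenPipeError` (errno 32), matching the `IOError: [Errno 32] Broken pipe` in the report; the fix on the Spark side was to stop treating this expected disconnect as a worker crash.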