Matthew Farrellee created SPARK-2580:
----------------------------------------

             Summary: Broken pipe when collecting SchemaRDD results
                 Key: SPARK-2580
                 URL: https://issues.apache.org/jira/browse/SPARK-2580
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.0.0
         Environment: Fedora 21 (local) and RHEL 7 (standalone cluster)
            Reporter: Matthew Farrellee


{code}
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)  # sc is the SparkContext provided by the pyspark shell
# size of cluster impacts where this breaks (i.e. 2**15 vs 2**2)
data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
sdata = sqlCtx.inferSchema(data)
sdata.first()
{code}
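
The comment in the snippet suggests the dataset size at which this breaks depends on the cluster size. A minimal probe for finding that threshold on a given deployment, assuming the same pyspark shell session as above (the loop bounds are only illustrative):

{code}
# increase the dataset size until first() starts tripping the broken-pipe
# error on this deployment; sc and sqlCtx are from the snippet above
for exp in [2, 10, 15, 20]:
    rdd = sc.parallelize([{'name': 'index', 'value': 0}] * 2**exp)
    print exp, sqlCtx.inferSchema(rdd).first()
{code}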

Result (note: the expected row is returned as well as the error):
{code}
>>> sdata.first()
14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290
14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true)
14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290)
14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2
14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s
{u'name': u'index', u'value': 0}
>>> PySpark worker failed with exception:
Traceback (most recent call last):
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", 
line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", 
line 191, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", 
line 124, in dump_stream
    self._write_with_length(obj, stream)
  File 
"/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", 
line 139, in _write_with_length
    stream.write(serialized)
IOError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", 
line 130, in launch_worker
    worker(listen_sock)
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", 
line 119, in worker
    outfile.flush()
IOError: [Errno 32] Broken pipe
{code}
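
As noted above, the expected row still comes back despite the worker-side traceback; a quick check in the same session (the assert is only illustrative):

{code}
row = sdata.first()
# the result is returned correctly even though the worker logs a broken pipe
assert row == {'name': 'index', 'value': 0}
{code}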


