Re: pyspark is crashing in this case. why?
Adding the group back. FYI Genesis - this was on an m3.xlarge with all default settings in Spark. I used Spark version 1.3.0. The 2nd case did work for me:

>>> a = [1,2,3,4,5,6,7,8,9]
>>> b = []
>>> for x in range(1000000):
...     b.append(a)
...
>>> rdd1 = sc.parallelize(b)
>>> rdd1.first()
14/12/15 16:33:01 WARN TaskSetManager: Stage 1 contains a task of very large size (9766 KB). The maximum recommended task size is 100 KB.
[1, 2, 3, 4, 5, 6, 7, 8, 9]

On Sun, Dec 14, 2014 at 2:13 PM, Genesis Fatum genesis.fa...@gmail.com wrote:
> Hi Sameer,
> I have tried multiple configurations. For example, executor and driver memory at 2G. I also played with the JRE memory size parameters (-Xms) and got the same error. Does it work for you? I think it is a setup issue on my side, although I have tried a couple of laptops.
> Thanks
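A side note on the TaskSetManager warning above: with sc.parallelize the local list is pickled into the tasks themselves, so a list of a million rows ends up as a roughly 10 MB task. As a minimal sketch (not something from this thread), one way to keep each task small is to pass a larger numSlices to sc.parallelize; this assumes the sc from the pyspark shell, and the count of 100 partitions is only an illustrative number:

# Sketch only: spread the same list over more partitions so that each
# serialized task stays small. The partition count of 100 is an arbitrary example.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [a for _ in range(1000000)]           # same data as in the thread
rdd1 = sc.parallelize(b, numSlices=100)   # 1M rows split across 100 tasks
print(rdd1.getNumPartitions())            # 100
print(rdd1.first())                       # [1, 2, 3, 4, 5, 6, 7, 8, 9]

Each task then carries only its own slice of the pickled data instead of most of the list, which is what the 100 KB-per-task recommendation in the warning is about.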
pyspark is crashing in this case. why?
Hi,

My environment is: standalone Spark 1.1.1 on Windows 8.1 Pro.

The following case works fine:

>>> a = [1,2,3,4,5,6,7,8,9]
>>> b = []
>>> for x in range(100000):
...     b.append(a)
...
>>> rdd1 = sc.parallelize(b)
>>> rdd1.first()
[1, 2, 3, 4, 5, 6, 7, 8, 9]

The following case does not work. The only difference is the size of the list. Note the loop range: 100K vs. 1M.

>>> a = [1,2,3,4,5,6,7,8,9]
>>> b = []
>>> for x in range(1000000):
...     b.append(a)
...
>>> rdd1 = sc.parallelize(b)
>>> rdd1.first()
14/12/14 07:52:19 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(Unknown Source)
        at java.net.SocketOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.write(Unknown Source)
        at java.io.DataOutputStream.write(Unknown Source)
        at java.io.FilterOutputStream.write(Unknown Source)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:341)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:339)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

What I have tried:
1. Replaced the 32-bit JRE with a 64-bit JRE
2. Multiple configurations when I start pyspark: --driver-memory, --executor-memory
3. Tried to set the SparkConf with different settings (see the sketch after this message)
4. Tried also with Spark 1.1.0

Being new to Spark, I am sure it is something simple that I am missing and would appreciate any thoughts.
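Regarding item 3 in the list above, here is a rough sketch, not the poster's actual settings, of passing memory options through SparkConf in a standalone script (in the interactive shell an sc already exists, so this would not apply there). The 2g values are arbitrary examples, and spark.driver.memory generally only takes effect if it is set before the driver JVM starts, which is why the --driver-memory flag on the pyspark command line is usually the more reliable route:

# Illustrative sketch only -- the values are examples, not a known fix.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")               # adjust to the actual master URL
        .setAppName("parallelize-repro")
        .set("spark.executor.memory", "2g")  # executor JVM heap (example value)
        .set("spark.driver.memory", "2g"))   # only honored if set before the driver JVM starts
sc = SparkContext(conf=conf)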
Re: pyspark is crashing in this case. why?
How much executor memory are you setting for the JVM? What about the driver JVM memory? Also check the Windows Event Log for out-of-memory errors from either of those two JVMs.