Re: pyspark is crashing in this case. why?

2014-12-15 Thread Sameer Farooqui
Adding group back.


FYI Genesis - this was on an m3.xlarge with all default settings in Spark. I
used Spark version 1.3.0.

The 2nd case did work for me:

 a = [1,2,3,4,5,6,7,8,9]
 b = []
 for x in range(1000000):
...   b.append(a)
...
 rdd1 = sc.parallelize(b)
 rdd1.first()
14/12/15 16:33:01 WARN TaskSetManager: Stage 1 contains a task of very
large size (9766 KB). The maximum recommended task size is 100 KB.
[1, 2, 3, 4, 5, 6, 7, 8, 9]
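
If the large-task warning itself is a concern, one workaround (just a sketch,
not tested on this exact setup; the 200-partition figure is arbitrary) is to
spread b over more partitions, or to build the rows on the executors so that
only the small 9-element list travels with each task:

a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [a for _ in range(1000000)]

# Option 1: more partitions, so each task ships a smaller slice of b.
rdd1 = sc.parallelize(b, numSlices=200)

# Option 2: parallelize only the indices and rebuild the small list per element.
rdd2 = sc.parallelize(range(1000000), 200).map(lambda _: a)
rdd2.first()  # [1, 2, 3, 4, 5, 6, 7, 8, 9]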


On Mon, Dec 15, 2014 at 1:33 PM, Sameer Farooqui same...@databricks.com
wrote:

 Hi Genesis,


 The 2nd case did work for me:

  a = [1,2,3,4,5,6,7,8,9]
  b = []
  for x in range(1000000):
 ...   b.append(a)
 ...
  rdd1 = sc.parallelize(b)
  rdd1.first()
 14/12/15 16:33:01 WARN TaskSetManager: Stage 1 contains a task of very
 large size (9766 KB). The maximum recommended task size is 100 KB.
 [1, 2, 3, 4, 5, 6, 7, 8, 9]




 On Sun, Dec 14, 2014 at 2:13 PM, Genesis Fatum genesis.fa...@gmail.com
 wrote:

 Hi Sameer,

 I have tried multiple configurations, for example executor and driver memory
 at 2G. I also played with the JRE memory size parameters (-Xms) and get the
 same error.
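
 For reference, a sketch of how those values are usually passed when launching
 the pyspark shell (the 2g figures are simply the ones mentioned above, not a
 known fix; on Windows the launcher is typically bin\pyspark.cmd):

  bin\pyspark --driver-memory 2g --executor-memory 2g

 Setting -Xms by hand may not reach the JVM that matters here, since the shell
 starts its own driver JVM through spark-submit.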

 Does it work for you? I think it is a setup issue on my side, although I
 have tried a couple of laptops.

 Thanks

 On Sun, Dec 14, 2014 at 1:11 PM, Sameer Farooqui same...@databricks.com
 wrote:

 How much executor-memory are you setting for the JVM? What about the
 Driver JVM memory?

 Also check the Windows Event Log for out-of-memory errors from either of the
 two JVMs above.
 On Dec 14, 2014 6:04 AM, genesis fatum genesis.fa...@gmail.com
 wrote:

 Hi,

 My environment is: standalone Spark 1.1.1 on Windows 8.1 Pro.

 The following case works fine:
  a = [1,2,3,4,5,6,7,8,9]
  b = []
  for x in range(100000):
 ...  b.append(a)
 ...
  rdd1 = sc.parallelize(b)
  rdd1.first()
 [1, 2, 3, 4, 5, 6, 7, 8, 9]

 The following case does not work. The only difference is the size of the
 list b being parallelized. Note the loop range: 100K vs. 1M.
  a = [1,2,3,4,5,6,7,8,9]
  b = []
  for x in range(1000000):
 ...  b.append(a)
 ...
  rdd1 = sc.parallelize(b)
  rdd1.first()
 
 14/12/14 07:52:19 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
 java.net.SocketException: Connection reset by peer: socket write error
   at java.net.SocketOutputStream.socketWrite0(Native Method)
   at java.net.SocketOutputStream.socketWrite(Unknown Source)
   at java.net.SocketOutputStream.write(Unknown Source)
   at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
   at java.io.BufferedOutputStream.write(Unknown Source)
   at java.io.DataOutputStream.write(Unknown Source)
   at java.io.FilterOutputStream.write(Unknown Source)
   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:341)
   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:339)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
   at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

 What I have tried:
 1. Replaced the 32-bit JRE with a 64-bit JRE
 2. Multiple configurations when I start pyspark: --driver-memory,
 --executor-memory
 3. Tried to set the SparkConf with different settings
 4. Also tried with Spark 1.1.0

 Being new to Spark, I am sure that it is something simple that I am
 missing
 and would appreciate any thoughts.








pyspark is crashing in this case. why?

2014-12-14 Thread genesis fatum
Hi,

My environment is: standalone Spark 1.1.1 on Windows 8.1 Pro.

The following case works fine:
 a = [1,2,3,4,5,6,7,8,9]
 b = []
 for x in range(100000):
...  b.append(a)
...
 rdd1 = sc.parallelize(b)
 rdd1.first()
[1, 2, 3, 4, 5, 6, 7, 8, 9]

The following case does not work. The only difference is the size of the
list b being parallelized. Note the loop range: 100K vs. 1M.
 a = [1,2,3,4,5,6,7,8,9]
 b = []
 for x in range(1000000):
...  b.append(a)
...
 rdd1 = sc.parallelize(b)
 rdd1.first()

14/12/14 07:52:19 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
  at java.net.SocketOutputStream.socketWrite0(Native Method)
  at java.net.SocketOutputStream.socketWrite(Unknown Source)
  at java.net.SocketOutputStream.write(Unknown Source)
  at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
  at java.io.BufferedOutputStream.write(Unknown Source)
  at java.io.DataOutputStream.write(Unknown Source)
  at java.io.FilterOutputStream.write(Unknown Source)
  at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:341)
  at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:339)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
  at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
  at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
  at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
  at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

What I have tried:
1. Replaced the 32-bit JRE with a 64-bit JRE
2. Multiple configurations when I start pyspark: --driver-memory,
--executor-memory
3. Tried to set the SparkConf with different settings (a sketch of this route
is below)
4. Also tried with Spark 1.1.0
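
A minimal sketch of the SparkConf route from item 3, written as a standalone
script rather than the interactive shell (the master URL, app name and the 2g
figure are illustrative assumptions, not a known fix). Note that driver memory
set this way may not take effect once a JVM is already running, so
--driver-memory at launch is the safer route for the shell:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")            # assumed local master
        .setAppName("parallelize-test")   # hypothetical app name
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)

a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [a for _ in range(1000000)]
rdd1 = sc.parallelize(b, 200)  # extra partitions keep each task small
print(rdd1.first())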

Being new to Spark, I am sure that it is something simple that I am missing
and would appreciate any thoughts.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-is-crashing-in-this-case-why-tp20675.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pyspark is crashing in this case. why?

2014-12-14 Thread Sameer Farooqui
How much executor-memory are you setting for the JVM? What about the Driver
JVM memory?

Also check the Windows Event Log for out-of-memory errors from either of the
two JVMs above.
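
From inside the running shell, a quick way to see what actually took effect is
to read the settings back from the context (sc._conf is an internal attribute,
so treat this as an assumption rather than a public API):

 sc._conf.get("spark.driver.memory", "<not set>")
 sc._conf.get("spark.executor.memory", "<not set>")
 sc._conf.getAll()  # everything the driver's configuration currently holds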
On Dec 14, 2014 6:04 AM, genesis fatum genesis.fa...@gmail.com wrote:

 Hi,

 My environment is: standalone Spark 1.1.1 on Windows 8.1 Pro.

 The following case works fine:
  a = [1,2,3,4,5,6,7,8,9]
  b = []
  for x in range(100000):
 ...  b.append(a)
 ...
  rdd1 = sc.parallelize(b)
  rdd1.first()
 [1, 2, 3, 4, 5, 6, 7, 8, 9]

 The following case does not work. The only difference is the size of the
 list b being parallelized. Note the loop range: 100K vs. 1M.
  a = [1,2,3,4,5,6,7,8,9]
  b = []
  for x in range(1000000):
 ...  b.append(a)
 ...
  rdd1 = sc.parallelize(b)
  rdd1.first()
 
 14/12/14 07:52:19 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
 java.net.SocketException: Connection reset by peer: socket write error
   at java.net.SocketOutputStream.socketWrite0(Native Method)
   at java.net.SocketOutputStream.socketWrite(Unknown Source)
   at java.net.SocketOutputStream.write(Unknown Source)
   at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
   at java.io.BufferedOutputStream.write(Unknown Source)
   at java.io.DataOutputStream.write(Unknown Source)
   at java.io.FilterOutputStream.write(Unknown Source)
   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:341)
   at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:339)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
   at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
   at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

 What I have tried:
 1. Replaced the 32-bit JRE with a 64-bit JRE
 2. Multiple configurations when I start pyspark: --driver-memory,
 --executor-memory
 3. Tried to set the SparkConf with different settings
 4. Also tried with Spark 1.1.0

 Being new to Spark, I am sure that it is something simple that I am missing
 and would appreciate any thoughts.



