Re: Spark submit OutOfMemory Error in local mode

2017-08-29 Thread muthu
Are you getting the OutOfMemory error on the driver or on the executor? A typical cause
of OOM in Spark is having too few tasks (partitions) for a job, so each task has to hold too much data in memory.
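
A minimal PySpark sketch of that idea (the dataset, partition count, and output path are made-up placeholders, not from this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()
df = spark.range(0, 10000000)          # stand-in for the real dataset
print(df.rdd.getNumPartitions())       # check how many tasks a stage would get
df = df.repartition(200)               # more, smaller tasks lower per-task memory pressure
df.write.mode("overwrite").parquet("/tmp/repartition-sketch")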






Re: Spark submit OutOfMemory Error in local mode

2017-08-22 Thread Naga G
Increase the number of cores, since you're trying to run multiple threads.
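
A minimal sketch of what that could look like when building the session for local mode (the core count and memory value are placeholders, not taken from this thread):

from pyspark.sql import SparkSession

# In local mode the driver and executors share a single JVM, so the driver heap
# and the local[N] thread count are the main knobs.
spark = (
    SparkSession.builder
    .appName("local-mode-sketch")
    .master("local[4]")        # run 4 worker threads
    .getOrCreate()
)
# Note: spark.driver.memory only takes effect if set before the JVM starts,
# e.g. via spark-submit --driver-memory 4g rather than from inside the script.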

Sent from Naga iPad

> On Aug 22, 2017, at 3:26 PM, "u...@moosheimer.com" <u...@moosheimer.com> 
> wrote:
> 
> Since you didn't post any concrete information it's hard to give you an 
> advice.
> 
> Try to increase the executor memory (spark.executor.memory).
> If that doesn't help give all the experts in the community a chance to help 
> you by adding more details like version, logfile, source etc
> 
> Mit freundlichen Grüßen / best regards
> Kay-Uwe Moosheimer
> 
>> Am 22.08.2017 um 20:16 schrieb shitijkuls <kulshreshth...@gmail.com>:
>> 
>> Any help here will be appreciated.
>> 
>> 
>> 



Re: Spark submit OutOfMemory Error in local mode

2017-08-22 Thread u...@moosheimer.com
Since you didn't post any concrete information, it's hard to give you advice.

Try to increase the executor memory (spark.executor.memory).
If that doesn't help, give the experts in the community a chance to help you
by adding more details such as the Spark version, log files, source code, etc.
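
A minimal sketch of that first suggestion (the app name and memory value are placeholders; note that in local mode the executors run inside the driver JVM, so the driver heap is what actually matters):

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("memory-tuning-sketch")
    .set("spark.executor.memory", "4g")   # executor heap, relevant on a cluster
)
sc = SparkContext(conf=conf)
# The driver heap (spark.driver.memory) has to be set before the driver JVM
# starts, e.g. spark-submit --driver-memory 4g.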

Mit freundlichen Grüßen / best regards
Kay-Uwe Moosheimer

> Am 22.08.2017 um 20:16 schrieb shitijkuls <kulshreshth...@gmail.com>:
> 
> Any help here will be appreciated.
> 
> 
> 



Re: Spark submit OutOfMemory Error in local mode

2017-08-22 Thread shitijkuls
Any help here will be appreciated.






OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Tóth
Hi,

When I execute the Spark ML Logistic Regression example in pyspark I run
into an OutOfMemory exception. I'm wondering if any of you have experienced the
same or have a hint about how to fix this.

The interesting bit is that I only get the exception when I try to write
the result DataFrame into a file. If I only "print" any of the results, it
all works fine.

My Setup:
Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
nightly build)
Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
-DzincPort=3034

I'm using the default resource setup
15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
capability: )
15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
capability: )
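
As a hedged aside, if those default container sizes ever needed to be raised, the usual knobs would be something like the following (the values are illustrative only, not a fix confirmed in this thread):

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("pysparktest")
    .set("spark.executor.memory", "2g")                # per-executor heap
    .set("spark.yarn.executor.memoryOverhead", "512")  # off-heap overhead per container, in MB
)
sc = SparkContext(conf=conf)
# In yarn-cluster mode the driver/AM memory has to be set at submit time,
# e.g. spark-submit --driver-memory 2g, not from inside the script.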

The script I'm executing:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("pysparktest")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vector, Vectors

training = sc.parallelize((
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))

training_df = training.toDF()

from pyspark.ml.classification import LogisticRegression

reg = LogisticRegression()

reg.setMaxIter(10).setRegParam(0.01)
model = reg.fit(training.toDF())

test = sc.parallelize((
  LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))

out_df = model.transform(test.toDF())

out_df.write.parquet("/tmp/logparquet")

And the command:
spark-submit --master yarn --deploy-mode cluster spark-ml.py

Thanks,
z


Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Tóth
Aaand, the error! :)

Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"org.apache.hadoop.hdfs.PeerCache@4e000abf"
Exception in thread "Thread-7"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Thread-7"
Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Reporter"
Exception in thread "qtp2115718813-47"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "qtp2115718813-47"

Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"

Log Type: stdout

Log Upload Time: Mon Sep 07 09:03:01 -0400 2015

Log Length: 986

Traceback (most recent call last):
  File "spark-ml.py", line 33, in 
out_df.write.parquet("/tmp/logparquet")
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
line 422, in parquet
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
line 36, in deco
  File 
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError



On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth  wrote:

> Hi,
>
> When I execute the Spark ML Logisitc Regression example in pyspark I run
> into an OutOfMemory exception. I'm wondering if any of you experienced the
> same or has a hint about how to fix this.
>
> The interesting bit is that I only get the exception when I try to write
> the result DataFrame into a file. If I only "print" any of the results, it
> all works fine.
>
> My Setup:
> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
> nightly build)
> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
> -DzincPort=3034
>
> I'm using the default resource setup
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: )
> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: )
>
> The script I'm executing:
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext
>
> conf = SparkConf().setAppName("pysparktest")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
>
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.linalg import Vector, Vectors
>
> training = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>   LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
>   LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5
>
> training_df = training.toDF()
>
> from pyspark.ml.classification import LogisticRegression
>
> reg = LogisticRegression()
>
> reg.setMaxIter(10).setRegParam(0.01)
> model = reg.fit(training.toDF())
>
> test = sc.parallelize((
>   LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
>   LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
>   LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5
>
> out_df = model.transform(test.toDF())
>
> out_df.write.parquet("/tmp/logparquet")
>
> And the command:
> spark-submit --master yarn --deploy-mode cluster spark-ml.py
>
> Thanks,
> z
>


Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zoltán Zvara
Hey, I'd try to debug and profile ResolvedDataSource. As far as I know, your
write will be performed by the JVM.

On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán  wrote:

> Unfortunately I'm getting the same error:
> The other interesting things are that:
>  - the parquet files got actually written to HDFS (also with
> .write.parquet() )
>  - the application gets stuck in the RUNNING state for good even after the
> error is thrown
>
> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
> 15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
> Exception in thread "Thread-7"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Thread-7"
> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread 
> "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception in thread "Reporter"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Reporter"
> Exception in thread "qtp2134582502-46"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "qtp2134582502-46"
>
>
>
>
> On Mon, Sep 7, 2015 at 3:48 PM, boci  wrote:
>
>> Hi,
>>
>> Can you try to using save method instead of write?
>>
>> ex: out_df.save("path","parquet")
>>
>> b0c1
>>
>>
>> --
>> Skype: boci13, Hangout: boci.b...@gmail.com
>>
>> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth 
>> wrote:
>>
>>> Aaand, the error! :)
>>>
>>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread 
>>> "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>>> Exception in thread "Thread-7"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "Thread-7"
>>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread 
>>> "LeaseRenewer:r...@docker.rapidminer.com:8020"
>>> Exception in thread "Reporter"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "Reporter"
>>> Exception in thread "qtp2115718813-47"
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "qtp2115718813-47"
>>>
>>> Exception: java.lang.OutOfMemoryError thrown from the 
>>> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>>>
>>> Log Type: stdout
>>>
>>> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>>>
>>> Log Length: 986
>>>
>>> Traceback (most recent call last):
>>>   File "spark-ml.py", line 33, in 
>>> out_df.write.parquet("/tmp/logparquet")
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>>>  line 422, in parquet
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>>  line 538, in __call__
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
>>>  line 36, in deco
>>>   File 
>>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>>  line 300, in get_return_value
>>> py4j.protocol.Py4JJavaError
>>>
>>>
>>>
>>> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth 
>>> wrote:
>>>
 Hi,

 When I execute the Spark ML Logisitc Regression example in pyspark I
 run into an OutOfMemory exception. I'm wondering if any of you experienced
 the same or has a hint about how to fix this.

 The interesting bit is that I only get the exception when I try to
 write the result DataFrame into a file. If I only "print" any of the
 results, it all works fine.

 My Setup:
 Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the
 latest nightly build)
 Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
 -DzincPort=3034

 I'm using the default resource setup
 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread boci
Hi,

Can you try using the save method instead of write?

ex: out_df.save("path","parquet")
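
A minimal equivalent using the DataFrameWriter API, assuming the same out_df and an illustrative path:

out_df.write.format("parquet").save("path")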

b0c1

--
Skype: boci13, Hangout: boci.b...@gmail.com

On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth  wrote:

> Aaand, the error! :)
>
> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
> Exception in thread "Thread-7"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Thread-7"
> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread 
> "LeaseRenewer:r...@docker.rapidminer.com:8020"
> Exception in thread "Reporter"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Reporter"
> Exception in thread "qtp2115718813-47"
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "qtp2115718813-47"
>
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>
> Log Type: stdout
>
> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>
> Log Length: 986
>
> Traceback (most recent call last):
>   File "spark-ml.py", line 33, in 
> out_df.write.parquet("/tmp/logparquet")
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>  line 422, in parquet
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
>  line 36, in deco
>   File 
> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError
>
>
>
> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth  wrote:
>
>> Hi,
>>
>> When I execute the Spark ML Logisitc Regression example in pyspark I run
>> into an OutOfMemory exception. I'm wondering if any of you experienced the
>> same or has a hint about how to fix this.
>>
>> The interesting bit is that I only get the exception when I try to write
>> the result DataFrame into a file. If I only "print" any of the results, it
>> all works fine.
>>
>> My Setup:
>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
>> nightly build)
>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
>> -DzincPort=3034
>>
>> I'm using the default resource setup
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
>> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>> capability: )
>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>> capability: )
>>
>> The script I'm executing:
>> from pyspark import SparkContext, SparkConf
>> from pyspark.sql import SQLContext
>>
>> conf = SparkConf().setAppName("pysparktest")
>> sc = SparkContext(conf=conf)
>> sqlContext = SQLContext(sc)
>>
>> from pyspark.mllib.regression import LabeledPoint
>> from pyspark.mllib.linalg import Vector, Vectors
>>
>> training = sc.parallelize((
>>   LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
>>   LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
>>   LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
>>   LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5
>>
>> training_df = training.toDF()
>>
>> from pyspark.ml.classification import LogisticRegression
>>
>> reg = LogisticRegression()
>>
>> reg.setMaxIter(10).setRegParam(0.01)
>> model = reg.fit(training.toDF())
>>
>> test = sc.parallelize((
>>   LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
>>   LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
>>   LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5
>>
>> out_df = model.transform(test.toDF())
>>
>> out_df.write.parquet("/tmp/logparquet")
>>
>> And the command:
>> spark-submit --master yarn --deploy-mode cluster spark-ml.py
>>
>> Thanks,
>> z
>>
>
>


Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Tóth Zoltán
Unfortunately I'm getting the same error.
The other interesting things are that:
 - the parquet files actually got written to HDFS (also with .write.parquet())
 - the application gets stuck in the RUNNING state for good even after the
error is thrown

15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
Exception in thread "Thread-7"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Thread-7"
Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"org.apache.hadoop.hdfs.PeerCache@4070d501"
Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
"LeaseRenewer:r...@docker.rapidminer.com:8020"
Exception in thread "Reporter"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "Reporter"
Exception in thread "qtp2134582502-46"
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "qtp2134582502-46"




On Mon, Sep 7, 2015 at 3:48 PM, boci  wrote:

> Hi,
>
> Can you try to using save method instead of write?
>
> ex: out_df.save("path","parquet")
>
> b0c1
>
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>
> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth  wrote:
>
>> Aaand, the error! :)
>>
>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "org.apache.hadoop.hdfs.PeerCache@4e000abf"
>> Exception in thread "Thread-7"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Thread-7"
>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception in thread "Reporter"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Reporter"
>> Exception in thread "qtp2115718813-47"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "qtp2115718813-47"
>>
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"
>>
>> Log Type: stdout
>>
>> Log Upload Time: Mon Sep 07 09:03:01 -0400 2015
>>
>> Log Length: 986
>>
>> Traceback (most recent call last):
>>   File "spark-ml.py", line 33, in 
>> out_df.write.parquet("/tmp/logparquet")
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
>>  line 422, in parquet
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>  line 538, in __call__
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
>>  line 36, in deco
>>   File 
>> "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>  line 300, in get_return_value
>> py4j.protocol.Py4JJavaError
>>
>>
>>
>> On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth 
>> wrote:
>>
>>> Hi,
>>>
>>> When I execute the Spark ML Logisitc Regression example in pyspark I run
>>> into an OutOfMemory exception. I'm wondering if any of you experienced the
>>> same or has a hint about how to fix this.
>>>
>>> The interesting bit is that I only get the exception when I try to write
>>> the result DataFrame into a file. If I only "print" any of the results, it
>>> all works fine.
>>>
>>> My Setup:
>>> Spark 1.5.0-SNAPSHOT built for Hadoop 2.6.0 (I'm working with the latest
>>> nightly build)
>>> Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
>>> -DzincPort=3034
>>>
>>> I'm using the default resource setup
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Will request 2 executor
>>> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>>> capability: )
>>> 15/09/07 08:49:04 INFO yarn.YarnAllocator: Container request (host: Any,
>>> capability: )
>>>
>>> 

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zsolt Tóth
Hi,

I ran your example on Spark 1.4.1 and 1.5.0-rc3. It succeeds on 1.4.1 but
throws the OOM on 1.5.0. Do any of you know which PR introduced this issue?

Zsolt


2015-09-07 16:33 GMT+02:00 Zoltán Zvara :

> Hey, I'd try to debug, profile ResolvedDataSource. As far as I know, your
> write will be performed by the JVM.
>
> On Mon, Sep 7, 2015 at 4:11 PM Tóth Zoltán  wrote:
>
>> Unfortunately I'm getting the same error:
>> The other interesting things are that:
>>  - the parquet files got actually written to HDFS (also with
>> .write.parquet() )
>>  - the application gets stuck in the RUNNING state for good even after
>> the error is thrown
>>
>> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 19
>> 15/09/07 10:01:10 INFO spark.ContextCleaner: Cleaned accumulator 5
>> 15/09/07 10:01:12 INFO spark.ContextCleaner: Cleaned accumulator 20
>> Exception in thread "Thread-7"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Thread-7"
>> Exception in thread "org.apache.hadoop.hdfs.PeerCache@4070d501"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "org.apache.hadoop.hdfs.PeerCache@4070d501"
>> Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread 
>> "LeaseRenewer:r...@docker.rapidminer.com:8020"
>> Exception in thread "Reporter"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "Reporter"
>> Exception in thread "qtp2134582502-46"
>> Exception: java.lang.OutOfMemoryError thrown from the 
>> UncaughtExceptionHandler in thread "qtp2134582502-46"
>>
>>
>>
>>
>> On Mon, Sep 7, 2015 at 3:48 PM, boci  wrote:
>>
>>> Hi,
>>>
>>> Can you try to using save method instead of write?
>>>
>>> ex: out_df.save("path","parquet")
>>>
>>> b0c1
>>>
>>>
>>> --
>>> Skype: boci13, Hangout: boci.b...@gmail.com
>>>
>>> On Mon, Sep 7, 2015 at 3:35 PM, Zoltán Tóth 
>>> wrote:
>>>
 Aaand, the error! :)

 Exception in thread "org.apache.hadoop.hdfs.PeerCache@4e000abf"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread 
 "org.apache.hadoop.hdfs.PeerCache@4e000abf"
 Exception in thread "Thread-7"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "Thread-7"
 Exception in thread "LeaseRenewer:r...@docker.rapidminer.com:8020"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread 
 "LeaseRenewer:r...@docker.rapidminer.com:8020"
 Exception in thread "Reporter"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "Reporter"
 Exception in thread "qtp2115718813-47"
 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "qtp2115718813-47"

 Exception: java.lang.OutOfMemoryError thrown from the 
 UncaughtExceptionHandler in thread "sparkDriver-scheduler-1"

 Log Type: stdout

 Log Upload Time: Mon Sep 07 09:03:01 -0400 2015

 Log Length: 986

 Traceback (most recent call last):
   File "spark-ml.py", line 33, in 
 out_df.write.parquet("/tmp/logparquet")
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/readwriter.py",
  line 422, in parquet
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
  line 538, in __call__
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/pyspark.zip/pyspark/sql/utils.py",
  line 36, in deco
   File 
 "/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1441224592929_0022/container_1441224592929_0022_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError



 On Mon, Sep 7, 2015 at 3:27 PM, Zoltán Tóth 
 wrote:

> Hi,
>
> When I execute the Spark ML Logisitc Regression example in pyspark I
> run into an OutOfMemory exception. I'm wondering if any of you experienced
> the same or has a hint about how to fix this.
>
> The interesting bit is that I only get the exception when I try to
> write the result DataFrame into a file. If I only "print" any of the
> 

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
(
  s"ERROR: Got new count $currentCount < 0, value:$values, state:$state, resetting to 0"
)
      currentCount = 0
    }
    // to stop pushing subsequent 0 after receiving first 0
    if (currentCount == 0 && previousCount == 0) None
    else Some(previousCount, currentCount)
  }

  trait IConcurrentUsers {
    val count: Long
    def op(a: IConcurrentUsers): IConcurrentUsers = IConcurrentUsers.op(this, a)
  }

  object IConcurrentUsers {
    def op(a: IConcurrentUsers, b: IConcurrentUsers): IConcurrentUsers = (a, b) match {
      case (_, _: ConcurrentViewers) =>
        ConcurrentViewers(b.count)
      case (_: ConcurrentViewers, _: IncrementConcurrentViewers) =>
        ConcurrentViewers(a.count + b.count)
      case (_: ConcurrentViewers, _: DecrementConcurrentViewers) =>
        ConcurrentViewers(a.count - b.count)
    }
  }

  case class IncrementConcurrentViewers(count: Long) extends IConcurrentUsers
  case class DecrementConcurrentViewers(count: Long) extends IConcurrentUsers
  case class ConcurrentViewers(count: Long) extends IConcurrentUsers
 
 
  Also, the error stack trace copied from executor logs is:

  java.lang.OutOfMemoryError: Java heap space
          at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
          at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2564)
          at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
          at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
          at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43)
          at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
          at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:601)
          at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
          at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
          at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
          at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
          at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:236)
          at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readObject$1.apply$mcV$sp(TorrentBroadcast.scala:169)
          at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
          at org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
          at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:601)
          at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
          at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
          at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
          at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)
          at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)
          at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
  15/04/21 15:51:23 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
 
 
 
  On Wed, Apr 22, 2015 at 1:32 AM, Olivier Girardot ssab...@gmail.com
  wrote:
 
  Hi Sourav,
  Can you post your updateFunc as well please ?
 
  Regards,
 
  Olivier.
 
  Le mar. 21 avr. 2015 à 12:48, Sourav Chandra 
  sourav.chan...@livestream.com a écrit :
 
  Hi,
 
  We are building a spark streaming application which reads from kafka,
  does updateStateBykey based on the received message type and finally
 stores
  into redis.
 
  After running for few seconds the executor process get killed by
  throwing OutOfMemory error.
 
  The code snippet is below:
 
 
  *NoOfReceiverInstances = 1*
 
  *val kafkaStreams = (1 to NoOfReceiverInstances).map

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-23 Thread Sourav Chandra
 on the received message type and finally
 stores
  into redis.
 
  After running for few seconds the executor process get killed by
  throwing OutOfMemory error.
 
  The code snippet is below:
 
 
  *NoOfReceiverInstances = 1*
 
  *val kafkaStreams = (1 to NoOfReceiverInstances).map(*
  *  _ = KafkaUtils.createStream(ssc, ZKQuorum, ConsumerGroup,
 TopicsMap)*
  *)*
  *val updateFunc = (values: Seq[IConcurrentUsers], state:
 Option[(Long,
  Long)]) = {...}*
 
 
 
 *ssc.union(kafkaStreams).map(KafkaMessageMapper(_)).filter(...)..updateStateByKey(updateFunc).foreachRDD(_.foreachPartition(RedisHelper.update(_)))*
 
 
 
  *object RedisHelper {*
  *  private val client = scredis.Redis(*
  *
 
 ConfigFactory.parseProperties(System.getProperties).getConfig(namespace)*
  *  )*
 
  *  def update(**itr: Iterator[(String, (Long, Long))]) {*
  *// redis save operation*
  *  }*
 
  *}*
 
 
  *Below is the spark configuration:*
 
 
  *spark.app.name http://spark.app.name = XXX*
  *spark.jars = .jar*
  *spark.home = /spark-1.1.1-bin-hadoop2.4*
  *spark.executor.memory = 1g*
  *spark.streaming.concurrentJobs = 1000*
  *spark.logConf = true*
  *spark.cleaner.ttl = 3600 //in milliseconds*
  *spark.default.parallelism = 12*
  *spark.executor.extraJavaOptions = -Xloggc:gc.log
  -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:HeapDumpPath=1.hprof
  -XX:+HeapDumpOnOutOfMemoryError*
  *spark.executor.logs.rolling.strategy = size*
  *spark.executor.logs.rolling.size.maxBytes = 104857600 // 100 MB*
  *spark.executor.logs.rolling.maxRetainedFiles = 10*
  *spark.serializer = org.apache.spark.serializer.KryoSerializer*
  *spark.kryo.registrator = xxx.NoOpKryoRegistrator*
 
 
  other configurations are below
 
  *streaming {*
  *// All streaming context related configs should come here*
  *batch-duration = 1 second*
  *checkpoint-directory = /tmp*
  *checkpoint-duration = 10 seconds*
  *slide-duration = 1 second*
  *window-duration = 1 second*
  *partitions-for-shuffle-task = 32*
  *  }*
  *  kafka {*
  *no-of-receivers = 1*
  *zookeeper-quorum = :2181*
  *consumer-group = x*
  *topic = x:2*
  *  }*

 
  We tried different combinations like
   - with spark 1.1.0 and 1.1.1.
   - by increasing executor memory
   - by changing the serialization strategy (switching between kryo and
  normal java)
   - by changing broadcast strategy (switching between http and torrent
  broadcast)
 
 
  Can anyone give any insight what we are missing here? How can we fix
  this?
 
  Due to akka version mismatch with some other libraries we cannot
 upgrade
  the spark version.
 
  Thanks,
  --
 
  Sourav Chandra
 
  Senior Software Engineer
 
  · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
 
  sourav.chan...@livestream.com
 
  o: +91 80 4121 8723
 
  m: +91 988 699 3746
 
  skype: sourav.chandra
 
  Livestream
 
  Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main,
 3rd
  Block, Koramangala Industrial Area,
 
  Bangalore 560034
 
  www.livestream.com
 
 
 
 
  --
 
  Sourav Chandra
 
  Senior Software Engineer
 
  · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
 
  sourav.chan...@livestream.com
 
  o: +91 80 4121 8723
 
  m: +91 988 699 3746
 
  skype: sourav.chandra
 
  Livestream
 
  Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
  Block, Koramangala Industrial Area,
 
  Bangalore 560034
 
  www.livestream.com
 



 --

 Sourav Chandra

 Senior Software Engineer

 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

 sourav.chan...@livestream.com

 o: +91 80 4121 8723

 m: +91 988 699 3746

 skype: sourav.chandra

 Livestream

 Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
 Block, Koramangala Industrial Area,

 Bangalore 560034

 www.livestream.com





 --

 Sourav Chandra

 Senior Software Engineer

 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

 sourav.chan...@livestream.com

 o: +91 80 4121 8723

 m: +91 988 699 3746

 skype: sourav.chandra

 Livestream

 Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
 Block, Koramangala Industrial Area,

 Bangalore 560034

 www.livestream.com




-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
)*
  *at
 
 org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)*
  *at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)*
  *at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)*
  *at java.lang.reflect.Method.invoke(Method.java:601)*
  *at
  java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)*
  *at
  java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)*
  *at
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)*
  *at
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)*
  *at
  java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)*
  *at
  java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)*
  *at
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)*
  *at
  java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)*
  *15/04/21 15:51:23 ERROR ExecutorUncaughtExceptionHandler: Uncaught
  exception in thread Thread[Executor task launch worker-1,5,main]*
 
 
 
  On Wed, Apr 22, 2015 at 1:32 AM, Olivier Girardot ssab...@gmail.com
  wrote:
 
  Hi Sourav,
  Can you post your updateFunc as well please ?
 
  Regards,
 
  Olivier.
 
  Le mar. 21 avr. 2015 à 12:48, Sourav Chandra 
  sourav.chan...@livestream.com a écrit :
 
  Hi,
 
  We are building a spark streaming application which reads from kafka,
  does updateStateBykey based on the received message type and finally
 stores
  into redis.
 
  After running for few seconds the executor process get killed by
  throwing OutOfMemory error.
 
  The code snippet is below:
 
 
  *NoOfReceiverInstances = 1*
 
  *val kafkaStreams = (1 to NoOfReceiverInstances).map(*
  *  _ = KafkaUtils.createStream(ssc, ZKQuorum, ConsumerGroup,
 TopicsMap)*
  *)*
  *val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long,
  Long)]) = {...}*
 
 
 
 *ssc.union(kafkaStreams).map(KafkaMessageMapper(_)).filter(...)..updateStateByKey(updateFunc).foreachRDD(_.foreachPartition(RedisHelper.update(_)))*
 
 
 
  *object RedisHelper {*
  *  private val client = scredis.Redis(*
  *
 
 ConfigFactory.parseProperties(System.getProperties).getConfig(namespace)*
  *  )*
 
  *  def update(**itr: Iterator[(String, (Long, Long))]) {*
  *// redis save operation*
  *  }*
 
  *}*
 
 
  *Below is the spark configuration:*
 
 
  *spark.app.name http://spark.app.name = XXX*
  *spark.jars = .jar*
  *spark.home = /spark-1.1.1-bin-hadoop2.4*
  *spark.executor.memory = 1g*
  *spark.streaming.concurrentJobs = 1000*
  *spark.logConf = true*
  *spark.cleaner.ttl = 3600 //in milliseconds*
  *spark.default.parallelism = 12*
  *spark.executor.extraJavaOptions = -Xloggc:gc.log
  -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:HeapDumpPath=1.hprof
  -XX:+HeapDumpOnOutOfMemoryError*
  *spark.executor.logs.rolling.strategy = size*
  *spark.executor.logs.rolling.size.maxBytes = 104857600 // 100 MB*
  *spark.executor.logs.rolling.maxRetainedFiles = 10*
  *spark.serializer = org.apache.spark.serializer.KryoSerializer*
  *spark.kryo.registrator = xxx.NoOpKryoRegistrator*
 
 
  other configurations are below
 
  *streaming {*
  *// All streaming context related configs should come here*
  *batch-duration = 1 second*
  *checkpoint-directory = /tmp*
  *checkpoint-duration = 10 seconds*
  *slide-duration = 1 second*
  *window-duration = 1 second*
  *partitions-for-shuffle-task = 32*
  *  }*
  *  kafka {*
  *no-of-receivers = 1*
  *zookeeper-quorum = :2181*
  *consumer-group = x*
  *topic = x:2*
  *  }*
 
  We tried different combinations like
   - with spark 1.1.0 and 1.1.1.
   - by increasing executor memory
   - by changing the serialization strategy (switching between kryo and
  normal java)
   - by changing broadcast strategy (switching between http and torrent
  broadcast)
 
 
  Can anyone give any insight what we are missing here? How can we fix
  this?
 
  Due to akka version mismatch with some other libraries we cannot
 upgrade
  the spark version.
 
  Thanks,
  --
 
  Sourav Chandra
 
  Senior Software Engineer
 
  · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
 
  sourav.chan...@livestream.com
 
  o: +91 80 4121 8723
 
  m: +91 988 699 3746
 
  skype: sourav.chandra
 
  Livestream
 
  Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
  Block, Koramangala Industrial Area,
 
  Bangalore 560034
 
  www.livestream.com
 
 
 
 
  --
 
  Sourav Chandra
 
  Senior Software Engineer
 
  · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·
 
  sourav.chan...@livestream.com
 
  o: +91 80 4121 8723
 
  m: +91 988 699 3746
 
  skype: sourav.chandra
 
  Livestream
 
  Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
  Block, Koramangala Industrial Area,
 
  Bangalore 560034

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-22 Thread Sourav Chandra
)*
 *at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)*
 *at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)*
 *at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1964)*
 *at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1888)*
 *at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)*
 *at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)*
 *15/04/21 15:51:23 ERROR ExecutorUncaughtExceptionHandler: Uncaught
 exception in thread Thread[Executor task launch worker-1,5,main]*



 On Wed, Apr 22, 2015 at 1:32 AM, Olivier Girardot ssab...@gmail.com
 wrote:

 Hi Sourav,
 Can you post your updateFunc as well please ?

 Regards,

 Olivier.

 Le mar. 21 avr. 2015 à 12:48, Sourav Chandra 
 sourav.chan...@livestream.com a écrit :

 Hi,

 We are building a spark streaming application which reads from kafka,
 does updateStateBykey based on the received message type and finally stores
 into redis.

 After running for few seconds the executor process get killed by
 throwing OutOfMemory error.

 The code snippet is below:


 *NoOfReceiverInstances = 1*

 *val kafkaStreams = (1 to NoOfReceiverInstances).map(*
 *  _ = KafkaUtils.createStream(ssc, ZKQuorum, ConsumerGroup, TopicsMap)*
 *)*
 *val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long,
 Long)]) = {...}*


 *ssc.union(kafkaStreams).map(KafkaMessageMapper(_)).filter(...)..updateStateByKey(updateFunc).foreachRDD(_.foreachPartition(RedisHelper.update(_)))*



 *object RedisHelper {*
 *  private val client = scredis.Redis(*
 *
 ConfigFactory.parseProperties(System.getProperties).getConfig(namespace)*
 *  )*

 *  def update(**itr: Iterator[(String, (Long, Long))]) {*
 *// redis save operation*
 *  }*

 *}*


 *Below is the spark configuration:*


 *spark.app.name http://spark.app.name = XXX*
 *spark.jars = .jar*
 *spark.home = /spark-1.1.1-bin-hadoop2.4*
 *spark.executor.memory = 1g*
 *spark.streaming.concurrentJobs = 1000*
 *spark.logConf = true*
 *spark.cleaner.ttl = 3600 //in milliseconds*
 *spark.default.parallelism = 12*
 *spark.executor.extraJavaOptions = -Xloggc:gc.log
 -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:HeapDumpPath=1.hprof
 -XX:+HeapDumpOnOutOfMemoryError*
 *spark.executor.logs.rolling.strategy = size*
 *spark.executor.logs.rolling.size.maxBytes = 104857600 // 100 MB*
 *spark.executor.logs.rolling.maxRetainedFiles = 10*
 *spark.serializer = org.apache.spark.serializer.KryoSerializer*
 *spark.kryo.registrator = xxx.NoOpKryoRegistrator*


 other configurations are below

 *streaming {*
 *// All streaming context related configs should come here*
 *batch-duration = 1 second*
 *checkpoint-directory = /tmp*
 *checkpoint-duration = 10 seconds*
 *slide-duration = 1 second*
 *window-duration = 1 second*
 *partitions-for-shuffle-task = 32*
 *  }*
 *  kafka {*
 *no-of-receivers = 1*
 *zookeeper-quorum = :2181*
 *consumer-group = x*
 *topic = x:2*
 *  }*

 We tried different combinations like
  - with spark 1.1.0 and 1.1.1.
  - by increasing executor memory
  - by changing the serialization strategy (switching between kryo and
 normal java)
  - by changing broadcast strategy (switching between http and torrent
 broadcast)


 Can anyone give any insight what we are missing here? How can we fix
 this?

 Due to akka version mismatch with some other libraries we cannot upgrade
 the spark version.

 Thanks,
 --

 Sourav Chandra

 Senior Software Engineer

 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

 sourav.chan...@livestream.com

 o: +91 80 4121 8723

 m: +91 988 699 3746

 skype: sourav.chandra

 Livestream

 Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
 Block, Koramangala Industrial Area,

 Bangalore 560034

 www.livestream.com




 --

 Sourav Chandra

 Senior Software Engineer

 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

 sourav.chan...@livestream.com

 o: +91 80 4121 8723

 m: +91 988 699 3746

 skype: sourav.chandra

 Livestream

 Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
 Block, Koramangala Industrial Area,

 Bangalore 560034

 www.livestream.com




-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-21 Thread Sourav Chandra
Hi,

We are building a Spark Streaming application which reads from Kafka, does
updateStateByKey based on the received message type, and finally stores the
results into Redis.

After running for a few seconds, the executor process gets killed by an
OutOfMemory error.

The code snippet is below:


val NoOfReceiverInstances = 1

val kafkaStreams = (1 to NoOfReceiverInstances).map(
  _ => KafkaUtils.createStream(ssc, ZKQuorum, ConsumerGroup, TopicsMap)
)
val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long, Long)]) => {...}

ssc.union(kafkaStreams)
  .map(KafkaMessageMapper(_))
  .filter(...)
  .updateStateByKey(updateFunc)
  .foreachRDD(_.foreachPartition(RedisHelper.update(_)))

object RedisHelper {
  private val client = scredis.Redis(
    ConfigFactory.parseProperties(System.getProperties).getConfig(namespace)
  )

  def update(itr: Iterator[(String, (Long, Long))]) {
    // redis save operation
  }
}


Below is the Spark configuration:

spark.app.name = XXX
spark.jars = .jar
spark.home = /spark-1.1.1-bin-hadoop2.4
spark.executor.memory = 1g
spark.streaming.concurrentJobs = 1000
spark.logConf = true
spark.cleaner.ttl = 3600 //in milliseconds
spark.default.parallelism = 12
spark.executor.extraJavaOptions = -Xloggc:gc.log -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:HeapDumpPath=1.hprof -XX:+HeapDumpOnOutOfMemoryError
spark.executor.logs.rolling.strategy = size
spark.executor.logs.rolling.size.maxBytes = 104857600 // 100 MB
spark.executor.logs.rolling.maxRetainedFiles = 10
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator = xxx.NoOpKryoRegistrator


Other configurations are below:

streaming {
  // All streaming context related configs should come here
  batch-duration = 1 second
  checkpoint-directory = /tmp
  checkpoint-duration = 10 seconds
  slide-duration = 1 second
  window-duration = 1 second
  partitions-for-shuffle-task = 32
}
kafka {
  no-of-receivers = 1
  zookeeper-quorum = :2181
  consumer-group = x
  topic = x:2
}

We tried different combinations, such as:
 - Spark 1.1.0 and 1.1.1
 - increasing executor memory
 - changing the serialization strategy (switching between Kryo and normal Java)
 - changing the broadcast strategy (switching between HTTP and torrent broadcast)


Can anyone give any insight into what we are missing here? How can we fix this?

Due to an Akka version mismatch with some other libraries, we cannot upgrade
the Spark version.

Thanks,
-- 

Sourav Chandra

Senior Software Engineer

· · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

sourav.chan...@livestream.com

o: +91 80 4121 8723

m: +91 988 699 3746

skype: sourav.chandra

Livestream

Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
Block, Koramangala Industrial Area,

Bangalore 560034

www.livestream.com


Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-21 Thread Olivier Girardot
Hi Sourav,
Can you post your updateFunc as well, please?

Regards,

Olivier.

On Tue, Apr 21, 2015 at 12:48, Sourav Chandra sourav.chan...@livestream.com
wrote:

 Hi,

 We are building a spark streaming application which reads from kafka, does
 updateStateBykey based on the received message type and finally stores into
 redis.

 After running for few seconds the executor process get killed by throwing
 OutOfMemory error.

 The code snippet is below:


 *NoOfReceiverInstances = 1*

 *val kafkaStreams = (1 to NoOfReceiverInstances).map(*
 *  _ = KafkaUtils.createStream(ssc, ZKQuorum, ConsumerGroup, TopicsMap)*
 *)*
 *val updateFunc = (values: Seq[IConcurrentUsers], state: Option[(Long,
 Long)]) = {...}*


 *ssc.union(kafkaStreams).map(KafkaMessageMapper(_)).filter(...)..updateStateByKey(updateFunc).foreachRDD(_.foreachPartition(RedisHelper.update(_)))*



 *object RedisHelper {*
 *  private val client = scredis.Redis(*
 *
 ConfigFactory.parseProperties(System.getProperties).getConfig(namespace)*
 *  )*

 *  def update(**itr: Iterator[(String, (Long, Long))]) {*
 *// redis save operation*
 *  }*

 *}*


 *Below is the spark configuration:*


 *spark.app.name http://spark.app.name = XXX*
 *spark.jars = .jar*
 *spark.home = /spark-1.1.1-bin-hadoop2.4*
 *spark.executor.memory = 1g*
 *spark.streaming.concurrentJobs = 1000*
 *spark.logConf = true*
 *spark.cleaner.ttl = 3600 //in milliseconds*
 *spark.default.parallelism = 12*
 *spark.executor.extraJavaOptions = -Xloggc:gc.log -XX:+PrintGCDetails
 -XX:+UseConcMarkSweepGC -XX:HeapDumpPath=1.hprof
 -XX:+HeapDumpOnOutOfMemoryError*
 *spark.executor.logs.rolling.strategy = size*
 *spark.executor.logs.rolling.size.maxBytes = 104857600 // 100 MB*
 *spark.executor.logs.rolling.maxRetainedFiles = 10*
 *spark.serializer = org.apache.spark.serializer.KryoSerializer*
 *spark.kryo.registrator = xxx.NoOpKryoRegistrator*


 other configurations are below

 *streaming {*
 *// All streaming context related configs should come here*
 *batch-duration = 1 second*
 *checkpoint-directory = /tmp*
 *checkpoint-duration = 10 seconds*
 *slide-duration = 1 second*
 *window-duration = 1 second*
 *partitions-for-shuffle-task = 32*
 *  }*
 *  kafka {*
 *no-of-receivers = 1*
 *zookeeper-quorum = :2181*
 *consumer-group = x*
 *topic = x:2*
 *  }*

 We tried different combinations like
  - with spark 1.1.0 and 1.1.1.
  - by increasing executor memory
  - by changing the serialization strategy (switching between kryo and
 normal java)
  - by changing broadcast strategy (switching between http and torrent
 broadcast)


 Can anyone give any insight what we are missing here? How can we fix this?

 Due to akka version mismatch with some other libraries we cannot upgrade
 the spark version.

 Thanks,
 --

 Sourav Chandra

 Senior Software Engineer

 · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · ·

 sourav.chan...@livestream.com

 o: +91 80 4121 8723

 m: +91 988 699 3746

 skype: sourav.chandra

 Livestream

 Ajmera Summit, First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd
 Block, Koramangala Industrial Area,

 Bangalore 560034

 www.livestream.com



OutOfMemory error in Spark Core

2015-01-15 Thread Anand Mohan
We have our Analytics App built on Spark 1.1 Core, Parquet, Avro, and Spray.
We are using the Kryo serializer for the Avro objects read from Parquet, and we
are using our custom Kryo registrator (along the lines of ADAM,
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala#L51
); we just added batched writes and flushes to Kryo's Output for each 512 MB
in the stream, as below:

outstream.array.sliding(512MB).foreach(buf => {
  kryoOut.write(buf)
  kryoOut.flush()
})

Our queries are run against a cached RDD (MEMORY_ONLY) that is obtained after:
1. loading bulk data from Parquet,
2. union-ing it with incremental data in Avro,
3. doing timestamp-based duplicate removal (including partitioning in
reduceByKey), and
4. joining a couple of MySQL tables using JdbcRDD.

Of late, we are seeing major instabilities where the app crashes on a lost
executor, which itself failed due to an OutOfMemory error as below. This looks
almost identical to https://issues.apache.org/jira/browse/SPARK-4885, even
though we are seeing this error in Spark 1.1.

2015-01-15 20:12:51,653 [handle-message-executor-13] ERROR
org.apache.spark.executor.ExecutorUncaughtExceptionHandler - Uncaught
exception in thread Thread[handle-message-executor-13,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.write(Output.java:183)
at
com.philips.hc.eici.analytics.streamingservice.AvroSerializer$$anonfun$write$1.apply(AnalyticsKryoRegistrator.scala:31)
at
com.philips.hc.eici.analytics.streamingservice.AvroSerializer$$anonfun$write$1.apply(AnalyticsKryoRegistrator.scala:30)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
com.philips.hc.eici.analytics.streamingservice.AvroSerializer.write(AnalyticsKryoRegistrator.scala:30)
at
com.philips.hc.eici.analytics.streamingservice.AvroSerializer.write(AnalyticsKryoRegistrator.scala:18)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:119)
at
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
at
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1047)
at
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1056)
at
org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:154)
at
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:421)
at
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:387)
at
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:100)
at
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:79)
at
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
at
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)


The driver log is as below

15/01/15 12:12:53 ERROR scheduler.DAGSchedulerActorSupervisor:
eventProcesserActor failed; shutting down SparkContext
java.util.NoSuchElementException: key not found: 2539
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:769)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4

Re: OutOfMemory error in Spark Core

2015-01-15 Thread Akhil Das
Did you try increasing the parallelism?
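
For example (the values below are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// raise the default number of partitions used by shuffle operations such as reduceByKey
val conf = new SparkConf()
  .setAppName("analytics")
  .set("spark.default.parallelism", "400")
val sc = new SparkContext(conf)

// ...or pass an explicit partition count per operation, e.g. rdd.reduceByKey(mergeFn, 400)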

Thanks
Best Regards

On Fri, Jan 16, 2015 at 10:41 AM, Anand Mohan chinn...@gmail.com wrote:

 We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray.
 We are using Kryo serializer for the Avro objects read from Parquet and we
 are using our custom Kryo registrator, along the lines of ADAM's
 (https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala#L51);
 we just added batched writes and flushes to Kryo's Output for each 512 MB
 in the stream, as below:

 outstream.array.sliding(chunkSize, chunkSize).foreach(buf => {  // chunkSize = 512 MB in bytes
   kryoOut.write(buf)
   kryoOut.flush()
 })

 Our queries are done to a cached RDD (MEMORY_ONLY) that is obtained after
 1. loading bulk data from Parquet
 2. unioning it with incremental data in Avro
 3. doing timestamp-based duplicate removal (including partitioning in
 reduceByKey) and
 4. joining a couple of MySQL tables using JdbcRDD

 Of late, we are seeing major instabilities where the app crashes on a lost
 executor, which itself failed due to an OutOfMemory error as below. This
 looks almost identical to https://issues.apache.org/jira/browse/SPARK-4885,
 even though we are seeing this error in Spark 1.1.

 2015-01-15 20:12:51,653 [handle-message-executor-13] ERROR
 org.apache.spark.executor.ExecutorUncaughtExceptionHandler - Uncaught
 exception in thread Thread[handle-message-executor-13,5,main]
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at
 java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at
 java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
 at
 java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
 at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
 at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
 at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
 at com.esotericsoftware.kryo.io.Output.write(Output.java:183)
 at

 com.philips.hc.eici.analytics.streamingservice.AvroSerializer$$anonfun$write$1.apply(AnalyticsKryoRegistrator.scala:31)
 at

 com.philips.hc.eici.analytics.streamingservice.AvroSerializer$$anonfun$write$1.apply(AnalyticsKryoRegistrator.scala:30)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at

 com.philips.hc.eici.analytics.streamingservice.AvroSerializer.write(AnalyticsKryoRegistrator.scala:30)
 at

 com.philips.hc.eici.analytics.streamingservice.AvroSerializer.write(AnalyticsKryoRegistrator.scala:18)
 at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
 at

 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
 at

 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
 at
 com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
 at

 org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:119)
 at

 org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
 at

 org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1047)
 at

 org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1056)
 at
 org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:154)
 at
 org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:421)
 at
 org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:387)
 at

 org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:100)
 at

 org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:79)
 at

 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
 at

 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:48)
 at

 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)


 The driver log is as below

 15/01/15 12:12:53 ERROR scheduler.DAGSchedulerActorSupervisor:
 eventProcesserActor failed; shutting down SparkContext
 java.util.NoSuchElementException: key not found: 2539
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:799)

Re: OutOfMemory Error

2014-08-20 Thread MEETHU MATHEW


 Hi ,

How to increase the heap size?

What is the difference between spark executor memory and heap size?

Thanks & Regards,
Meethu M


On Monday, 18 August 2014 12:35 PM, Akhil Das ak...@sigmoidanalytics.com 
wrote:
 


I believe spark.shuffle.memoryFraction is the one you are looking for.

spark.shuffle.memoryFraction : Fraction of Java heap to use for aggregation and 
cogroups during shuffles, if spark.shuffle.spill is true. At any given time, 
the collective size of all in-memory maps used for shuffles is bounded by this 
limit, beyond which the contents will begin to spill to disk. If spills are 
often, consider increasing this value at the expense of 
spark.storage.memoryFraction.


You can give it a try.



Thanks
Best Regards


On Mon, Aug 18, 2014 at 12:21 PM, Ghousia ghousia.ath...@gmail.com wrote:

Thanks for the answer Akhil. We are right now getting rid of this issue by
increasing the number of partitions, and we are persisting RDDs to DISK_ONLY.
But the issue is with heavy computations within an RDD. It would be better if
we had the option of spilling the intermediate transformation results to local
disk (only in case memory consumption is high). Do we have any such option
available with Spark? If increasing the partitions is the only way, then one
might end up with OutOfMemory errors when working with certain algorithms
where the intermediate result is huge.




On Mon, Aug 18, 2014 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Hi Ghousia,


You can try the following:


1. Increase the heap size
2. Increase the number of partitions
3. You could try persisting the RDD to use DISK_ONLY




Thanks
Best Regards



On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj ghousia.ath...@gmail.com 
wrote:

Hi,

I am trying to implement machine learning algorithms on Spark. I am working
on a 3-node cluster, with each node having 5GB of memory. Whenever I am
working with a slightly larger number of records, I end up with an OutOfMemory
Error. The problem is that even if the number of records is only slightly high,
the intermediate result from a transformation is huge, and this results in an
OutOfMemory Error. To overcome this, we are partitioning the data such that
each partition has only a few records.

Is there any better way to fix this issue? Something like spilling the
intermediate data to local disk?

Thanks,
Ghousia.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





RE: OutOfMemory Error

2014-08-20 Thread Shao, Saisai
Hi Meethu,

The spark.executor.memory setting is the Java heap size of the forked executor
process. Increasing spark.executor.memory therefore does increase the runtime
heap size of the executor process.

For the details of Spark configurations, you can check: 
http://spark.apache.org/docs/latest/configuration.html
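
For example (the 4g value is only an illustration, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

// spark.executor.memory sets the executor JVM heap (the -Xmx of the forked process);
// the driver heap is set separately, e.g. via --driver-memory on spark-submit.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)

The same value can also be passed on the command line with
spark-submit --executor-memory 4g.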

Thanks
Jerry

From: MEETHU MATHEW [mailto:meethu2...@yahoo.co.in]
Sent: Wednesday, August 20, 2014 4:48 PM
To: Akhil Das; Ghousia
Cc: user@spark.apache.org
Subject: Re: OutOfMemory Error


 Hi ,

How to increase the heap size?

What is the difference between spark executor memory and heap size?

Thanks & Regards,
Meethu M

On Monday, 18 August 2014 12:35 PM, Akhil Das 
ak...@sigmoidanalytics.com wrote:

I believe spark.shuffle.memoryFraction is the one you are looking for.

spark.shuffle.memoryFraction : Fraction of Java heap to use for aggregation and 
cogroups during shuffles, if spark.shuffle.spill is true. At any given time, 
the collective size of all in-memory maps used for shuffles is bounded by this 
limit, beyond which the contents will begin to spill to disk. If spills are 
often, consider increasing this value at the expense of 
spark.storage.memoryFraction.

You can give it a try.


Thanks
Best Regards

On Mon, Aug 18, 2014 at 12:21 PM, Ghousia 
ghousia.ath...@gmail.com wrote:
Thanks for the answer Akhil. We are right now getting rid of this issue by
increasing the number of partitions, and we are persisting RDDs to DISK_ONLY.
But the issue is with heavy computations within an RDD. It would be better if
we had the option of spilling the intermediate transformation results to local
disk (only in case memory consumption is high). Do we have any such option
available with Spark? If increasing the partitions is the only way, then one
might end up with OutOfMemory errors when working with certain algorithms
where the intermediate result is huge.

On Mon, Aug 18, 2014 at 12:02 PM, Akhil Das 
ak...@sigmoidanalytics.com wrote:
Hi Ghousia,

You can try the following:

1. Increase the heap size
https://spark.apache.org/docs/0.9.0/configuration.html
2. Increase the number of partitions
http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
3. You could try persisting the RDD to use DISK_ONLY
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence


Thanks
Best Regards

On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj 
ghousia.ath...@gmail.com wrote:
Hi,

I am trying to implement machine learning algorithms on Spark. I am working
on a 3-node cluster, with each node having 5GB of memory. Whenever I am
working with a slightly larger number of records, I end up with an OutOfMemory
Error. The problem is that even if the number of records is only slightly high,
the intermediate result from a transformation is huge, and this results in an
OutOfMemory Error. To overcome this, we are partitioning the data such that
each partition has only a few records.

Is there any better way to fix this issue? Something like spilling the
intermediate data to local disk?

Thanks,
Ghousia.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Re: OutOfMemory Error

2014-08-19 Thread Ghousia
Hi,

Any further info on this??

Do you think it would be useful to have an in-memory buffer implemented
that stores the contents of the new RDD? Once the buffer reaches a
configured threshold, the contents of the buffer would be spilled to the
local disk. This would save us from the OutOfMemory Error.
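
As far as I can tell, the closest built-in option today is the MEMORY_AND_DISK
family of storage levels, which spill partitions that do not fit in memory to
local disk once the RDD is materialized (it does not help inside a single
transformation). A minimal, illustrative sketch:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("spill-sketch"))
val intermediate = sc.textFile("hdfs:///data/input")   // hypothetical input
  .map(line => line * 10)                              // stand-in for a record-expanding step
  .persist(StorageLevel.MEMORY_AND_DISK_SER)           // partitions that don't fit go to local disk, serialized
println(intermediate.count())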

Appreciate any suggestions in this regard.

Many Thanks,
Ghousia.


On Mon, Aug 18, 2014 at 4:05 PM, Ghousia ghousia.ath...@gmail.com wrote:

 But this would be applicable only to operations that have a shuffle phase.

 This might not be applicable to a simple Map operation where a record is
 mapped to a new huge value, resulting in OutOfMemory Error.



 On Mon, Aug 18, 2014 at 12:34 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 I believe spark.shuffle.memoryFraction is the one you are looking for.

 spark.shuffle.memoryFraction : Fraction of Java heap to use for
 aggregation and cogroups during shuffles, if spark.shuffle.spill is
 true. At any given time, the collective size of all in-memory maps used for
 shuffles is bounded by this limit, beyond which the contents will begin to
 spill to disk. If spills are often, consider increasing this value at the
 expense of spark.storage.memoryFraction.

 You can give it a try.


 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 12:21 PM, Ghousia ghousia.ath...@gmail.com
 wrote:

 Thanks for the answer Akhil. We are right now getting rid of this issue by
 increasing the number of partitions, and we are persisting RDDs to DISK_ONLY.
 But the issue is with heavy computations within an RDD. It would be better if
 we had the option of spilling the intermediate transformation results to local
 disk (only in case memory consumption is high). Do we have any such option
 available with Spark? If increasing the partitions is the only way, then one
 might end up with OutOfMemory errors when working with certain algorithms
 where the intermediate result is huge.


 On Mon, Aug 18, 2014 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Hi Ghousia,

 You can try the following:

 1. Increase the heap size
 https://spark.apache.org/docs/0.9.0/configuration.html
 2. Increase the number of partitions
 http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
 3. You could try persisting the RDD to use DISK_ONLY
 http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence



 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj ghousia.ath...@gmail.com
  wrote:

 Hi,

 I am trying to implement machine learning algorithms on Spark. I am working
 on a 3-node cluster, with each node having 5GB of memory. Whenever I am
 working with a slightly larger number of records, I end up with an OutOfMemory
 Error. The problem is that even if the number of records is only slightly high,
 the intermediate result from a transformation is huge, and this results in an
 OutOfMemory Error. To overcome this, we are partitioning the data such that
 each partition has only a few records.

 Is there any better way to fix this issue? Something like spilling the
 intermediate data to local disk?

 Thanks,
 Ghousia.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: OutOfMemory Error

2014-08-18 Thread Akhil Das
Hi Ghousia,

You can try the following:

1. Increase the heap size
https://spark.apache.org/docs/0.9.0/configuration.html
2. Increase the number of partitions
http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
3. You could try persisting the RDD to use DISK_ONLY
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
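
A sketch of suggestions 2 and 3 combined (the path, partition counts and the
transformation are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PartitionAndPersist {
  // hypothetical per-record expansion standing in for the heavy transformation
  def expand(line: String): Seq[(String, Int)] =
    line.split(",").toSeq.map(f => (f, f.length))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-and-persist"))
    val records = sc.textFile("hdfs:///data/records", 400) // more input splits up front
    val features = records
      .repartition(800)                   // smaller per-task working set
      .flatMap(expand)                    // heavy transformation with a large intermediate result
      .persist(StorageLevel.DISK_ONLY)    // materialized blocks go to local disk, not the heap
    println(features.count())
    sc.stop()
  }
}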



Thanks
Best Regards


On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj ghousia.ath...@gmail.com
wrote:

 Hi,

 I am trying to implement machine learning algorithms on Spark. I am working
 on a 3-node cluster, with each node having 5GB of memory. Whenever I am
 working with a slightly larger number of records, I end up with an OutOfMemory
 Error. The problem is that even if the number of records is only slightly high,
 the intermediate result from a transformation is huge, and this results in an
 OutOfMemory Error. To overcome this, we are partitioning the data such that
 each partition has only a few records.

 Is there any better way to fix this issue? Something like spilling the
 intermediate data to local disk?

 Thanks,
 Ghousia.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: OutOfMemory Error

2014-08-18 Thread Ghousia
Thanks for the answer Akhil. We are right now getting rid of this issue by
increasing the number of partitions, and we are persisting RDDs to DISK_ONLY.
But the issue is with heavy computations within an RDD. It would be better if
we had the option of spilling the intermediate transformation results to local
disk (only in case memory consumption is high). Do we have any such option
available with Spark? If increasing the partitions is the only way, then one
might end up with OutOfMemory errors when working with certain algorithms
where the intermediate result is huge.


On Mon, Aug 18, 2014 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Hi Ghousia,

 You can try the following:

 1. Increase the heap size
 https://spark.apache.org/docs/0.9.0/configuration.html
 2. Increase the number of partitions
 http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
 3. You could try persisting the RDD to use DISK_ONLY
 http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence



 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj ghousia.ath...@gmail.com
 wrote:

 Hi,

 I am trying to implement machine learning algorithms on Spark. I am working
 on a 3-node cluster, with each node having 5GB of memory. Whenever I am
 working with a slightly larger number of records, I end up with an OutOfMemory
 Error. The problem is that even if the number of records is only slightly high,
 the intermediate result from a transformation is huge, and this results in an
 OutOfMemory Error. To overcome this, we are partitioning the data such that
 each partition has only a few records.

 Is there any better way to fix this issue? Something like spilling the
 intermediate data to local disk?

 Thanks,
 Ghousia.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: OutOfMemory Error

2014-08-18 Thread Akhil Das
I believe spark.shuffle.memoryFraction is the one you are looking for.

spark.shuffle.memoryFraction : Fraction of Java heap to use for aggregation
and cogroups during shuffles, if spark.shuffle.spill is true. At any given
time, the collective size of all in-memory maps used for shuffles is
bounded by this limit, beyond which the contents will begin to spill to
disk. If spills are often, consider increasing this value at the expense of
spark.storage.memoryFraction.

You can give it a try.
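
A minimal example of shifting the two fractions (the exact values are
illustrative; the 1.x defaults are 0.2 for shuffle and 0.6 for storage):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-tuning")
  .set("spark.shuffle.spill", "true")            // allow shuffle maps to spill to disk
  .set("spark.shuffle.memoryFraction", "0.4")    // more heap for shuffle aggregation/cogroup
  .set("spark.storage.memoryFraction", "0.5")    // correspondingly less heap for cached blocks
val sc = new SparkContext(conf)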


Thanks
Best Regards


On Mon, Aug 18, 2014 at 12:21 PM, Ghousia ghousia.ath...@gmail.com wrote:

 Thanks for the answer Akhil. We are right now getting rid of this issue by
 increasing the number of partitions, and we are persisting RDDs to DISK_ONLY.
 But the issue is with heavy computations within an RDD. It would be better if
 we had the option of spilling the intermediate transformation results to local
 disk (only in case memory consumption is high). Do we have any such option
 available with Spark? If increasing the partitions is the only way, then one
 might end up with OutOfMemory errors when working with certain algorithms
 where the intermediate result is huge.


 On Mon, Aug 18, 2014 at 12:02 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Hi Ghousia,

 You can try the following:

 1. Increase the heap size
 https://spark.apache.org/docs/0.9.0/configuration.html
 2. Increase the number of partitions
 http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
 3. You could try persisting the RDD to use DISK_ONLY
 http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence



 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 10:40 AM, Ghousia Taj ghousia.ath...@gmail.com
 wrote:

 Hi,

 I am trying to implement machine learning algorithms on Spark. I am working
 on a 3-node cluster, with each node having 5GB of memory. Whenever I am
 working with a slightly larger number of records, I end up with an OutOfMemory
 Error. The problem is that even if the number of records is only slightly high,
 the intermediate result from a transformation is huge, and this results in an
 OutOfMemory Error. To overcome this, we are partitioning the data such that
 each partition has only a few records.

 Is there any better way to fix this issue? Something like spilling the
 intermediate data to local disk?

 Thanks,
 Ghousia.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






OutOfMemory Error

2014-08-17 Thread Ghousia Taj
Hi,

I am trying to implement machine learning algorithms on Spark. I am working
on a 3-node cluster, with each node having 5GB of memory. Whenever I am
working with a slightly larger number of records, I end up with an OutOfMemory
Error. The problem is that even if the number of records is only slightly high,
the intermediate result from a transformation is huge, and this results in an
OutOfMemory Error. To overcome this, we are partitioning the data such that
each partition has only a few records.

Is there any better way to fix this issue? Something like spilling the
intermediate data to local disk?

Thanks,
Ghousia.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-Error-tp12275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org