Re: Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Mich Talebzadeh
OK,

This is a common issue in Spark Structured Streaming (SSS), where the
source generates data faster than Spark can process it. SSS doesn't have a
built-in mechanism for directly rate-limiting the incoming data stream
itself. However, consider the following:


   - Limit the rate at which data is produced. This can involve configuring
   the data source itself to emit data at a controlled rate, or implementing
   rate-limiting mechanisms in the application or system that produces the
   data.
   - SSS supports backpressure, which allows it to dynamically adjust the
   ingestion rate based on the processing capacity of the system. This can
   help prevent overwhelming the system with data. To enable backpressure,
   set the relevant configuration properties, e.g.
   spark.conf.set("spark.streaming.backpressure.enabled", "true") and
   spark.streaming.backpressure.initialRate (note that these two properties
   come from the older DStream API rather than Structured Streaming).
   - Consider adjusting the micro-batch interval to control the rate at
   which data is processed. Increasing the micro-batch interval reduces the
   frequency of processing, allowing more time for each batch to complete
   and reducing the likelihood of out-of-memory errors, e.g. by setting a
   longer processing-time trigger on the query (see the sketch after this
   list).
   - Dynamic Resource Allocation (DRA), not implemented yet. DRA will
   automatically adjust allocated resources based on workload. This ensures
   Spark has enough resources to process incoming data within the trigger
   interval, preventing backlogs and potential OOM issues.
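
As a concrete illustration of the first and third points, here is a minimal
PySpark sketch (not from the original thread). The Kafka source, broker
address, topic and checkpoint path are placeholders only, and
maxOffsetsPerTrigger is specific to the Kafka source; a custom socket source
like the one in the question would need its own equivalent cap.

from pyspark.sql import SparkSession

# Requires the spark-sql-kafka package on the classpath for the Kafka source.
spark = SparkSession.builder.appName("rate-limit-sketch").getOrCreate()

df = (spark.readStream
      .format("kafka")                                 # illustrative source only
      .option("kafka.bootstrap.servers", "host:9092")  # placeholder broker
      .option("subscribe", "events")                   # placeholder topic
      .option("maxOffsetsPerTrigger", 100000)          # hard cap per micro-batch
      .load())

query = (df.writeStream
         .format("console")
         .trigger(processingTime="30 seconds")         # longer micro-batches
         .option("checkpointLocation", "/tmp/chk")     # placeholder path
         .start())

query.awaitTermination()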


From the Spark UI, look at the Streaming tab. There are various statistics
there. In general, your Processing Time has to be less than your batch
interval. The Scheduling Delay and Total Delay are additional indicators of
health.
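
The same signal is also available programmatically. A hedged sketch, assuming
a running StreamingQuery handle (for example the query variable in the sketch
above) and using the field names of the StreamingQueryProgress JSON:

def falling_behind(query):
    """Return True if the latest micro-batch could not keep up with its input."""
    progress = query.lastProgress                      # dict for the latest batch, or None
    if not progress:
        return False
    in_rate = progress.get("inputRowsPerSecond", 0.0)
    out_rate = progress.get("processedRowsPerSecond", 0.0)
    return out_rate < in_rate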

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Sun, 7 Apr 2024 at 15:11, Baran, Mert  wrote:

> Hi Spark community,
>
> I have a Spark Structured Streaming application that reads data from a
> socket source (implemented very similarly to the
> TextSocketMicroBatchStream). The issue is that the source can generate
> data faster than Spark can process it, eventually leading to an
> OutOfMemoryError when Spark runs out of memory trying to queue up all
> the pending data.
>
> I'm looking for advice on the most idiomatic/recommended way in Spark to
> rate-limit data ingestion to avoid overwhelming the system.
>
> Approaches I've considered:
>
> 1. Using a BlockingQueue with a fixed size to throttle the data.
> However, this requires careful tuning of the queue size. If too small,
> it limits throughput; if too large, you risk batches taking too long.
>
> 2. Fetching a limited number of records in the PartitionReader's next(),
> adding the records into a queue and checking if the queue is empty.
> However, I'm not sure if there is a built-in way to dynamically scale
> the number of records fetched (i.e., dynamically calculating the offset)
> based on the system load and capabilities.
>
> So in summary, what is the recommended way to dynamically rate-limit a
> streaming source to match Spark's processing capacity and avoid
> out-of-memory issues? Are there any best practices or configuration
> options I should look at?
> Any guidance would be much appreciated! Let me know if you need any
> other details.
>
> Thanks,
> Mert
>
>
>
>


Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Baran, Mert

Hi Spark community,

I have a Spark Structured Streaming application that reads data from a 
socket source (implemented very similarly to the 
TextSocketMicroBatchStream). The issue is that the source can generate 
data faster than Spark can process it, eventually leading to an 
OutOfMemoryError when Spark runs out of memory trying to queue up all 
the pending data.


I'm looking for advice on the most idiomatic/recommended way in Spark to 
rate-limit data ingestion to avoid overwhelming the system.


Approaches I've considered:

1. Using a BlockingQueue with a fixed size to throttle the data.
However, this requires careful tuning of the queue size. If too small,
it limits throughput; if too large, you risk batches taking too long (see
the sketch after approach 2 below).


2. Fetching a limited number of records in the PartitionReader's next(), 
adding the records into a queue and checking if the queue is empty. 
However, I'm not sure if there is a built-in way to dynamically scale 
the number of records fetched (i.e., dynamically calculating the offset) 
based on the system load and capabilities.
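
For what it's worth, a minimal sketch of the bounded-queue idea in approach 1
(names and sizes are illustrative, and a real implementation would live in
the JVM source rather than in Python):

import queue

buffer = queue.Queue(maxsize=10000)        # the size that is hard to tune

def produce(records):
    """Producer side: put() blocks once the buffer is full, which is the
    backpressure that keeps the source from outrunning Spark."""
    for record in records:
        buffer.put(record)

def next_batch(max_records=1000):
    """Reader side: drain at most max_records per call, roughly what a
    PartitionReader-style next() loop would hand to one micro-batch."""
    batch = []
    while len(batch) < max_records:
        try:
            batch.append(buffer.get(timeout=0.1))
        except queue.Empty:
            break
    return batch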


So in summary, what is the recommended way to dynamically rate-limit a 
streaming source to match Spark's processing capacity and avoid 
out-of-memory issues? Are there any best practices or configuration 
options I should look at?
Any guidance would be much appreciated! Let me know if you need any 
other details.


Thanks,
Mert





Re: [Beginner Debug]: Executor OutOfMemoryError

2024-02-23 Thread Mich Talebzadeh
Seems like you are having memory issues. Examine your settings.

   1. It appears that your driver memory setting is too high. It should
   be a fraction of the total memory provided by YARN.
   2. Use the Spark UI to monitor the job's memory consumption. Check the
   Storage tab to see how memory is being utilized across caches, data, and
   shuffle.
   3. Check the Executors tab to identify tasks or executors that are
   experiencing memory issues. Look for tasks with high input sizes or shuffle
   spills.
   4. In YARN mode, consider setting the spark.executor.memoryOverhead
   property to handle executor overhead. This is important for tasks that
   require additional memory beyond the executor memory setting, for example:
   --conf spark.executor.memoryOverhead=1000
   (a rough sizing sketch follows this list)
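
A rough sizing sketch for points 1 and 4 (the 8g executor figure is
illustrative; max(384 MiB, 0.10 x executor memory) is the documented default
for spark.executor.memoryOverhead):

# Back-of-the-envelope YARN container sizing (values illustrative).
executor_memory_mb = 8 * 1024                                    # --executor-memory 8g
default_overhead_mb = max(384, int(0.10 * executor_memory_mb))   # Spark's default
explicit_overhead_mb = 1000                   # --conf spark.executor.memoryOverhead=1000

# The container YARN has to grant is executor memory plus the overhead in force.
container_mb = executor_memory_mb + explicit_overhead_mb
print(f"default overhead: {default_overhead_mb} MiB, requested container: ~{container_mb} MiB")
# If this exceeds yarn.scheduler.maximum-allocation-mb the executor never starts;
# if the overhead is too small, YARN kills the container for exceeding memory limits.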

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Fri, 23 Feb 2024 at 02:42, Shawn Ligocki  wrote:

> Hi I'm new to Spark and I'm running into a lot of OOM issues while trying
> to scale up my first Spark application. I am running into these issues with
> only 1% of the final expected data size. Can anyone help me understand how
> to properly configure Spark to use limited memory or how to debug which
> part of my application is causing so much memory trouble?
>
> My logs end up with tons of messages like:
>
> 24/02/22 10:51:01 WARN TaskMemoryManager: Failed to allocate a page
>> (134217728 bytes), try again.
>> 24/02/22 10:51:01 WARN RowBasedKeyValueBatch: Calling spill() on
>> RowBasedKeyValueBatch. Will not spill but return 0.
>> 24/02/22 10:52:28 WARN Executor: Issue communicating with driver in
>> heartbeater
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [1
>> milliseconds]. This timeout is controlled by
>> spark.executor.heartbeatInterval
>> ...
>> 24/02/22 10:58:17 WARN NettyRpcEnv: Ignored message:
>> HeartbeatResponse(false)
>> 24/02/22 10:58:17 WARN HeartbeatReceiver: Removing executor driver with
>> no recent heartbeats: 207889 ms exceeds timeout 12 ms
>> 24/02/22 10:58:17 ERROR Executor: Exception in task 175.0 in stage 2.0
>> (TID 676)
>> java.lang.OutOfMemoryError: Java heap space
>> ...
>
>
> Background: The goal of this application is to load a large number of
> parquet files, group by a couple fields and compute some summarization
> metrics for each group and write the result out. In Python basically:
>
> from pyspark.sql import SparkSession
>> import pyspark.sql.functions as func
>
>
>> spark = SparkSession.builder.getOrCreate()
>> df = spark.read.parquet(*pred_paths)
>> df = df.groupBy("point_id", "species_code").agg(
>>   func.count("pred_occ").alias("ensemble_support"))
>> df.write.parquet(output_path)
>
>
> And I am launching it with:
>
> spark-submit \
>>   --name ensemble \
>>   --driver-memory 64g --executor-memory 64g \
>>   stem/ensemble_spark.py
>
>
> I noticed that increasing --driver-memory and --executor-memory did help
> me scale up somewhat, but I cannot increase those forever.
>
> Some details:
>
>- All my tests are currently on a single cluster node (with 128GB RAM
>& 64 CPU cores) or locally on my laptop (32GB RAM & 12 CPU cores).
>Eventually, I expect to run this in parallel on the cluster.
>- This is running on Spark 3.0.1 (in the cluster), I'm seeing the same
>issues with 3.5 on my laptop.
>- The input data is tons of parquet files stored on NFS. For the final
>application it will be about 50k parquet files ranging in size up to 15GB
>each. Total size of 100TB, 4 trillion rows, 5 columns. I am currently
>testing with ~1% this size: 500 files, 1TB total, 40B rows total.
>- There should only be a max of 100 rows per group. So I expect an
>output size somewhere in the range 1-5TB, 40-200B rows. For the test: 50GB,
>2B rows. These output files are also written to NFS.
>- The rows for the same groups are not near each other. Ex: no single
>parquet file will have any two rows for the same group.
>
> Here are some questions I have:
>
>1. Does Spark know how much memory is available? Do I need to tell it
>somehow? Is there other configuration that I should set up for a run like
>this? I know that 1TB input data is too much to fit in memory, but I
>assumed that Spark would work on it in small enough batches to fit. Do I
>need to configure those batches somehow?
>2. How can I debug what is causing it to OOM?
>3. Does this have something to do with the fact that I'm loading the
>  

[Beginner Debug]: Executor OutOfMemoryError

2024-02-22 Thread Shawn Ligocki
Hi I'm new to Spark and I'm running into a lot of OOM issues while trying
to scale up my first Spark application. I am running into these issues with
only 1% of the final expected data size. Can anyone help me understand how
to properly configure Spark to use limited memory or how to debug which
part of my application is causing so much memory trouble?

My logs end up with tons of messages like:

24/02/22 10:51:01 WARN TaskMemoryManager: Failed to allocate a page
> (134217728 bytes), try again.
> 24/02/22 10:51:01 WARN RowBasedKeyValueBatch: Calling spill() on
> RowBasedKeyValueBatch. Will not spill but return 0.
> 24/02/22 10:52:28 WARN Executor: Issue communicating with driver in
> heartbeater
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [1
> milliseconds]. This timeout is controlled by
> spark.executor.heartbeatInterval
> ...
> 24/02/22 10:58:17 WARN NettyRpcEnv: Ignored message:
> HeartbeatResponse(false)
> 24/02/22 10:58:17 WARN HeartbeatReceiver: Removing executor driver with no
> recent heartbeats: 207889 ms exceeds timeout 12 ms
> 24/02/22 10:58:17 ERROR Executor: Exception in task 175.0 in stage 2.0
> (TID 676)
> java.lang.OutOfMemoryError: Java heap space
> ...


Background: The goal of this application is to load a large number of
parquet files, group by a couple fields and compute some summarization
metrics for each group and write the result out. In Python basically:

from pyspark.sql import SparkSession
> import pyspark.sql.functions as func


> spark = SparkSession.builder.getOrCreate()
> df = spark.read.parquet(*pred_paths)
> df = df.groupBy("point_id", "species_code").agg(
>   func.count("pred_occ").alias("ensemble_support"))
> df.write.parquet(output_path)


And I am launching it with:

spark-submit \
>   --name ensemble \
>   --driver-memory 64g --executor-memory 64g \
>   stem/ensemble_spark.py


I noticed that increasing --driver-memory and --executor-memory did help me
scale up somewhat, but I cannot increase those forever.

Some details:

   - All my tests are currently on a single cluster node (with 128GB RAM &
   64 CPU cores) or locally on my laptop (32GB RAM & 12 CPU cores).
   Eventually, I expect to run this in parallel on the cluster.
   - This is running on Spark 3.0.1 (in the cluster), I'm seeing the same
   issues with 3.5 on my laptop.
   - The input data is tons of parquet files stored on NFS. For the final
   application it will be about 50k parquet files ranging in size up to 15GB
   each. Total size of 100TB, 4 trillion rows, 5 columns. I am currently
   testing with ~1% this size: 500 files, 1TB total, 40B rows total.
   - There should only be a max of 100 rows per group. So I expect an
   output size somewhere in the range 1-5TB, 40-200B rows. For the test: 50GB,
   2B rows. These output files are also written to NFS.
   - The rows for the same groups are not near each other. Ex: no single
   parquet file will have any two rows for the same group.

Here are some questions I have:

   1. Does Spark know how much memory is available? Do I need to tell it
   somehow? Is there other configuration that I should set up for a run like
   this? I know that 1TB input data is too much to fit in memory, but I
   assumed that Spark would work on it in small enough batches to fit. Do I
   need to configure those batches somehow?
   2. How can I debug what is causing it to OOM?
   3. Does this have something to do with the fact that I'm loading the
   data from Parquet files? Or that I'm loading so many different files? Or
   that I'm loading them from NFS?
   4. Do I need to configure the reduce step (group and aggregation)
   differently because of the type of data I have (large numbers of groups,
   stratified groups)?

Thank you!
-Shawn Ligocki


Re: OutOfMemoryError

2021-07-06 Thread Mich Talebzadeh
Personally, rather than hard-coding the parameters in the app, as in:

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("OOM")
  .config("spark.driver.host", "localhost")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.sql.caseSensitive", "false")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.driver.memory", "24g")
  .getOrCreate()

I leave the spec to run time:

def spark_session_local(appName):
    return SparkSession.builder \
        .master('local[*]') \
        .appName(appName) \
        .enableHiveSupport() \
        .getOrCreate()



And then pass the parameters at spark-submit:


${SPARK_HOME}/bin/spark-submit \
  --driver-memory 8G \
  --num-executors 1 \
  --master local \
  --executor-cores 2 \
  --conf "spark.scheduler.mode=FIFO" \
  --conf "spark.ui.port=5" \
  --conf spark.executor.memoryOverhead=3000 \

HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jul 2021 at 12:44, Sean Owen  wrote:

> You need to set driver memory before the driver starts, on the CLI or
> however you run your app, not in the app itself. By the time the driver
> starts to run your app, its heap is already set.
>
> On Thu, Jul 1, 2021 at 12:10 AM javaguy Java  wrote:
>
>> Hi,
>>
>> I'm getting Java OOM errors even though I'm setting my driver memory to
>> 24g and I'm executing against local[*]
>>
>> I was wondering if anyone can give me any insight.  The server this job is 
>> running on has more than enough memory as does the spark driver.
>>
>> The final result does write 3 csv files that are 300MB each so there's no 
>> way its coming close to the 24g
>>
>> From the OOM, I don't know about the internals of Spark itself to tell me 
>> where this is failing + how I should refactor or change anything
>>
>> Would appreciate any advice on how I can resolve
>>
>> Thx
>>
>>
>> Parameters here:
>>
>> val spark = SparkSession
>>   .builder
>>   .master("local[*]")
>>   .appName("OOM")
>>   .config("spark.driver.host", "localhost")
>>   .config("spark.driver.maxResultSize", "0")
>>   .config("spark.sql.caseSensitive", "false")
>>   .config("spark.sql.adaptive.enabled", "true")
>>   .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
>>   .config("spark.driver.memory", "24g")
>>   .getOrCreate()
>>
>>
>> My OOM errors are below:
>>
>> driver): java.lang.OutOfMemoryError: Java heap space
>>  at java.io.BufferedOutputStream.(BufferedOutputStream.java:76)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:109)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:110)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
>>  at 
>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
>>  at 
>> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:127)
>>  at 
>> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>>  at 
>> org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/1058609963.apply(Unknown
>>  Source)
>>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>  at java.lang.Thread.run(Thread.java:748)
>>  
>>  
>>  
>>  
>> driver): java.lang.OutOfMemoryError: Java heap space
>>  at 
>> net.jpountz.lz4.LZ4BlockOutputStream.(LZ4BlockOutputStream.java:102)
>>  at 
>> org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:145)
>>  at 
>> org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:158)
>>  at 
>> org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:133)
>>  at 
>> 

Re: OutOfMemoryError

2021-07-06 Thread javaguy Java
Hi Sean, thx for the tip. I'm just running my app via spark-submit on the
CLI, ie: spark-submit --class X --master local[*] assembly.jar, so I'll now
add the driver memory to the CLI args, ie:
spark-submit --class X --master local[*] --driver-memory 8g assembly.jar
etc.

Unless I have this wrong?

Thx


On Thu, Jul 1, 2021 at 1:43 PM Sean Owen  wrote:

> You need to set driver memory before the driver starts, on the CLI or
> however you run your app, not in the app itself. By the time the driver
> starts to run your app, its heap is already set.
>
> On Thu, Jul 1, 2021 at 12:10 AM javaguy Java  wrote:
>
>> Hi,
>>
>> I'm getting Java OOM errors even though I'm setting my driver memory to
>> 24g and I'm executing against local[*]
>>
>> I was wondering if anyone can give me any insight.  The server this job is 
>> running on has more than enough memory as does the spark driver.
>>
>> The final result does write 3 csv files that are 300MB each so there's no 
>> way its coming close to the 24g
>>
>> From the OOM, I don't know about the internals of Spark itself to tell me 
>> where this is failing + how I should refactor or change anything
>>
>> Would appreciate any advice on how I can resolve
>>
>> Thx
>>
>>
>> Parameters here:
>>
>> val spark = SparkSession
>>   .builder
>>   .master("local[*]")
>>   .appName("OOM")
>>   .config("spark.driver.host", "localhost")
>>   .config("spark.driver.maxResultSize", "0")
>>   .config("spark.sql.caseSensitive", "false")
>>   .config("spark.sql.adaptive.enabled", "true")
>>   .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
>>   .config("spark.driver.memory", "24g")
>>   .getOrCreate()
>>
>>
>> My OOM errors are below:
>>
>> driver): java.lang.OutOfMemoryError: Java heap space
>>  at java.io.BufferedOutputStream.(BufferedOutputStream.java:76)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:109)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:110)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
>>  at 
>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
>>  at 
>> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:127)
>>  at 
>> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>>  at 
>> org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/1058609963.apply(Unknown
>>  Source)
>>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>  at java.lang.Thread.run(Thread.java:748)
>>  
>>  
>>  
>>  
>> driver): java.lang.OutOfMemoryError: Java heap space
>>  at 
>> net.jpountz.lz4.LZ4BlockOutputStream.(LZ4BlockOutputStream.java:102)
>>  at 
>> org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:145)
>>  at 
>> org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:158)
>>  at 
>> org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:133)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:122)
>>  at 
>> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
>>  at 
>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
>>  at 
>> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>  at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:127)
>>  at 
>> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>>  at 
>> org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/249605067.apply(Unknown
>>  Source)
>>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>  at 
>> 

Re: OutOfMemoryError

2021-07-01 Thread Sean Owen
You need to set driver memory before the driver starts, on the CLI or
however you run your app, not in the app itself. By the time the driver
starts to run your app, its heap is already set.

On Thu, Jul 1, 2021 at 12:10 AM javaguy Java  wrote:

> Hi,
>
> I'm getting Java OOM errors even though I'm setting my driver memory to
> 24g and I'm executing against local[*]
>
> I was wondering if anyone can give me any insight.  The server this job is 
> running on has more than enough memory as does the spark driver.
>
> The final result does write 3 csv files that are 300MB each so there's no way 
> its coming close to the 24g
>
> From the OOM, I don't know about the internals of Spark itself to tell me 
> where this is failing + how I should refactor or change anything
>
> Would appreciate any advice on how I can resolve
>
> Thx
>
>
> Parameters here:
>
> val spark = SparkSession
>   .builder
>   .master("local[*]")
>   .appName("OOM")
>   .config("spark.driver.host", "localhost")
>   .config("spark.driver.maxResultSize", "0")
>   .config("spark.sql.caseSensitive", "false")
>   .config("spark.sql.adaptive.enabled", "true")
>   .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
>   .config("spark.driver.memory", "24g")
>   .getOrCreate()
>
>
> My OOM errors are below:
>
> driver): java.lang.OutOfMemoryError: Java heap space
>   at java.io.BufferedOutputStream.(BufferedOutputStream.java:76)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:109)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:110)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/1058609963.apply(Unknown
>  Source)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   
>   
>   
>   
> driver): java.lang.OutOfMemoryError: Java heap space
>   at 
> net.jpountz.lz4.LZ4BlockOutputStream.(LZ4BlockOutputStream.java:102)
>   at 
> org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:145)
>   at 
> org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:158)
>   at 
> org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:133)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:122)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:127)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/249605067.apply(Unknown
>  Source)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>
>
>


OutOfMemoryError

2021-06-30 Thread javaguy Java
Hi,

I'm getting Java OOM errors even though I'm setting my driver memory to 24g
and I'm executing against local[*]

I was wondering if anyone can give me any insight.  The server this
job is running on has more than enough memory as does the spark
driver.

The final result does write 3 csv files that are 300MB each so there's
no way it's coming close to the 24g

From the OOM, I don't know about the internals of Spark itself to tell
me where this is failing + how I should refactor or change anything

Would appreciate any advice on how I can resolve

Thx


Parameters here:

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("OOM")
  .config("spark.driver.host", "localhost")
  .config("spark.driver.maxResultSize", "0")
  .config("spark.sql.caseSensitive", "false")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.driver.memory", "24g")
  .getOrCreate()


My OOM errors are below:

driver): java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.(BufferedOutputStream.java:76)
at 
org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:109)
at 
org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:110)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at 
org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/1058609963.apply(Unknown
Source)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)




driver): java.lang.OutOfMemoryError: Java heap space
at 
net.jpountz.lz4.LZ4BlockOutputStream.(LZ4BlockOutputStream.java:102)
at 
org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:145)
at 
org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:158)
at 
org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:133)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:122)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at 
org.apache.spark.executor.Executor$TaskRunner$$Lambda$1792/249605067.apply(Unknown
Source)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)


Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-12 Thread Jacob Lynn
Thanks for the pointer, Vadim. However, I just tried it with Spark 2.4 and
get the same failure. (I was previously testing with 2.2 and/or 2.3.) And I
don't see this particular issue referred to there.  The ticket that Harel
commented on indeed appears to be the most similar one to this issue:
https://issues.apache.org/jira/browse/SPARK-1239.

On Mon, Nov 11, 2019 at 4:43 PM Vadim Semenov  wrote:

> There's an umbrella ticket for various 2GB limitations
> https://issues.apache.org/jira/browse/SPARK-6235
>
> On Fri, Nov 8, 2019 at 4:11 PM Jacob Lynn  wrote:
> >
> > Sorry for the noise, folks! I understand that reducing the number of
> partitions works around the issue (at the scale I'm working at, anyway) --
> as I mentioned in my initial email -- and I understand the root cause. I'm
> not looking for advice on how to resolve my issue. I'm just pointing out
> that this is a real bug/limitation that impacts real-world use cases, in
> case there is some proper Spark dev out there who is looking for a problem
> to solve.
> >
> > On Fri, Nov 8, 2019 at 2:24 PM Vadim Semenov 
> wrote:
> >>
> >> Basically, the driver tracks partitions and sends it over to
> >> executors, so what it's trying to do is to serialize and compress the
> >> map but because it's so big, it goes over 2GiB and that's Java's limit
> >> on the max size of byte arrays, so the whole thing drops.
> >>
> >> The size of data doesn't matter here much but the number of partitions
> >> is what the root cause of the issue, try reducing it below 3 and
> >> see how it goes.
> >>
> >> On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > We are running a Spark (2.3.1) job on an EMR cluster with 500
> r3.2xlarge (60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
> >> >
> >> > It processes ~40 TB of data using aggregateByKey in which we specify
> numPartitions = 300,000.
> >> > Map side tasks succeed, but reduce side tasks all fail.
> >> >
> >> > We notice the following driver error:
> >> >
> >> > 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
> >> >
> >> >  java.lang.OutOfMemoryError
> >> >
> >> > at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> >> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> >> > at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> >> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> >> > at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> >> > at
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> >> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> >> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> >> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> >> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
> >> > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
> >> > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
> >> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
> >> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
> >> > at
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> >> > at
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> >> > at
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> >> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> > at java.lang.Thread.run(Thread.java:748)
> >> > Exception in thread "map-output-dispatcher-0"
> java.lang.OutOfMemoryError
> >> > at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> >> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> >> > at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> >> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> >> > at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> >> > at
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> >> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> >> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> >> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> >> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> >> > at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> >> > at 

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-11 Thread Vadim Semenov
There's an umbrella ticket for various 2GB limitations
https://issues.apache.org/jira/browse/SPARK-6235

On Fri, Nov 8, 2019 at 4:11 PM Jacob Lynn  wrote:
>
> Sorry for the noise, folks! I understand that reducing the number of 
> partitions works around the issue (at the scale I'm working at, anyway) -- as 
> I mentioned in my initial email -- and I understand the root cause. I'm not 
> looking for advice on how to resolve my issue. I'm just pointing out that 
> this is a real bug/limitation that impacts real-world use cases, in case 
> there is some proper Spark dev out there who is looking for a problem to 
> solve.
>
> On Fri, Nov 8, 2019 at 2:24 PM Vadim Semenov  
> wrote:
>>
>> Basically, the driver tracks partitions and sends it over to
>> executors, so what it's trying to do is to serialize and compress the
>> map but because it's so big, it goes over 2GiB and that's Java's limit
>> on the max size of byte arrays, so the whole thing drops.
>>
>> The size of data doesn't matter here much but the number of partitions
>> is what the root cause of the issue, try reducing it below 3 and
>> see how it goes.
>>
>> On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman  wrote:
>> >
>> > Hi,
>> >
>> > We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge 
>> > (60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
>> >
>> > It processes ~40 TB of data using aggregateByKey in which we specify 
>> > numPartitions = 300,000.
>> > Map side tasks succeed, but reduce side tasks all fail.
>> >
>> > We notice the following driver error:
>> >
>> > 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
>> >
>> >  java.lang.OutOfMemoryError
>> >
>> > at 
>> > java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>> > at 
>> > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> > at 
>> > java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
>> > at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
>> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
>> > at 
>> > java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
>> > at 
>> > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
>> > at 
>> > java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
>> > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
>> > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
>> > at 
>> > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
>> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
>> > at 
>> > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
>> > at 
>> > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
>> > at 
>> > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
>> > at 
>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> > at 
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> > at java.lang.Thread.run(Thread.java:748)
>> > Exception in thread "map-output-dispatcher-0" java.lang.OutOfMemoryError
>> > at 
>> > java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>> > at 
>> > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> > at 
>> > java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
>> > at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
>> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
>> > at 
>> > java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
>> > at 
>> > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
>> > at 
>> > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>> > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
>> > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>> > at 
>> > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:787)
>> > at 
>> > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
>> > at 
>> > 

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Jacob Lynn
Sorry for the noise, folks! I understand that reducing the number of
partitions works around the issue (at the scale I'm working at, anyway) --
as I mentioned in my initial email -- and I understand the root cause. I'm
not looking for advice on how to resolve my issue. I'm just pointing out
that this is a real bug/limitation that impacts real-world use cases, in
case there is some proper Spark dev out there who is looking for a problem
to solve.

On Fri, Nov 8, 2019 at 2:24 PM Vadim Semenov 
wrote:

> Basically, the driver tracks partitions and sends it over to
> executors, so what it's trying to do is to serialize and compress the
> map but because it's so big, it goes over 2GiB and that's Java's limit
> on the max size of byte arrays, so the whole thing drops.
>
> The size of data doesn't matter here much but the number of partitions
> is what the root cause of the issue, try reducing it below 3 and
> see how it goes.
>
> On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman 
> wrote:
> >
> > Hi,
> >
> > We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge
> (60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
> >
> > It processes ~40 TB of data using aggregateByKey in which we specify
> numPartitions = 300,000.
> > Map side tasks succeed, but reduce side tasks all fail.
> >
> > We notice the following driver error:
> >
> > 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
> >
> >  java.lang.OutOfMemoryError
> >
> > at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> > at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> > at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> > at
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
> > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
> > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
> > at
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> > at
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> > at
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> > Exception in thread "map-output-dispatcher-0" java.lang.OutOfMemoryError
> > at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> > at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> > at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> > at
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:787)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> > at
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> > at
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> > at
> 

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Vadim Semenov
Basically, the driver tracks partitions and sends them over to the
executors, so what it's trying to do is serialize and compress the map;
but because it's so big, it goes over 2GiB, and that's Java's limit on
the max size of byte arrays, so the whole thing drops.

The size of the data doesn't matter much here; the number of partitions
is the root cause of the issue. Try reducing it below 3 and see how
it goes.
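
A back-of-the-envelope sketch of the scale involved (the one-bit-per-pair
figure is only an order-of-magnitude assumption about the compressed
map-status structure, not its exact format):

map_tasks = 300000
reduce_partitions = 300000

# Assume roughly one bit per (map task, reduce partition) pair after compression.
approx_bytes = map_tasks * reduce_partitions / 8
java_byte_array_limit = 2**31 - 1             # ~2 GiB, the maximum size of a byte[]

print(f"~{approx_bytes / 2**30:.1f} GiB of serialized map-output status")  # ~10.5 GiB
print(approx_bytes > java_byte_array_limit)   # True -> the serialized form cannot fit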

On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman  wrote:
>
> Hi,
>
> We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge (60 
> GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
>
> It processes ~40 TB of data using aggregateByKey in which we specify 
> numPartitions = 300,000.
> Map side tasks succeed, but reduce side tasks all fail.
>
> We notice the following driver error:
>
> 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
>
>  java.lang.OutOfMemoryError
>
> at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
> at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
> at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "map-output-dispatcher-0" java.lang.OutOfMemoryError
> at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:787)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Suppressed: java.lang.OutOfMemoryError
> at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at 

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Jacob Lynn
File system is HDFS. Executors are 2 cores, 14GB RAM. But I don't think
either of these relates to the problem -- this is a memory allocation issue
on the driver side, and happens in an intermediate stage that has no HDFS
read/write.

On Fri, Nov 8, 2019 at 10:01 AM Spico Florin  wrote:

> Hi!
> What file system are you using: EMRFS or HDFS?
> Also what memory are you using for the reducer ?
>
> On Thu, Nov 7, 2019 at 8:37 PM abeboparebop 
> wrote:
>
>> I ran into the same issue processing 20TB of data, with 200k tasks on both
>> the map and reduce sides. Reducing to 100k tasks each resolved the issue.
>> But this could/would be a major problem in cases where the data is bigger
>> or
>> the computation is heavier, since reducing the number of partitions may
>> not
>> be an option.
>>
>>
>> harelglik wrote
>> > I understand the error is because the number of partitions is very high,
>> > yet when processing 40 TB (and this number is expected to grow) this
>> > number
>> > seems reasonable:
>> > 40TB / 300,000 will result in partitions size of ~ 130MB (data should be
>> > evenly distributed).
>> >
>> > On Fri, Sep 7, 2018 at 6:28 PM Vadim Semenov 
>>
>> > vadim@
>>
>> >  wrote:
>> >
>> >> You have too many partitions, so when the driver is trying to gather
>> >> the status of all map outputs and send back to executors it chokes on
>> >> the size of the structure that needs to be GZipped, and since it's
>> >> bigger than 2GiB, it produces OOM.
>> >> On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman 
>>
>> > harelglik@
>>
>> > 
>> >> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > We are running a Spark (2.3.1) job on an EMR cluster with 500
>> >> r3.2xlarge
>> >> (60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
>> >> >
>> >> > It processes ~40 TB of data using aggregateByKey in which we specify
>> >> numPartitions = 300,000.
>> >> > Map side tasks succeed, but reduce side tasks all fail.
>> >> >
>> >> > We notice the following driver error:
>> >> >
>> >> > 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
>> >> >
>> >> >  java.lang.OutOfMemoryError
>> >> >
>> >> > at
>> >>
>> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>> >> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>> >> > at
>> >>
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> >> > at
>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> >> > at
>> >>
>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
>> >> > at
>> >> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
>> >> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
>> >> > at
>> >>
>> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
>> >> > at
>> >>
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
>> >> > at
>> >>
>> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
>> >> > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
>> >> > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
>> >> > at
>> >>
>> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
>> >> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
>> >> > at
>> >>
>> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
>> >> > at
>> >>
>> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
>> >> > at
>> >>
>> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
>> >> > at
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> >> > at
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> >> > at java.lang.Thread.run(Thread.java:748)
>> >> > Exception in thread "map-output-dispatcher-0"
>> >> java.lang.OutOfMemoryError
>> >> > at
>> >>
>> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>> >> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>> >> > at
>> >>
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> >> > at
>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> >> > at
>> >>
>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
>> >> > at
>> >> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
>> >> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
>> >> > at
>> >>
>> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
>> >> > at
>> >>
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
>> >> > at
>> >>
>> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>> >> > at

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Spico Florin
Hi!
What file system are you using: EMRFS or HDFS?
Also, what memory are you using for the reducer?

On Thu, Nov 7, 2019 at 8:37 PM abeboparebop  wrote:

> I ran into the same issue processing 20TB of data, with 200k tasks on both
> the map and reduce sides. Reducing to 100k tasks each resolved the issue.
> But this could/would be a major problem in cases where the data is bigger
> or
> the computation is heavier, since reducing the number of partitions may not
> be an option.
>
>
> harelglik wrote
> > I understand the error is because the number of partitions is very high,
> > yet when processing 40 TB (and this number is expected to grow) this
> > number
> > seems reasonable:
> > 40TB / 300,000 will result in partitions size of ~ 130MB (data should be
> > evenly distributed).
> >
> > On Fri, Sep 7, 2018 at 6:28 PM Vadim Semenov vadim@ wrote:
> >
> >> You have too many partitions, so when the driver is trying to gather
> >> the status of all map outputs and send back to executors it chokes on
> >> the size of the structure that needs to be GZipped, and since it's
> >> bigger than 2GiB, it produces OOM.
> >> On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman harelglik@ wrote:
> >> >
> >> > Hi,
> >> >
> >> > We are running a Spark (2.3.1) job on an EMR cluster with 500
> >> r3.2xlarge
> >> (60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
> >> >
> >> > It processes ~40 TB of data using aggregateByKey in which we specify
> >> numPartitions = 300,000.
> >> > Map side tasks succeed, but reduce side tasks all fail.
> >> >
> >> > We notice the following driver error:
> >> >
> >> > 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
> >> >
> >> >  java.lang.OutOfMemoryError
> >> >
> >> > at
> >>
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> >> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> >> > at
> >>
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> >> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> >> > at
> >>
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> >> > at
> >> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> >> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> >> > at
> >>
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> >> > at
> >>
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> >> > at
> >>
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
> >> > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
> >> > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
> >> > at
> >>
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
> >> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
> >> > at
> >>
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> >> > at
> >>
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> >> > at
> >>
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> >> > at
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >> > at
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> > at java.lang.Thread.run(Thread.java:748)
> >> > Exception in thread "map-output-dispatcher-0"
> >> java.lang.OutOfMemoryError
> >> > at
> >>
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> >> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> >> > at
> >>
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> >> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> >> > at
> >>
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> >> > at
> >> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> >> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> >> > at
> >>
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> >> > at
> >>
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> >> > at
> >>
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> >> > at
> >> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> >> > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> >> > at
> >> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> >> > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> >> > at
> >>
> 

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-07 Thread abeboparebop
I ran into the same issue processing 20TB of data, with 200k tasks on both
the map and reduce sides. Reducing to 100k tasks each resolved the issue.
But this could/would be a major problem in cases where the data is bigger or
the computation is heavier, since reducing the number of partitions may not
be an option.


harelglik wrote
> I understand the error is because the number of partitions is very high,
> yet when processing 40 TB (and this number is expected to grow) this
> number
> seems reasonable:
> 40TB / 300,000 will result in partitions size of ~ 130MB (data should be
> evenly distributed).
> 
> On Fri, Sep 7, 2018 at 6:28 PM Vadim Semenov vadim@ wrote:
> 
>> You have too many partitions, so when the driver is trying to gather
>> the status of all map outputs and send back to executors it chokes on
>> the size of the structure that needs to be GZipped, and since it's
>> bigger than 2GiB, it produces OOM.
>> On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman harelglik@ wrote:
>> >
>> > Hi,
>> >
>> > We are running a Spark (2.3.1) job on an EMR cluster with 500
>> r3.2xlarge
>> (60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
>> >
>> > It processes ~40 TB of data using aggregateByKey in which we specify
>> numPartitions = 300,000.
>> > Map side tasks succeed, but reduce side tasks all fail.
>> >
>> > We notice the following driver error:
>> >
>> > 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
>> >
>> >  java.lang.OutOfMemoryError
>> >
>> > at
>> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>> > at
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> > at
>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
>> > at
>> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
>> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
>> > at
>> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
>> > at
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
>> > at
>> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
>> > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
>> > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
>> > at
>> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
>> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
>> > at
>> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
>> > at
>> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
>> > at
>> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
>> > at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> > at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> > at java.lang.Thread.run(Thread.java:748)
>> > Exception in thread "map-output-dispatcher-0"
>> java.lang.OutOfMemoryError
>> > at
>> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>> > at
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>> > at
>> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
>> > at
>> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
>> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
>> > at
>> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
>> > at
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
>> > at
>> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>> > at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>> > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
>> > at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
>> > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>> > at
>> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:787)
>> > at
>> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
>> > at
>> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
>> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
>> > at
>> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Harel Gliksman
I understand the error is because the number of partitions is very high,
yet when processing 40 TB (and this number is expected to grow) this number
seems reasonable:
40 TB / 300,000 results in a partition size of ~130 MB (the data should be
evenly distributed).

On Fri, Sep 7, 2018 at 6:28 PM Vadim Semenov  wrote:

> You have too many partitions, so when the driver is trying to gather
> the status of all map outputs and send back to executors it chokes on
> the size of the structure that needs to be GZipped, and since it's
> bigger than 2GiB, it produces OOM.
> On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman 
> wrote:
> >
> > Hi,
> >
> > We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge
> (60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
> >
> > It processes ~40 TB of data using aggregateByKey in which we specify
> numPartitions = 300,000.
> > Map side tasks succeed, but reduce side tasks all fail.
> >
> > We notice the following driver error:
> >
> > 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
> >
> >  java.lang.OutOfMemoryError
> >
> > at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> > at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> > at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> > at
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
> > at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
> > at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
> > at
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> > at
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> > at
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> > Exception in thread "map-output-dispatcher-0" java.lang.OutOfMemoryError
> > at
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> > at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> > at
> java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> > at
> java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> > at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> > at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:787)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> > at
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> > at
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> > at
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> > at
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> > Suppressed: java.lang.OutOfMemoryError
> > at
> 

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Vadim Semenov
You have too many partitions, so when the driver tries to gather the
status of all map outputs and send it back to the executors, it chokes on
the size of the structure that needs to be GZipped; since it's bigger
than 2 GiB, it produces an OOM.
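To make the suggestion concrete, here is a hedged sketch of the same aggregateByKey pattern with far fewer reduce partitions, sized so the driver's serialized map-status structure stays well under the 2 GiB limit. The key/value layout and the 20,000-partition figure are illustrative assumptions, not values from the original job.

import org.apache.spark.sql.SparkSession

object FewerReducePartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FewerReducePartitions").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in key/value pairs; in the reported job this would be the ~40 TB input.
    val pairs = sc.parallelize(1L to 10000000L).map(i => (i % 100000, 1L))

    // 300,000 reduce partitions made the GZipped map-status array exceed 2 GiB on the
    // driver; a much smaller numPartitions (keeping partitions at a few hundred MB)
    // keeps that structure serializable.
    val aggregated = pairs.aggregateByKey(0L, 20000)((acc, v) => acc + v, (a, b) => a + b)

    println(aggregated.count())
    spark.stop()
  }
}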
On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman  wrote:
>
> Hi,
>
> We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge (60 
> GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.
>
> It processes ~40 TB of data using aggregateByKey in which we specify 
> numPartitions = 300,000.
> Map side tasks succeed, but reduce side tasks all fail.
>
> We notice the following driver error:
>
> 18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null
>
>  java.lang.OutOfMemoryError
>
> at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
> at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
> at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "map-output-dispatcher-0" java.lang.OutOfMemoryError
> at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:787)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
> at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Suppressed: java.lang.OutOfMemoryError
> at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
> at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
> at 

Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2018-09-07 Thread Harel Gliksman
Hi,

We are running a Spark (2.3.1) job on an EMR cluster with 500 r3.2xlarge
(60 GB, 8 vcores, 160 GB SSD ). Driver memory is set to 25GB.

It processes ~40 TB of data using aggregateByKey in which we specify
numPartitions = 300,000.
Map side tasks succeed, but reduce side tasks all fail.

We notice the following driver error:

18/09/07 13:35:03 WARN Utils: Suppressing exception in finally: null

 java.lang.OutOfMemoryError

at 
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at 
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:790)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1389)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
at 
org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "map-output-dispatcher-0" java.lang.OutOfMemoryError
at 
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at 
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:787)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:786)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1380)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:789)
at 
org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:174)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:397)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.OutOfMemoryError
at 
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at 
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at 
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at 
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
  

[Structured Streaming] HDFSBackedStateStoreProvider OutOfMemoryError

2018-03-30 Thread ahmed alobaidi
Hi All,

I'm working on a simple structured streaming query that
uses flatMapGroupsWithState to maintain a relatively large state.

After running the application for few minutes on my local machine, it
starts to slow down and then crashes with OutOfMemoryError.

Tracking the code led me to HDFSBackedStateStoreProvider. It seems the
provider loads the data for all versions of all of its StateStores into a
ConcurrentHashMap (loadedMaps) and never frees that up unless the provider
itself is closed.

While I can see the performance advantage when the state is small, for
other use cases this approach does not seem scalable. It might make sense
to load into memory only the StateStores needed by the active tasks and
unload them when a task is done. That way the user can divide the state
into a larger number of partitions to make it fit in memory.

I was able to get the application to work without memory problems by adding
code in HDFSBackedStateStore.commit() that clears loadedMaps and its
contents, but I'm pretty sure this will introduce concurrency bugs.

Not sure if I'm missing something, or whether there is a way to configure
how HDFSBackedStateStoreProvider allocates memory.
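One way to keep the loaded state bounded with the current provider is to expire idle keys with a group-state timeout, so each StateStore's maps stay small. Below is a minimal, hedged sketch assuming a simple keyed count; Event, Counter, the rate source and the 30-minute timeout are placeholders, not the real query. (If I recall correctly, later Spark releases also add a spark.sql.streaming.maxBatchesToRetainInMemory setting that caps how many loaded versions the HDFS-backed provider keeps.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, value: Long)
case class Counter(count: Long)

object BoundedState {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("BoundedState").getOrCreate()
    import spark.implicits._

    // Stand-in stream; the real application reads from its own source.
    val events = spark.readStream.format("rate").load()
      .selectExpr("cast(value % 100 as string) as key", "value")
      .as[Event]

    def update(key: String, rows: Iterator[Event],
               state: GroupState[Counter]): Iterator[(String, Long)] = {
      if (state.hasTimedOut) {
        state.remove()                      // expire idle keys instead of keeping them forever
        Iterator.empty
      } else {
        val next = Counter(state.getOption.map(_.count).getOrElse(0L) + rows.size)
        state.update(next)
        state.setTimeoutDuration("30 minutes")
        Iterator((key, next.count))
      }
    }

    val query = events
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Update,
        GroupStateTimeout.ProcessingTimeTimeout)(update)
      .writeStream.outputMode("update").format("console").start()

    query.awaitTermination()
  }
}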

Thanks,
Ahmed


Java heap space OutOfMemoryError in pyspark spark-submit (spark version:2.2)

2018-01-04 Thread Anu B Nair
Hi,

I have a data set of size 10 GB (for example, Test.txt).

I wrote my pyspark script as below (Test.py):

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession.builder.appName("FilterProduct").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
lines = spark.read.text("C:/Users/test/Desktop/Test.txt").rdd
lines.collect()

Then I am executing the above script using the command below:

spark-submit Test.py --executor-memory  15G --driver-memory 15G

Then I am getting an error like the one below:



17/12/29 13:27:18 INFO FileScanRDD: Reading File path:
file:///C:/Users/test/Desktop/Test.txt, range: 402653184-536870912,
partition values: [empty row]
17/12/29 13:27:18 INFO CodeGenerator: Code generated in 22.743725 ms
17/12/29 13:27:44 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/12/29 13:27:44 ERROR Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93

Please let me know how to resolve this?
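As a hedged aside on what may be going on here: lines.collect() pulls the entire 10 GB file into the driver JVM, which no reasonable driver heap will hold, and spark-submit only honors options such as --executor-memory and --driver-memory when they appear before the application file (anything after Test.py is passed to the script as application arguments, so the job above actually ran with default memory settings). A corrected invocation would look roughly like:

spark-submit --executor-memory 15G --driver-memory 15G Test.py

Replacing collect() with something bounded such as lines.take(10), or writing the result out with a DataFrame writer, avoids materializing the whole data set on the driver.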

--


Anu


OutOfMemoryError

2017-06-23 Thread Tw UxTLi51Nus

Hi,

I have a dataset with ~5M rows x 20 columns, containing a groupID and a 
rowID. My goal is to check whether (some) columns contain more than a 
fixed fraction (say, 50%) of missing (null) values within a group. If 
this is found, the entire column is set to missing (null), for that 
group.


The Problem:
The loop runs like a charm during the first iterations, but towards the 
end, around the 6th or 7th iteration I see my CPU utilization dropping 
(using 1 instead of 6 cores). Along with that, execution time for one 
iteration increases significantly. At some point, I get an OutOfMemory 
Error:


* spark.driver.memory < 4G: at collect() (FAIL 1)
* 4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)

Enabling a HeapDump on OOM (and analyzing it with Eclipse MAT) showed 
two classes taking up lots of memory:


* java.lang.Thread
  - char (2G)
  - scala.collection.IndexedSeqLike
  - scala.collection.mutable.WrappedArray (1G)
  - java.lang.String (1G)

* org.apache.spark.sql.execution.ui.SQLListener
  - org.apache.spark.sql.execution.ui.SQLExecutionUIData
(various of up to 1G in size)
  - java.lang.String
  - ...

Turning off the SparkUI and/or setting spark.ui.retainedXXX to something 
low (e.g. 1) did not solve the issue.


Any idea what I am doing wrong? Or is this a bug?

My Code can be found as a Github Gist [0]. More details can be found on 
the StackOverflow Question [1] I posted, but did not receive any answers 
until now.


Thanks!

[0] 
https://gist.github.com/TwUxTLi51Nus/4accdb291494be9201abfad72541ce74
[1] 
http://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace


PS: As a workaround, I have been using "checkpoint" after every few 
iterations.
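For reference, a minimal sketch of that checkpoint-every-few-iterations workaround, assuming Spark 2.1+ where Dataset.checkpoint is available; the column names, the dummy per-iteration transformation and the every-5-iterations cadence are placeholders rather than the code from the gist above.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object CheckpointedLoop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CheckpointedLoop").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    // Stand-in for the ~5M x 20 dataset with a groupID, a rowID and nullable columns.
    var df: DataFrame = spark.range(0, 5000000)
      .selectExpr("id % 1000 as groupID", "id as rowID", "if(id % 3 = 0, null, id) as col1")

    for (i <- 1 to 20) {
      // Stand-in for one iteration of the null-masking logic.
      df = df.withColumn("col1", expr("if(col1 is null, null, col1 + 1)"))

      if (i % 5 == 0) {
        // Eagerly materialize and truncate the lineage so the query plan (and the
        // driver-side bookkeeping that grows with it) does not accumulate across iterations.
        df = df.checkpoint(eager = true)
      }
    }

    println(df.count())
    spark.stop()
  }
}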



--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.co





OutOfMemoryError

2017-05-02 Thread TwUxTLi51Nus

Hi Spark Users,

I have a dataset with ~5M rows x 20 columns, containing a groupID and a 
rowID. My goal is to check whether (some) columns contain more than a 
fixed fraction (say, 50%) of missing (null) values within a group. If 
this is found, the entire column is set to missing (null), for that 
group.


The Problem:
The loop runs like a charm during the first iterations, but towards the 
end, around the 6th or 7th iteration I see my CPU utilization dropping 
(using 1 instead of 6 cores). Along with that, execution time for one 
iteration increases significantly. At some point, I get an OutOfMemory 
Error:


* spark.driver.memory < 4G: at collect() (FAIL 1)
* 4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)

Enabling a HeapDump on OOM (and analyzing it with Eclipse MAT) showed 
two classes taking up lots of memory:


* java.lang.Thread
  - char (2G)
  - scala.collection.IndexedSeqLike
  - scala.collection.mutable.WrappedArray (1G)
  - java.lang.String (1G)

* org.apache.spark.sql.execution.ui.SQLListener
  - org.apache.spark.sql.execution.ui.SQLExecutionUIData
(various of up to 1G in size)
  - java.lang.String
  - ...

Turning off the SparkUI and/or setting spark.ui.retainedXXX to something 
low (e.g. 1) did not solve the issue.


Any idea what I am doing wrong? Or is this a bug?

My Code can be found as a Github Gist [0]. More details can be found on 
the StackOverflow Question [1] I posted, but did not receive any answers 
until now.


Thanks!

[0] 
https://gist.github.com/TwUxTLi51Nus/4accdb291494be9201abfad72541ce74
[1] 
http://stackoverflow.com/questions/43637913/apache-spark-outofmemoryerror-heapspace


PS: As a workaround, I have been writing and reading temporary parquet 
files on each loop iteration.



--
Tw UxTLi51Nus
Email: twuxtli51...@posteo.de




Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
Not sure if anyone else here can help you. But if I were you, I would adjust 
SPARK_DAEMON_MEMORY to 2g, to bump the worker to 2 GB. Even though the worker's 
responsibility is very limited, in today's world, who knows. Give 2g a try 
to see if the problem goes away.


BTW, in our production, I set the worker to 2g and have never experienced any OOM 
from the workers. Our cluster has been live for more than a year, and we also run Spark 
1.6.2 in production.
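The change amounts to one line in conf/spark-env.sh on the master and worker hosts (a sketch assuming the standalone deployment described below), followed by a restart of the daemons:

export SPARK_DAEMON_MEMORY=2g

SPARK_DAEMON_MEMORY sets the heap of the standalone master and worker daemons themselves, separately from executor and driver memory.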


Yong



From: Behroz Sikander <behro...@gmail.com>
Sent: Friday, March 24, 2017 9:29 AM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

Yea we also didn't find anything related to this online.

Are you aware of any memory leaks in worker in 1.6.2 spark which might be 
causing this ?
Do you know of any documentation which explains all the tasks that a worker is 
performing ? Maybe we can get some clue from there.

Regards,
Behroz

On Fri, Mar 24, 2017 at 2:21 PM, Yong Zhang 
<java8...@hotmail.com<mailto:java8...@hotmail.com>> wrote:

I never experienced worker OOM or very rarely see this online. So my guess that 
you have to generate the heap dump file to analyze it.


Yong



From: Behroz Sikander <behro...@gmail.com<mailto:behro...@gmail.com>>
Sent: Friday, March 24, 2017 9:15 AM
To: Yong Zhang
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

Thank you for the response.

Yes, I am sure because the driver was working fine. Only 2 workers went down 
with OOM.

Regards,
Behroz

On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang 
<java8...@hotmail.com<mailto:java8...@hotmail.com>> wrote:

I am not 100% sure, but normally "dispatcher-event-loop" OOM means the driver 
OOM. Are you sure your workers OOM?


Yong



From: bsikander <behro...@gmail.com<mailto:behro...@gmail.com>>
Sent: Friday, March 24, 2017 5:48 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

Spark version: 1.6.2
Hadoop: 2.6.0

Cluster:
All VMS are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, 2 of our workers went down with out of memory exception.
java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)

Both of these worker processes were in hanged state. We restarted them to
bring them back to normal state.

Here is the complete exception
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91




Master's spark-default.conf file:
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d




Master's spark-env.sh
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-default.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason of our workers crashing due to OutOfMemory ?
How can we avoid that in future.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Worker-Crashing-OutOfMemoryError-GC-overhead-limit-execeeded-tp28535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Behroz Sikander
Yeah, we also didn't find anything related to this online.

Are you aware of any memory leaks in the worker in Spark 1.6.2 which might be
causing this?
Do you know of any documentation which explains all the tasks that a worker
performs? Maybe we can get some clue from there.

Regards,
Behroz

On Fri, Mar 24, 2017 at 2:21 PM, Yong Zhang <java8...@hotmail.com> wrote:

> I never experienced worker OOM or very rarely see this online. So my guess
> that you have to generate the heap dump file to analyze it.
>
>
> Yong
>
>
> --
> *From:* Behroz Sikander <behro...@gmail.com>
> *Sent:* Friday, March 24, 2017 9:15 AM
> *To:* Yong Zhang
> *Cc:* user@spark.apache.org
> *Subject:* Re: [Worker Crashing] OutOfMemoryError: GC overhead limit
> execeeded
>
> Thank you for the response.
>
> Yes, I am sure because the driver was working fine. Only 2 workers went
> down with OOM.
>
> Regards,
> Behroz
>
> On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> I am not 100% sure, but normally "dispatcher-event-loop" OOM means the
>> driver OOM. Are you sure your workers OOM?
>>
>>
>> Yong
>>
>>
>> ------
>> *From:* bsikander <behro...@gmail.com>
>> *Sent:* Friday, March 24, 2017 5:48 AM
>> *To:* user@spark.apache.org
>> *Subject:* [Worker Crashing] OutOfMemoryError: GC overhead limit
>> execeeded
>>
>> Spark version: 1.6.2
>> Hadoop: 2.6.0
>>
>> Cluster:
>> All VMS are deployed on AWS.
>> 1 Master (t2.large)
>> 1 Secondary Master (t2.large)
>> 5 Workers (m4.xlarge)
>> Zookeeper (t2.large)
>>
>> Recently, 2 of our workers went down with out of memory exception.
>> java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)
>>
>> Both of these worker processes were in hanged state. We restarted them to
>> bring them back to normal state.
>>
>> Here is the complete exception
>> https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91
>>
>>
>>
>> Master's spark-default.conf file:
>> https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d
>>
>>
>>
>> Master's spark-env.sh
>> https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0
>>
>> Slave's spark-default.conf file:
>> https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c
>>
>> So, what could be the reason of our workers crashing due to OutOfMemory ?
>> How can we avoid that in future.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Worker-Crashing-OutOfMemoryError-GC-overhead-limit-execeeded-tp28535.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
I have never experienced a worker OOM, and only very rarely see it mentioned online. So my 
guess is that you have to generate the heap dump file and analyze it.


Yong



From: Behroz Sikander <behro...@gmail.com>
Sent: Friday, March 24, 2017 9:15 AM
To: Yong Zhang
Cc: user@spark.apache.org
Subject: Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

Thank you for the response.

Yes, I am sure because the driver was working fine. Only 2 workers went down 
with OOM.

Regards,
Behroz

On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang 
<java8...@hotmail.com<mailto:java8...@hotmail.com>> wrote:

I am not 100% sure, but normally "dispatcher-event-loop" OOM means the driver 
OOM. Are you sure your workers OOM?


Yong



From: bsikander <behro...@gmail.com<mailto:behro...@gmail.com>>
Sent: Friday, March 24, 2017 5:48 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

Spark version: 1.6.2
Hadoop: 2.6.0

Cluster:
All VMS are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, 2 of our workers went down with out of memory exception.
java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)

Both of these worker processes were in hanged state. We restarted them to
bring them back to normal state.

Here is the complete exception
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91




Master's spark-default.conf file:
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d




Master's spark-env.sh
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-default.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason of our workers crashing due to OutOfMemory ?
How can we avoid that in future.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Worker-Crashing-OutOfMemoryError-GC-overhead-limit-execeeded-tp28535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Behroz Sikander
Thank you for the response.

Yes, I am sure because the driver was working fine. Only 2 workers went
down with OOM.

Regards,
Behroz

On Fri, Mar 24, 2017 at 2:12 PM, Yong Zhang <java8...@hotmail.com> wrote:

> I am not 100% sure, but normally "dispatcher-event-loop" OOM means the
> driver OOM. Are you sure your workers OOM?
>
>
> Yong
>
>
> --
> *From:* bsikander <behro...@gmail.com>
> *Sent:* Friday, March 24, 2017 5:48 AM
> *To:* user@spark.apache.org
> *Subject:* [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded
>
> Spark version: 1.6.2
> Hadoop: 2.6.0
>
> Cluster:
> All VMS are deployed on AWS.
> 1 Master (t2.large)
> 1 Secondary Master (t2.large)
> 5 Workers (m4.xlarge)
> Zookeeper (t2.large)
>
> Recently, 2 of our workers went down with out of memory exception.
> java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)
>
> Both of these worker processes were in hanged state. We restarted them to
> bring them back to normal state.
>
> Here is the complete exception
> https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91
>
>
>
> Master's spark-default.conf file:
> https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d
>
>
>
> Master's spark-env.sh
> https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0
>
> Slave's spark-default.conf file:
> https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c
>
> So, what could be the reason of our workers crashing due to OutOfMemory ?
> How can we avoid that in future.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Worker-Crashing-OutOfMemoryError-GC-overhead-limit-execeeded-tp28535.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
I am not 100% sure, but normally "dispatcher-event-loop" OOM means the driver 
OOM. Are you sure your workers OOM?


Yong



From: bsikander <behro...@gmail.com>
Sent: Friday, March 24, 2017 5:48 AM
To: user@spark.apache.org
Subject: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

Spark version: 1.6.2
Hadoop: 2.6.0

Cluster:
All VMS are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, 2 of our workers went down with out of memory exception.
java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)

Both of these worker processes were in hanged state. We restarted them to
bring them back to normal state.

Here is the complete exception
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91




Master's spark-default.conf file:
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d




Master's spark-env.sh
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-default.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason of our workers crashing due to OutOfMemory ?
How can we avoid that in future.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Worker-Crashing-OutOfMemoryError-GC-overhead-limit-execeeded-tp28535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




[Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread bsikander
Spark version: 1.6.2
Hadoop: 2.6.0

Cluster:
All VMS are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, 2 of our workers went down with out of memory exception. 
java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)

Both of these worker processes were in hanged state. We restarted them to
bring them back to normal state.

Here is the complete exception 
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91

Master's spark-default.conf file: 
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d

Master's spark-env.sh
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-default.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason of our workers crashing due to OutOfMemory ?
How can we avoid that in future.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Worker-Crashing-OutOfMemoryError-GC-overhead-limit-execeeded-tp28535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




[Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-23 Thread Behroz Sikander
Hello,
Spark version: 1.6.2
Hadoop: 2.6.0

Cluster:
All VMS are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, 2 of our workers went down with out of memory exception.

> java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)


Both of these worker processes were in hanged state. We restarted them to
bring them back to normal state.

Here is the complete exception
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91

Master's spark-default.conf file:
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d

Master's spark-env.sh
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-default.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason of our workers crashing due to OutOfMemory ?
How can we avoid that in future.

Regards,
Behroz


OutOfMemoryError while running job...

2016-12-06 Thread Kevin Burton
I am trying to run a Spark job which reads from ElasticSearch and should
write its output back to a separate ElasticSearch index. Unfortunately I
keep getting `java.lang.OutOfMemoryError: Java heap space` exceptions. I've
tried running it with: --conf spark.memory.offHeap.enabled=true --conf
spark.memory.offHeap.size=2147483648 --conf
spark.executor.memory=4g. That didn't help though.

I use Spark version: 2.0.0, 55 worker nodes, ElasticSearch version: 2.3.3,
Scala version 2.11.8, Java 1.8.0_60.

scala> unique_authors.saveToEs("top_users_2016_11_29_to_2016_12_05/user")
[Stage 1:> (0 + 108) /
2048]16/12/06 03:19:40 WARN TaskSetManager: Lost task 78.0 in stage 1.0
(TID 148, 136.243.58.230): java.lang.OutOfMemoryError: Java heap space
at org.spark_project.guava.collect.Ordering.leastOf(
Ordering.java:657)
at org.apache.spark.util.collection.Utils$.takeOrdered(
Utils.scala:37)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec$$
anonfun$4.apply(limit.scala:143)
at org.apache.spark.sql.execution.TakeOrderedAndProjectExec$$
anonfun$4.apply(limit.scala:142)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(
MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(
MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(
Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Links to logs:
Spark-shell log: https://gist.github.com/lakomiec/e53f8e3f0a7227f751978f5ad95b6c52
Content of compute-top-unique-users.scala: https://gist.github.com/lakomiec/23e221131554fc9e726f7d6cdc5b88b5
Exception on worker node: https://gist.github.com/lakomiec/560ab486eed981fd864086189afb413e


... one additional thing to add.

We tried:

content = content.persist(StorageLevel.MEMORY_AND_DISK)

but that didn't seem to have any impact...

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Hyukjin Kwon
It seems a bit weird. Could we open an issue and talk in the repository
link I sent?

Let me try to reproduce your case with your data if possible.

On 17 Nov 2016 2:26 a.m., "Arun Patel"  wrote:

> I tried below options.
>
> 1) Increase executor memory.  Increased up to maximum possibility 14GB.
> Same error.
> 2) Tried new version - spark-xml_2.10:0.4.1.  Same error.
> 3) Tried with low level rowTags.  It worked for lower level rowTag and
> returned 16000 rows.
>
> Are there any workarounds for this issue?  I tried playing with 
> spark.memory.fraction
> and spark.memory.storageFraction.  But, it did not help.  Appreciate your
> help on this!!!
>
>
>
> On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel 
> wrote:
>
>> Thanks for the quick response.
>>
>> It's a single XML file and I am using a top-level rowTag.  So, it creates
>> only one row in a Dataframe with 5 columns. One of these columns will
>> contain most of the data as StructType.  Is there a limitation to store
>> data in a cell of a Dataframe?
>>
>> I will check with new version and try to use different rowTags and
>> increase executor-memory tomorrow. I will open a new issue as well.
>>
>>
>>
>> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon 
>> wrote:
>>
>>> Hi Arun,
>>>
>>>
>>> I have few questions.
>>>
>>> Does your XML file have a few huge documents? In the case of a row
>>> having a huge size (like 500MB), it would consume a lot of memory
>>>
>>> because at least it should hold a full row to iterate, if I remember
>>> correctly. I remember this happened to me before while processing a huge
>>> record for test purposes.
>>>
>>>
>>> How about trying to increase --executor-memory?
>>>
>>>
>>> Also, you could try to select only few fields to prune the data with the
>>> latest version just to doubly sure if you don't mind?.
>>>
>>>
>>> Lastly, do you mind if I ask to open an issue in
>>> https://github.com/databricks/spark-xml/issues if you still face this
>>> problem?
>>>
>>> I will try to take a look at my best.
>>>
>>>
>>> Thank you.
>>>
>>>
>>> 2016-11-16 9:12 GMT+09:00 Arun Patel :
>>>
 I am trying to read an XML file which is 1 GB in size.  I am getting an
 error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
 limit' after reading 7 partitions in local mode.  In Yarn mode, it
 throws 'java.lang.OutOfMemoryError: Java heap space' error after
 reading 3 partitions.

 Any suggestion?

 PySpark Shell Command:pyspark --master local[4] --driver-memory 3G
 --jars /tmp/spark-xml_2.10-0.3.3.jar



 Dataframe Creation Command:   df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')



 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
 (TID 1) in 25978 ms on localhost (1/10)

 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
 hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728

 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID
 2). 2309 bytes result sent to driver

 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
 (TID 3, localhost, partition 3,ANY, 2266 bytes)

 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)

 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
 (TID 2) in 51001 ms on localhost (2/10)

 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
 hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728

 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID
 3). 2309 bytes result sent to driver

 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
 (TID 4, localhost, partition 4,ANY, 2266 bytes)

 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)

 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0
 (TID 3) in 24336 ms on localhost (3/10)

 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
 hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728

 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID
 4). 2309 bytes result sent to driver

 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0
 (TID 5, localhost, partition 5,ANY, 2266 bytes)

 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)

 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0
 (TID 4) in 20895 ms on localhost (4/10)

 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
 hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728

 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID
 5). 2309 bytes result sent to driver

 16/11/15 

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Arun Patel
I tried the options below.

1) Increased executor memory, up to the maximum possible (14GB). Same error.
2) Tried the new version, spark-xml_2.10:0.4.1. Same error.
3) Tried lower-level rowTags. It worked for a lower-level rowTag and
returned 16000 rows.

Are there any workarounds for this issue? I tried playing with
spark.memory.fraction and spark.memory.storageFraction, but it did not help.
I would appreciate your help on this!
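
For context, a minimal sketch of how those settings can be passed in (the
values here are only illustrative placeholders, not the exact ones from this
job):

from pyspark import SparkConf, SparkContext

# Illustrative values only -- tune for your own cluster.
conf = (SparkConf()
        .set("spark.executor.memory", "14g")
        .set("spark.memory.fraction", "0.8")          # heap share for execution + storage
        .set("spark.memory.storageFraction", "0.3"))  # part of that share protected for storage
sc = SparkContext(conf=conf)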



On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel  wrote:

> Thanks for the quick response.
>
> It's a single XML file and I am using a top level rowTag.  So, it creates
> only one row in a Dataframe with 5 columns. One of these columns will
> contain most of the data as StructType.  Is there a limitation to store
> data in a cell of a Dataframe?
>
> I will check with new version and try to use different rowTags and
> increase executor-memory tomorrow. I will open a new issue as well.
>
>
>
> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon  wrote:
>
>> Hi Arun,
>>
>>
>> I have a few questions.
>>
>> Does your XML file have a few huge documents? If a single row is huge
>> (like 500MB), it would consume a lot of memory because, if I remember
>>
>> correctly, it at least has to hold a whole row in memory to iterate. I
>> remember this happened to me before while processing a huge record for
>> test purposes.
>>
>>
>> How about trying to increase --executor-memory?
>>
>>
>> Also, you could try to select only few fields to prune the data with the
>> latest version just to doubly sure if you don't mind?.
>>
>>
>> Lastly, do you mind if I ask to open an issue in
>> https://github.com/databricks/spark-xml/issues if you still face this
>> problem?
>>
>> I will try to take a look at my best.
>>
>>
>> Thank you.
>>
>>
>> 2016-11-16 9:12 GMT+09:00 Arun Patel :
>>
>>> I am trying to read an XML file which is 1GB in size.  I am getting an
>>> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
>>> limit' after reading 7 partitions in local mode.  In Yarn mode, it
>>> throws 'java.lang.OutOfMemoryError: Java heap space' error after
>>> reading 3 partitions.
>>>
>>> Any suggestion?
>>>
>>> PySpark Shell Command:pyspark --master local[4] --driver-memory 3G
>>> --jars / tmp/spark-xml_2.10-0.3.3.jar
>>>
>>>
>>>
>>> Dataframe Creation Command:   df = sqlContext.read.format('com.da
>>> tabricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>>
>>>
>>>
>>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
>>> (TID 1) in 25978 ms on localhost (1/10)
>>>
>>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>>
>>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
>>> (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>>
>>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
>>> (TID 2) in 51001 ms on localhost (2/10)
>>>
>>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>>
>>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
>>> (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>>
>>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0
>>> (TID 3) in 24336 ms on localhost (3/10)
>>>
>>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>>
>>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0
>>> (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>>>
>>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0
>>> (TID 4) in 20895 ms on localhost (4/10)
>>>
>>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
>>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>>>
>>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
>>> 2309 bytes result sent to driver
>>>
>>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0
>>> (TID 6, localhost, partition 6,ANY, 2266 bytes)
>>>
>>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>>>
>>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0
>>> (TID 5) in 20793 ms on localhost (5/10)
>>>
>>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
>>> 

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
Thanks for the quick response.

Its a single XML file and I am using a top level rowTag.  So, it creates
only one row in a Dataframe with 5 columns. One of these columns will
contain most of the data as StructType.  Is there a limitation to store
data in a cell of a Dataframe?

I will check with new version and try to use different rowTags and increase
executor-memory tomorrow. I will open a new issue as well.



On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon  wrote:

> Hi Arun,
>
>
> I have a few questions.
>
> Does your XML file have a few huge documents? If a single row is huge
> (like 500MB), it would consume a lot of memory because, if I remember
>
> correctly, it at least has to hold a whole row in memory to iterate. I
> remember this happened to me before while processing a huge record for
> test purposes.
>
>
> How about trying to increase --executor-memory?
>
>
> Also, you could try to select only few fields to prune the data with the
> latest version just to doubly sure if you don't mind?.
>
>
> Lastly, do you mind if I ask to open an issue in https://github.com/
> databricks/spark-xml/issues if you still face this problem?
>
> I will try to take a look at my best.
>
>
> Thank you.
>
>
> 2016-11-16 9:12 GMT+09:00 Arun Patel :
>
>> I am trying to read an XML file which is 1GB in size.  I am getting an
>> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM
>> limit' after reading 7 partitions in local mode.  In Yarn mode, it
>> throws 'java.lang.OutOfMemoryError: Java heap space' error after reading
>> 3 partitions.
>>
>> Any suggestion?
>>
>> PySpark Shell Command:pyspark --master local[4] --driver-memory 3G
>> --jars / tmp/spark-xml_2.10-0.3.3.jar
>>
>>
>>
>> Dataframe Creation Command:   df = sqlContext.read.format('com.da
>> tabricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>
>>
>>
>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0
>> (TID 1) in 25978 ms on localhost (1/10)
>>
>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>
>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0
>> (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>
>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>
>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0
>> (TID 2) in 51001 ms on localhost (2/10)
>>
>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>
>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0
>> (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>
>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>
>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0
>> (TID 3) in 24336 ms on localhost (3/10)
>>
>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>
>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0
>> (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>
>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>>
>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0
>> (TID 4) in 20895 ms on localhost (4/10)
>>
>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>>
>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0
>> (TID 6, localhost, partition 6,ANY, 2266 bytes)
>>
>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>>
>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0
>> (TID 5) in 20793 ms on localhost (5/10)
>>
>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>>
>> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6).
>> 2309 bytes result sent to driver
>>
>> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0
>> (TID 7, localhost, partition 7,ANY, 2266 bytes)
>>
>> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>>
>> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0
>> (TID 6) in 21306 ms on localhost (6/10)
>>
>> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split:
>> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>>
>> 16/11/15 

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Hyukjin Kwon
Hi Arun,


I have a few questions.

Does your XML file have a few huge documents? If a single row is huge (like
500MB), it would consume a lot of memory because, if I remember correctly, it

at least has to hold a whole row in memory to iterate. I remember this
happened to me before while processing a huge record for test purposes.


How about trying to increase --executor-memory?


Also, if you don't mind, could you try selecting only a few fields with the
latest version to prune the data, just to be doubly sure?
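
For example, a minimal sketch of that kind of pruning, based on the load
command earlier in this thread (the selected column names are hypothetical
placeholders -- substitute fields from your actual schema):

df = (sqlContext.read
      .format('com.databricks.spark.xml')
      .options(rowTag='GGL')
      .load('GGL_1.2G.xml')
      .select('field_a', 'field_b'))   # hypothetical columns: keep only what you need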


Lastly, do you mind opening an issue at
https://github.com/databricks/spark-xml/issues if you still face this
problem?

I will try my best to take a look.


Thank you.


2016-11-16 9:12 GMT+09:00 Arun Patel :

> I am trying to read an XML file which is 1GB in size.  I am getting an
> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
> after reading 7 partitions in local mode.  In Yarn mode, it
> throws 'java.lang.OutOfMemoryError: Java heap space' error after reading
> 3 partitions.
>
> Any suggestion?
>
> PySpark Shell Command:pyspark --master local[4] --driver-memory 3G
> --jars / tmp/spark-xml_2.10-0.3.3.jar
>
>
>
> Dataframe Creation Command:   df = sqlContext.read.format('com.da
> tabricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>
>
>
> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID
> 1) in 25978 ms on localhost (1/10)
>
> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>
> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
> 2309 bytes result sent to driver
>
> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
> 3, localhost, partition 3,ANY, 2266 bytes)
>
> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>
> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID
> 2) in 51001 ms on localhost (2/10)
>
> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>
> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
> 2309 bytes result sent to driver
>
> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID
> 4, localhost, partition 4,ANY, 2266 bytes)
>
> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>
> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID
> 3) in 24336 ms on localhost (3/10)
>
> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>
> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
> 2309 bytes result sent to driver
>
> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID
> 5, localhost, partition 5,ANY, 2266 bytes)
>
> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>
> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID
> 4) in 20895 ms on localhost (4/10)
>
> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>
> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
> 2309 bytes result sent to driver
>
> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID
> 6, localhost, partition 6,ANY, 2266 bytes)
>
> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>
> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID
> 5) in 20793 ms on localhost (5/10)
>
> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>
> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6).
> 2309 bytes result sent to driver
>
> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID
> 7, localhost, partition 7,ANY, 2266 bytes)
>
> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>
> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID
> 6) in 21306 ms on localhost (6/10)
>
> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>
> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
> 2309 bytes result sent to driver
>
> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID
> 8, localhost, partition 8,ANY, 2266 bytes)
>
> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>
> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID
> 7) in 21130 ms on localhost (7/10)
>
> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split:
> hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>
> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
> 0)
>
> 

Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
I am trying to read an XML file which is 1GB in size.  I am getting an
error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
after reading 7 partitions in local mode.  In Yarn mode, it
throws 'java.lang.OutOfMemoryError: Java heap space' error after reading 3
partitions.

Any suggestion?

PySpark Shell Command:   pyspark --master local[4] --driver-memory 3G
--jars /tmp/spark-xml_2.10-0.3.3.jar



Dataframe Creation Command:   df = sqlContext.read.format(
'com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')



16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID
1) in 25978 ms on localhost (1/10)

16/11/15 18:27:04 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728

16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2).
2309 bytes result sent to driver

16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
3, localhost, partition 3,ANY, 2266 bytes)

16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)

16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID
2) in 51001 ms on localhost (2/10)

16/11/15 18:27:55 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728

16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3).
2309 bytes result sent to driver

16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID
4, localhost, partition 4,ANY, 2266 bytes)

16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)

16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID
3) in 24336 ms on localhost (3/10)

16/11/15 18:28:19 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728

16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4).
2309 bytes result sent to driver

16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID
5, localhost, partition 5,ANY, 2266 bytes)

16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)

16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID
4) in 20895 ms on localhost (4/10)

16/11/15 18:28:40 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728

16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5).
2309 bytes result sent to driver

16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID
6, localhost, partition 6,ANY, 2266 bytes)

16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)

16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID
5) in 20793 ms on localhost (5/10)

16/11/15 18:29:01 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728

16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6).
2309 bytes result sent to driver

16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID
7, localhost, partition 7,ANY, 2266 bytes)

16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)

16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID
6) in 21306 ms on localhost (6/10)

16/11/15 18:29:22 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728

16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7).
2309 bytes result sent to driver

16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID
8, localhost, partition 8,ANY, 2266 bytes)

16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)

16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID
7) in 21130 ms on localhost (7/10)

16/11/15 18:29:43 INFO NewHadoopRDD: Input split:
hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728

16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

at java.util.Arrays.copyOf(Arrays.java:2271)

at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.
java:113)

at java.io.ByteArrayOutputStream.ensureCapacity(
ByteArrayOutputStream.java:93)

at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.
java:122)

at java.io.DataOutputStream.write(DataOutputStream.java:88)

at com.databricks.spark.xml.XmlRecordReader.
readUntilMatch(XmlInputFormat.scala:188)

at com.databricks.spark.xml.XmlRecordReader.next(
XmlInputFormat.scala:156)

at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(
XmlInputFormat.scala:141)

at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(
NewHadoopRDD.scala:168)

at org.apache.spark.InterruptibleIterator.hasNext(
InterruptibleIterator.scala:39)

at 

Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Spark 2.0 has been released.

Mind giving it a try :-) ?

On Wed, Aug 3, 2016 at 9:11 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:

> OK, thank you. What do you suggest I do to get rid of the error?
>
>
> --
> *From:* Ted Yu <yuzhih...@gmail.com>
> *Sent:* Wednesday, August 3, 2016 6:10 PM
> *To:* Rychnovsky, Dusan
> *Cc:* user@spark.apache.org
> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to
> acquire X bytes of memory, got 0
>
> The latest QA run was no longer accessible (error 404):
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59141/consoleFull
>
> Looking at the comments on the PR, there is not enough confidence in
> pulling in the fix into 1.6
>
> On Wed, Aug 3, 2016 at 9:05 AM, Rychnovsky, Dusan <
> dusan.rychnov...@firma.seznam.cz> wrote:
>
>> I am confused.
>>
>>
>> I tried to look for Spark that would have this issue fixed, i.e.
>> https://github.com/apache/spark/pull/13027/ merged in, but it looks like
>> the patch has not been merged for 1.6.
>>
>>
>> How do I get a fixed 1.6 version?
>>
>>
>> Thanks,
>>
>> Dusan
>>
>>
>> <https://github.com/apache/spark/pull/13027/>
>> [SPARK-4452][SPARK-11293][Core][BRANCH-1.6] Shuffle data structures can
>> starve others on the same thread for memory by lianhuiwang · Pull Request
>> #13027 · apache/spark · GitHub
>> What changes were proposed in this pull request? This PR is for the
>> branch-1.6 version of the commits PR #10024. In #9241 It implemented a
>> mechanism to call spill() on those SQL operators that sup...
>> Read more... <https://github.com/apache/spark/pull/13027/>
>>
>>
>>
>> --
>> *From:* Rychnovsky, Dusan
>> *Sent:* Wednesday, August 3, 2016 3:58 PM
>> *To:* Ted Yu
>>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable
>> to acquire X bytes of memory, got 0
>>
>>
>> Yes, I believe I'm using Spark 1.6.0.
>>
>>
>> > spark-submit --version
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>   /_/
>>
>> I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and
>> therefore should have it fixed, right? Or what do I do to fix it?
>>
>>
>> Thanks,
>>
>> Dusan
>>
>>
>> --
>> *From:* Ted Yu <yuzhih...@gmail.com>
>> *Sent:* Wednesday, August 3, 2016 3:52 PM
>> *To:* Rychnovsky, Dusan
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable
>> to acquire X bytes of memory, got 0
>>
>> Are you using Spark 1.6+ ?
>>
>> See SPARK-11293
>>
>> On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan <
>> dusan.rychnov...@firma.seznam.cz> wrote:
>>
>>> Hi,
>>>
>>>
>>> I have a Spark workflow that when run on a relatively small portion of
>>> data works fine, but when run on big data fails with strange errors. In the
>>> log files of failed executors I found the following errors:
>>>
>>>
>>> Firstly
>>>
>>>
>>> > Managed memory leak detected; size = 263403077 bytes, TID = 6524
>>>
>>> And then a series of
>>>
>>> > java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got
>>> 0
>>>
>>> > at
>>> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>>>
>>>
>>> > at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>
>>> > at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>
>>> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>
>>> > at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>
>>> > at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>
>>>
>>> > at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>
>>>
>>> > at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> The job keeps failing in the same way (I tried a few times).
>>>
>>>
>>> What could be causing such error?
>>>
>>> I have a feeling that I'm not providing enough context necessary to
>>> understand the issue. Please ask for any other information needed.
>>>
>>>
>>> Thank you,
>>>
>>> Dusan
>>>
>>>
>>>
>>
>


Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
OK, thank you. What do you suggest I do to get rid of the error?



From: Ted Yu <yuzhih...@gmail.com>
Sent: Wednesday, August 3, 2016 6:10 PM
To: Rychnovsky, Dusan
Cc: user@spark.apache.org
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire 
X bytes of memory, got 0

The latest QA run was no longer accessible (error 404):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59141/consoleFull

Looking at the comments on the PR, there is not enough confidence in pulling in 
the fix into 1.6

On Wed, Aug 3, 2016 at 9:05 AM, Rychnovsky, Dusan 
<dusan.rychnov...@firma.seznam.cz<mailto:dusan.rychnov...@firma.seznam.cz>> 
wrote:

I am confused.


I tried to look for Spark that would have this issue fixed, i.e. 
https://github.com/apache/spark/pull/13027/ merged in, but it looks like the 
patch has not been merged for 1.6.


How do I get a fixed 1.6 version?


Thanks,

Dusan



[SPARK-4452][SPARK-11293][Core][BRANCH-1.6] Shuffle data structures can starve 
others on the same thread for memory by lianhuiwang · Pull Request #13027 · 
apache/spark · GitHub
What changes were proposed in this pull request? This PR is for the branch-1.6 
version of the commits PR #10024. In #9241 It implemented a mechanism to call 
spill() on those SQL operators that sup...
Read more...<https://github.com/apache/spark/pull/13027/>





From: Rychnovsky, Dusan
Sent: Wednesday, August 3, 2016 3:58 PM
To: Ted Yu

Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire 
X bytes of memory, got 0


Yes, I believe I'm using Spark 1.6.0.


> spark-submit --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
  /_/


I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and 
therefore should have it fixed, right? Or what do I do to fix it?


Thanks,

Dusan



From: Ted Yu <yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>>
Sent: Wednesday, August 3, 2016 3:52 PM
To: Rychnovsky, Dusan
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire 
X bytes of memory, got 0

Are you using Spark 1.6+ ?

See SPARK-11293

On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan 
<dusan.rychnov...@firma.seznam.cz<mailto:dusan.rychnov...@firma.seznam.cz>> 
wrote:

Hi,


I have a Spark workflow that when run on a relatively small portion of data 
works fine, but when run on big data fails with strange errors. In the log 
files of failed executors I found the following errors:


Firstly


> Managed memory leak detected; size = 263403077 bytes, TID = 6524

And then a series of

> java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got 0

> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

> at org.apache.spark.scheduler.Task.run(Task.scala:89)

> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

> at java.lang.Thread.run(Thread.java:745)


The job keeps failing in the same way (I tried a few times).


What could be causing such error?

I have a feeling that I'm not providing enough context necessary to understand 
the issue. Please ask for any other information needed.


Thank you,

Dusan





Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
The latest QA run was no longer accessible (error 404):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59141/consoleFull

Looking at the comments on the PR, there is not enough confidence in
pulling in the fix into 1.6

On Wed, Aug 3, 2016 at 9:05 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:

> I am confused.
>
>
> I tried to look for Spark that would have this issue fixed, i.e.
> https://github.com/apache/spark/pull/13027/ merged in, but it looks like
> the patch has not been merged for 1.6.
>
>
> How do I get a fixed 1.6 version?
>
>
> Thanks,
>
> Dusan
>
>
> <https://github.com/apache/spark/pull/13027/>
> [SPARK-4452][SPARK-11293][Core][BRANCH-1.6] Shuffle data structures can
> starve others on the same thread for memory by lianhuiwang · Pull Request
> #13027 · apache/spark · GitHub
> What changes were proposed in this pull request? This PR is for the
> branch-1.6 version of the commits PR #10024. In #9241 It implemented a
> mechanism to call spill() on those SQL operators that sup...
> Read more... <https://github.com/apache/spark/pull/13027/>
>
>
>
> --
> *From:* Rychnovsky, Dusan
> *Sent:* Wednesday, August 3, 2016 3:58 PM
> *To:* Ted Yu
>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to
> acquire X bytes of memory, got 0
>
>
> Yes, I believe I'm using Spark 1.6.0.
>
>
> > spark-submit --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>   /_/
>
> I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and
> therefore should have it fixed, right? Or what do I do to fix it?
>
>
> Thanks,
>
> Dusan
>
>
> --
> *From:* Ted Yu <yuzhih...@gmail.com>
> *Sent:* Wednesday, August 3, 2016 3:52 PM
> *To:* Rychnovsky, Dusan
> *Cc:* user@spark.apache.org
> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to
> acquire X bytes of memory, got 0
>
> Are you using Spark 1.6+ ?
>
> See SPARK-11293
>
> On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan <
> dusan.rychnov...@firma.seznam.cz> wrote:
>
>> Hi,
>>
>>
>> I have a Spark workflow that when run on a relatively small portion of
>> data works fine, but when run on big data fails with strange errors. In the
>> log files of failed executors I found the following errors:
>>
>>
>> Firstly
>>
>>
>> > Managed memory leak detected; size = 263403077 bytes, TID = 6524
>>
>> And then a series of
>>
>> > java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got
>> 0
>>
>> > at
>> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>>
>>
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>
>> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>
>> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>
>> > at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>>
>> > at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>>
>> > at java.lang.Thread.run(Thread.java:745)
>>
>>
>> The job keeps failing in the same way (I tried a few times).
>>
>>
>> What could be causing such error?
>>
>> I have a feeling that I'm not providing enough context necessary to
>> understand the issue. Please ask for any other information needed.
>>
>>
>> Thank you,
>>
>> Dusan
>>
>>
>>
>


Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
I am confused.


I tried to look for Spark that would have this issue fixed, i.e. 
https://github.com/apache/spark/pull/13027/ merged in, but it looks like the 
patch has not been merged for 1.6.


How do I get a fixed 1.6 version?


Thanks,

Dusan



[SPARK-4452][SPARK-11293][Core][BRANCH-1.6] Shuffle data structures can starve 
others on the same thread for memory by lianhuiwang · Pull Request #13027 · 
apache/spark · GitHub
What changes were proposed in this pull request? This PR is for the branch-1.6 
version of the commits PR #10024. In #9241 It implemented a mechanism to call 
spill() on those SQL operators that sup...
Read more...<https://github.com/apache/spark/pull/13027/>





From: Rychnovsky, Dusan
Sent: Wednesday, August 3, 2016 3:58 PM
To: Ted Yu
Cc: user@spark.apache.org
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire 
X bytes of memory, got 0


Yes, I believe I'm using Spark 1.6.0.


> spark-submit --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
  /_/


I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and 
therefore should have it fixed, right? Or what do I do to fix it?


Thanks,

Dusan



From: Ted Yu <yuzhih...@gmail.com>
Sent: Wednesday, August 3, 2016 3:52 PM
To: Rychnovsky, Dusan
Cc: user@spark.apache.org
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire 
X bytes of memory, got 0

Are you using Spark 1.6+ ?

See SPARK-11293

On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan 
<dusan.rychnov...@firma.seznam.cz<mailto:dusan.rychnov...@firma.seznam.cz>> 
wrote:

Hi,


I have a Spark workflow that when run on a relatively small portion of data 
works fine, but when run on big data fails with strange errors. In the log 
files of failed executors I found the following errors:


Firstly


> Managed memory leak detected; size = 263403077 bytes, TID = 6524

And then a series of

> java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got 0

> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

> at org.apache.spark.scheduler.Task.run(Task.scala:89)

> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

> at java.lang.Thread.run(Thread.java:745)


The job keeps failing in the same way (I tried a few times).


What could be causing such error?

I have a feeling that I'm not providing enough context necessary to understand 
the issue. Please ask for any other information needed.


Thank you,

Dusan




Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
Yes, I believe I'm using Spark 1.6.0.


> spark-submit --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
  /_/


I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and 
therefore should have it fixed, right? Or what do I do to fix it?


Thanks,

Dusan



From: Ted Yu <yuzhih...@gmail.com>
Sent: Wednesday, August 3, 2016 3:52 PM
To: Rychnovsky, Dusan
Cc: user@spark.apache.org
Subject: Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire 
X bytes of memory, got 0

Are you using Spark 1.6+ ?

See SPARK-11293

On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan 
<dusan.rychnov...@firma.seznam.cz<mailto:dusan.rychnov...@firma.seznam.cz>> 
wrote:

Hi,


I have a Spark workflow that when run on a relatively small portion of data 
works fine, but when run on big data fails with strange errors. In the log 
files of failed executors I found the following errors:


Firstly


> Managed memory leak detected; size = 263403077 bytes, TID = 6524

And then a series of

> java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got 0

> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

> at org.apache.spark.scheduler.Task.run(Task.scala:89)

> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

> at java.lang.Thread.run(Thread.java:745)


The job keeps failing in the same way (I tried a few times).


What could be causing such error?

I have a feeling that I'm not providing enough context necessary to understand 
the issue. Please ask for any other information needed.


Thank you,

Dusan




Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Are you using Spark 1.6+ ?

See SPARK-11293
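
In the meantime, one mitigation that sometimes reduces the chance of hitting
"Unable to acquire ... bytes of memory" is to spread the shuffle over more,
smaller tasks. A rough sketch only (the partition counts are illustrative,
and this is a workaround rather than the fix tracked in the ticket):

sqlContext.setConf("spark.sql.shuffle.partitions", "1000")  # DataFrame/SQL shuffles
rdd = rdd.repartition(2000)                                 # plain RDD shuffles ("rdd" is a placeholder name)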

On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:

> Hi,
>
>
> I have a Spark workflow that when run on a relatively small portion of
> data works fine, but when run on big data fails with strange errors. In the
> log files of failed executors I found the following errors:
>
>
> Firstly
>
>
> > Managed memory leak detected; size = 263403077 bytes, TID = 6524
>
> And then a series of
>
> > java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got 0
>
> > at
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>
>
> > at
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
>
>
> > at
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
>
>
> > at
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
>
>
> > at
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>
>
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
>
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
>
> > at java.lang.Thread.run(Thread.java:745)
>
>
> The job keeps failing in the same way (I tried a few times).
>
>
> What could be causing such error?
>
> I have a feeling that I'm not providing enough context necessary to
> understand the issue. Please ask for any other information needed.
>
>
> Thank you,
>
> Dusan
>
>
>


Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Rychnovsky, Dusan
Hi,


I have a Spark workflow that works fine when run on a relatively small
portion of data, but fails with strange errors when run on big data. In the
log files of failed executors I found the following errors:


Firstly


> Managed memory leak detected; size = 263403077 bytes, TID = 6524

And then a series of

> java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got 0

> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)

> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)

> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)

> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

> at org.apache.spark.scheduler.Task.run(Task.scala:89)

> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

> at java.lang.Thread.run(Thread.java:745)


The job keeps failing in the same way (I tried a few times).


What could be causing such an error?

I have a feeling that I'm not providing enough of the context needed to
understand the issue. Please ask for any other information you need.


Thank you,

Dusan



Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread Yuhao Yang
Hi Sharad,

what's your vocabulary size and vector length for Word2Vec?

Regards,
Yuhao

2016-06-13 20:04 GMT+08:00 sharad82 <khandelwal.gem...@gmail.com>:

> Is this the right forum to post Spark related issues ? I have tried this
> forum along with StackOverflow but not seeing any response.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-When-saving-Word2Vec-tp27142p27151.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread sharad82
Is this the right forum to post Spark-related issues? I have tried this
forum along with StackOverflow but am not seeing any response.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-When-saving-Word2Vec-tp27142p27151.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: OutOfMemoryError - When saving Word2Vec

2016-06-12 Thread vaquar khan
Hi Sharad.

The array size you (or the serializer) are trying to allocate is just too
big for the JVM.

You can also split your input further by increasing parallelism (see the
sketch after the link below).

The following is a good explanation:

https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit
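
As a rough sketch of the parallelism suggestion (the column names, sizes and
partition counts below are placeholders only, not values from this job):

from pyspark.ml.feature import Word2Vec

# Placeholder values -- tune vectorSize / numPartitions for your corpus.
w2v = Word2Vec(vectorSize=200, minCount=5, numPartitions=64,
               inputCol="tokens", outputCol="vectors")
model = w2v.fit(tokenized_df.repartition(400))  # more partitions => smaller tasks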

regards,
Vaquar khan

On Sun, Jun 12, 2016 at 5:08 AM, sharad82 <khandelwal.gem...@gmail.com>
wrote:

> When trying to save the word2vec model trained over 10G of data leads to
> below OOM error.
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> Spark Version: 1.6
> spark.dynamicAllocation.enable  false
> spark.executor.memory   75g
> spark.driver.memory 150g
> spark.driver.cores  10
>
> Full Stack Trace:
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at
>
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at
>
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at java.lang.StringBuilder.append(StringBuilder.java:131)
> at
> scala.StringContext.standardInterpolator(StringContext.scala:122)
> at scala.StringContext.s(StringContext.scala:90)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
> at
>
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> at
>
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at
>
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at
>
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at
>
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
> at
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
> at
>
> org.apache.spark.ml.feature.Word2VecModel$Word2VecModelWriter.saveImpl(Word2Vec.scala:271)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:91)
> at
> org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:131)
> at
> org.apache.spark.ml.feature.Word2VecModel.save(Word2Vec.scala:172)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-When-saving-Word2Vec-tp27142.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Regards,
Vaquar Khan
+91 830-851-1500


OutOfMemoryError - When saving Word2Vec

2016-06-12 Thread sharad82
Trying to save a word2vec model trained over 10G of data leads to the OOM
error below.

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

Spark Version: 1.6
spark.dynamicAllocation.enable  false
spark.executor.memory   75g
spark.driver.memory 150g
spark.driver.cores  10

Full Stack Trace:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at scala.StringContext.standardInterpolator(StringContext.scala:122)
at scala.StringContext.s(StringContext.scala:90)
at
org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
at
org.apache.spark.ml.feature.Word2VecModel$Word2VecModelWriter.saveImpl(Word2Vec.scala:271)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:91)
at org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:131)
at org.apache.spark.ml.feature.Word2VecModel.save(Word2Vec.scala:172)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-When-saving-Word2Vec-tp27142.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Thanks for the suggestions and links. The problem arises when I use the
DataFrame API to write, but it works fine when doing an insert overwrite into
the Hive table.

# Works fine
hive_context.sql(("insert overwrite table {0} partition (e_dt, c_dt) "
                  "select * from temp_table").format(table_name))
# Doesn't work; throws java.lang.OutOfMemoryError: Requested array size
# exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt', 'c_dt').parquet("/path/to/file/")
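
(A workaround sketch only, not verified against this dataset: repartitioning
by the partition columns before the write so each task buffers fewer open
partitions at once. Column-based repartition needs Spark 1.6+, and the
partition count here is arbitrary.)

df.repartition(400, 'e_dt', 'c_dt') \
  .write.mode('overwrite') \
  .partitionBy('e_dt', 'c_dt') \
  .parquet("/path/to/file/")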

Thanks,
Bijay

On Wed, May 4, 2016 at 3:02 PM, Prajwal Tuladhar  wrote:

> If you are running on 64-bit JVM with less than 32G heap, you might want
> to enable -XX:+UseCompressedOops[1]. And if your dataframe is somehow
> generating more than 2^31-1 number of arrays, you might have to rethink
> your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak 
> wrote:
>
>> Hi,
>>
>> I am reading the parquet file around 50+ G which has 4013 partitions with
>> 240 columns. Below is my configuration
>>
>> driver : 20G memory with 4 cores
>> executors: 45 executors with 15G memory and 4 cores.
>>
>> I tried to read the data using both Dataframe read and using hive context
>> to read the data using hive SQL but for the both cases, it throws me below
>> error with no  further description on error.
>>
>> hive_context.sql("select * from test.base_table where
>> date='{0}'".format(part_dt))
>> sqlcontext.read.parquet("/path/to/partion/")
>>
>> #
>> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> #   Executing /bin/sh -c "kill -9 16953"...
>>
>>
>> What could be wrong over here since I think increasing memory only will
>> not help in this case since it reached the array size limit.
>>
>> Thanks,
>> Bijay
>>
>
>
>
> --
> --
> Cheers,
> Praj
>


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Prajwal Tuladhar
If you are running on a 64-bit JVM with less than a 32G heap, you might want
to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
generating a single array with more than 2^31-1 elements, you might have to
rethink your options.

[1] https://spark.apache.org/docs/latest/tuning.html
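
A sketch of how that flag can be passed (nothing here is specific to this
job, and many recent 64-bit JVMs already enable compressed oops by default
for heaps under ~32G):

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
sc = SparkContext(conf=conf)
# The driver JVM is already running at this point, so for the driver the flag
# has to be given at launch time instead, e.g.:
#   spark-submit --driver-java-options "-XX:+UseCompressedOops" ...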

On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak  wrote:

> Hi,
>
> I am reading the parquet file around 50+ G which has 4013 partitions with
> 240 columns. Below is my configuration
>
> driver : 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data using both Dataframe read and using hive context
> to read the data using hive SQL but for the both cases, it throws me below
> error with no  further description on error.
>
> hive_context.sql("select * from test.base_table where
> date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong over here since I think increasing memory only will
> not help in this case since it reached the array size limit.
>
> Thanks,
> Bijay
>



-- 
--
Cheers,
Praj


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtyXr2N13hf9O=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit

On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak <bkpat...@mtu.edu> wrote:

> Hi,
>
> I am reading the parquet file around 50+ G which has 4013 partitions with
> 240 columns. Below is my configuration
>
> driver : 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data using both Dataframe read and using hive context
> to read the data using hive SQL but for the both cases, it throws me below
> error with no  further description on error.
>
> hive_context.sql("select * from test.base_table where
> date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong over here since I think increasing memory only will
> not help in this case since it reached the array size limit.
>
> Thanks,
> Bijay
>


SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi,

I am reading a parquet file of around 50+ GB which has 4013 partitions with
240 columns. Below is my configuration:

driver : 20G memory with 4 cores
executors: 45 executors with 15G memory and 4 cores.

I tried to read the data both with the DataFrame reader and with a Hive context
using Hive SQL, but in both cases it throws the error below with no further
description:

hive_context.sql("select * from test.base_table where
date='{0}'".format(part_dt))
sqlcontext.read.parquet("/path/to/partion/")

#
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 16953"...


What could be wrong here? I think increasing memory alone will not help in this
case, since it has already hit the array size limit.
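
For illustration only, a sketch of the same reads with explicit partition and
column pruning (the table is assumed to be partitioned by date, and the selected
column names are placeholders, not from this job):

// Sketch: prune partitions and columns at read time so Spark never scans
// all 4013 partitions x 240 columns.
import sqlContext.implicits._            // for the $"..." column syntax
val partDt = "2016-05-04"                // placeholder partition value
val pruned = sqlContext.read.parquet("/path/to/base_table")
  .filter($"date" === partDt)            // partition pruning on the partition column
  .select("col_a", "col_b", "col_c")     // column pruning: only the columns needed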

Thanks,
Bijay


Re: PCA OutOfMemoryError

2016-01-17 Thread Bharath Ravi Kumar
Hello Alex,

Thanks for the response. There isn't much other data on the driver, so the
issue is probably inherent to this particular PCA implementation.  I'll try
the alternative approach that you suggested instead. Thanks again.

-Bharath

On Wed, Jan 13, 2016 at 11:24 PM, Alex Gittens  wrote:

> The PCA.fit function calls the RowMatrix PCA routine, which attempts to
> construct the covariance matrix locally on the driver, and then computes
> the SVD of that to get the PCs. I'm not sure what's causing the memory
> error: RowMatrix.scala:124 is only using 3.5 GB of memory (n*(n+1)/2 with
> n=29604 and double precision), so unless you're filling up the memory with
> other RDDs, you should have plenty of space on the driver.
>
> One alternative is to manually center your RDD (so make one pass over it
> to compute the mean, then another to subtract it out and form a new RDD),
> then directly call the computeSVD routine in RowMatrix to compute the SVD
> of the gramian matrix of this RDD (e.g., the covariance matrix of the
> original RDD) in a distributed manner, so the covariance matrix doesn't
> need to be formed explicitly. You can look at the getLowRankFactorization
> and convertLowRankFactorizationToEOFs routines at
>
> https://github.com/rustandruin/large-scale-climate/blob/master/src/main/scala/eofs.scala
> for example of this approach (call the second on the results of the first
> to get the SVD of the input matrix to the first; EOF is another name for
> PCA).
>
> This takes about 30 minutes to compute the top 20 PCs of a 46.7K-by-6.3M
> dense matrix of doubles (~2 Tb), with most of the time spent on the
> distributed matrix-vector multiplies.
>
> Best,
> Alex
>
>
> On Tue, Jan 12, 2016 at 6:39 PM, Bharath Ravi Kumar 
> wrote:
>
>> Any suggestion/opinion?
>> On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar"  wrote:
>>
>>> We're running PCA (selecting 100 principal components) on a dataset that
>>> has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
>>> matrix in question is mostly sparse with tens of columns populate in most
>>> rows, but a few rows with thousands of columns populated. We're running
>>> spark on mesos with driver memory set to 40G and executor memory set to
>>> 80G. We're however encountering an out of memory error (included at the end
>>> of the message) regardless of the number of rdd partitions or the degree of
>>> task parallelism being set. I noticed a warning at the beginning of the PCA
>>> computation stage: " WARN
>>> org.apache.spark.mllib.linalg.distributed.RowMatrix: 29604 columns will
>>> require at least 7011 megabyte  of memory!"
>>> I don't understand which memory this refers to. Is this the executor
>>> memory?  The driver memory? Any other?
>>> The stacktrace appears to indicate that a large array is probably being
>>> passed along with the task. Could this array have been passed as a
>>> broadcast variable instead ? Any suggestions / workarounds other than
>>> re-implementing the algorithm?
>>>
>>> Thanks,
>>> Bharath
>>>
>>> 
>>>
>>> Exception in thread "main" java.lang.OutOfMemoryError: Requested array
>>> size exceeds VM limit
>>> at java.util.Arrays.copyOf(Arrays.java:2271)
>>> at
>>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>> at
>>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>> at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>>> at
>>> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>>> at
>>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>>> at
>>> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>>> at
>>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>>> at
>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
>>> at
>>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
>>> at
>>> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>>> at
>>> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>>> at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
>>> at
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>>> at 

Re: PCA OutOfMemoryError

2016-01-13 Thread Alex Gittens
The PCA.fit function calls the RowMatrix PCA routine, which attempts to
construct the covariance matrix locally on the driver, and then computes
the SVD of that to get the PCs. I'm not sure what's causing the memory
error: RowMatrix.scala:124 is only using 3.5 GB of memory (n*(n+1)/2 with
n=29604 and double precision), so unless you're filling up the memory with
other RDDs, you should have plenty of space on the driver.

One alternative is to manually center your RDD (so make one pass over it to
compute the mean, then another to subtract it out and form a new RDD), then
directly call the computeSVD routine in RowMatrix to compute the SVD of the
gramian matrix of this RDD (e.g., the covariance matrix of the original
RDD) in a distributed manner, so the covariance matrix doesn't need to be
formed explicitly. You can look at the getLowRankFactorization and
convertLowRankFactorizationToEOFs routines at
https://github.com/rustandruin/large-scale-climate/blob/master/src/main/scala/eofs.scala
for example of this approach (call the second on the results of the first
to get the SVD of the input matrix to the first; EOF is another name for
PCA).
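
For illustration, a minimal Scala sketch of that recipe (two passes to centre
the data, then a distributed SVD); `rows` and `k` are placeholders for the input
RDD[Vector] and the number of components:

// Sketch of the approach described above, assuming MLlib's Vector / RowMatrix APIs.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

def principalDirections(rows: RDD[Vector], k: Int) = {
  val mean = Statistics.colStats(rows).mean.toArray        // pass 1: column means
  val centered = rows.map { v =>                           // pass 2: subtract the mean
    val a = v.toArray.clone()
    var i = 0
    while (i < a.length) { a(i) -= mean(i); i += 1 }
    Vectors.dense(a)
  }.cache()
  // SVD of the centred matrix: the columns of V are the top-k principal directions,
  // so the covariance matrix never has to be materialised on the driver.
  new RowMatrix(centered).computeSVD(k, computeU = false)
}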

This takes about 30 minutes to compute the top 20 PCs of a 46.7K-by-6.3M
dense matrix of doubles (~2 Tb), with most of the time spent on the
distributed matrix-vector multiplies.

Best,
Alex


On Tue, Jan 12, 2016 at 6:39 PM, Bharath Ravi Kumar 
wrote:

> Any suggestion/opinion?
> On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar"  wrote:
>
>> We're running PCA (selecting 100 principal components) on a dataset that
>> has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
>> matrix in question is mostly sparse with tens of columns populate in most
>> rows, but a few rows with thousands of columns populated. We're running
>> spark on mesos with driver memory set to 40G and executor memory set to
>> 80G. We're however encountering an out of memory error (included at the end
>> of the message) regardless of the number of rdd partitions or the degree of
>> task parallelism being set. I noticed a warning at the beginning of the PCA
>> computation stage: " WARN
>> org.apache.spark.mllib.linalg.distributed.RowMatrix: 29604 columns will
>> require at least 7011 megabyte  of memory!"
>> I don't understand which memory this refers to. Is this the executor
>> memory?  The driver memory? Any other?
>> The stacktrace appears to indicate that a large array is probably being
>> passed along with the task. Could this array have been passed as a
>> broadcast variable instead ? Any suggestions / workarounds other than
>> re-implementing the algorithm?
>>
>> Thanks,
>> Bharath
>>
>> 
>>
>> Exception in thread "main" java.lang.OutOfMemoryError: Requested array
>> size exceeds VM limit
>> at java.util.Arrays.copyOf(Arrays.java:2271)
>> at
>> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>> at
>> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>> at
>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
>> at
>> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>> at
>> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>> at
>> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>> at
>> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>> at
>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
>> at
>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
>> at
>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
>> at
>> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>> at
>> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>> at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1100)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>> at 

Re: PCA OutOfMemoryError

2016-01-12 Thread Bharath Ravi Kumar
Any suggestion/opinion?
On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar"  wrote:

> We're running PCA (selecting 100 principal components) on a dataset that
> has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
> matrix in question is mostly sparse with tens of columns populate in most
> rows, but a few rows with thousands of columns populated. We're running
> spark on mesos with driver memory set to 40G and executor memory set to
> 80G. We're however encountering an out of memory error (included at the end
> of the message) regardless of the number of rdd partitions or the degree of
> task parallelism being set. I noticed a warning at the beginning of the PCA
> computation stage: " WARN
> org.apache.spark.mllib.linalg.distributed.RowMatrix: 29604 columns will
> require at least 7011 megabyte  of memory!"
> I don't understand which memory this refers to. Is this the executor
> memory?  The driver memory? Any other?
> The stacktrace appears to indicate that a large array is probably being
> passed along with the task. Could this array have been passed as a
> broadcast variable instead ? Any suggestions / workarounds other than
> re-implementing the algorithm?
>
> Thanks,
> Bharath
>
> 
>
> Exception in thread "main" java.lang.OutOfMemoryError: Requested array
> size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at
> java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at
> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
> at
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
> at
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
> at
> java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
> at
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
> at
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1100)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:124)
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:350)
> at
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:386)
> at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:46)
>
>


PCA OutOfMemoryError

2016-01-12 Thread Bharath Ravi Kumar
We're running PCA (selecting 100 principal components) on a dataset that
has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
matrix in question is mostly sparse, with tens of columns populated in most
rows, but a few rows with thousands of columns populated. We're running
spark on mesos with driver memory set to 40G and executor memory set to
80G. We're however encountering an out of memory error (included at the end
of the message) regardless of the number of rdd partitions or the degree of
task parallelism being set. I noticed a warning at the beginning of the PCA
computation stage: " WARN
org.apache.spark.mllib.linalg.distributed.RowMatrix: 29604 columns will
require at least 7011 megabyte  of memory!"
I don't understand which memory this refers to. Is this the executor
memory?  The driver memory? Any other?
The stacktrace appears to indicate that a large array is probably being
passed along with the task. Could this array have been passed as a
broadcast variable instead ? Any suggestions / workarounds other than
re-implementing the algorithm?

Thanks,
Bharath



Exception in thread "main" java.lang.OutOfMemoryError: Requested array size
exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at
org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1100)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1091)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:124)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:350)
at
org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:386)
at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:46)


Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Barak Yaish
Hello,

I have a 5-node cluster which hosts both HDFS datanodes and Spark workers.
Each node has 8 CPUs and 16G memory. Spark version is 1.5.2, and spark-env.sh
is as follows:

export SPARK_MASTER_IP=10.52.39.92

export SPARK_WORKER_INSTANCES=4

export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=4g

And more settings done in the application code:

sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator",InternalKryoRegistrator.class.getName());
sparkConf.set("spark.kryo.registrationRequired","true");
sparkConf.set("spark.kryoserializer.buffer.max.mb","512");
sparkConf.set("spark.default.parallelism","300");
sparkConf.set("spark.rpc.askTimeout","500");

I'm trying to load data from HDFS and run some SQL on it (mostly group-by)
using DataFrames. The logs keep saying that tasks are lost due to
OutOfMemoryError (GC overhead limit exceeded).

Can you advise what the recommended settings (memory, cores, partitions,
etc.) would be for the given hardware?

Thanks!


Re: Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Muthu Jayakumar
>export SPARK_WORKER_MEMORY=4g
Maybe you could increase the max heap size on the worker? If the
OutOfMemoryError is on the driver, then you may want to set the driver memory
explicitly.
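
For illustration, a sketch of the knobs involved (the values are placeholders;
spark.driver.memory is only honoured when set before the driver JVM starts,
e.g. in spark-defaults.conf or via spark-submit --driver-memory):

// Sketch only: size the executor heap explicitly instead of relying on defaults.
// SPARK_WORKER_MEMORY in spark-env.sh caps what executors may request; the
// actual executor -Xmx comes from spark.executor.memory.
val sparkConf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.driver.memory", "4g")  // see the note above about when this applies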

Thanks,



On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish <barak.ya...@gmail.com> wrote:

> Hello,
>
> I've a 5 nodes cluster which hosts both hdfs datanodes and spark workers.
> Each node has 8 cpu and 16G memory. Spark version is 1.5.2, spark-env.sh is
> as follow:
>
> export SPARK_MASTER_IP=10.52.39.92
>
> export SPARK_WORKER_INSTANCES=4
>
> export SPARK_WORKER_CORES=8
> export SPARK_WORKER_MEMORY=4g
>
> And more settings done in the application code:
>
>
> sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
>
> sparkConf.set("spark.kryo.registrator",InternalKryoRegistrator.class.getName());
> sparkConf.set("spark.kryo.registrationRequired","true");
> sparkConf.set("spark.kryoserializer.buffer.max.mb","512");
> sparkConf.set("spark.default.parallelism","300");
> sparkConf.set("spark.rpc.askTimeout","500");
>
> I'm trying to load data from hdfs and running some sqls on it (mostly
> groupby) using DataFrames. The logs keep saying that tasks are lost due to
> OutOfMemoryError (GC overhead limit exceeded).
>
> Can you advice what is the recommended settings (memory, cores,
> partitions, etc.) for the given hardware?
>
> Thanks!
>


Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread Deenar Toraskar
Hi

Why don't you check whether you can process the large file standalone first,
and then do the outer loop?

sqlContext.read.json(jsonFile)
  .select($"some", $"fields")
  .withColumn("new_col", some_transformations($"col"))
  .rdd.map( x: Row => (k, v) )
  .combineByKey()

Deenar

On 14 October 2015 at 05:18, SLiZn Liu  wrote:

> Hey Spark Users,
>
> I kept getting java.lang.OutOfMemoryError: Java heap space as I read a
> massive amount of json files, iteratively via read.json(). Even the
> result RDD is rather small, I still get the OOM Error. The brief structure
> of my program reads as following, in psuedo-code:
>
> file_path_list.map{ jsonFile: String =>
>   sqlContext.read.json(jsonFile)
> .select($"some", $"fields")
> .withColumn("new_col", some_transformations($"col"))
> .rdd.map( x: Row => (k, v) )
> .combineByKey() // which groups a column into item lists by another 
> column as keys
> }.reduce( (i, j) => i.union(j) )
> .combineByKey() // which combines results from all json files
>
> I confess some of the json files are Gigabytes huge, yet the combined RDD
> is in a few Megabytes. I’m not familiar with the under-the-hood mechanism,
> but my intuitive understanding of how the code executes is, read the file
> once a time (where I can easily modify map to foreach when fetching from
> file_path_list, if that’s the case), do the inner transformation on DF
> and combine, then reduce and do the outer combine immediately, which
> doesn’t require to hold all RDDs generated from all files in the memory.
> Obviously, as my code raises OOM Error, I must have missed something
> important.
>
> From the debug log, I can tell the OOM Error happens when reading the same
> file, which is in a modest size of 2GB, while driver.memory is set to 13GB,
> and the available memory size before the code execution is around 8GB, on
> my standalone machine running as “local[8]”.
>
> To overcome this, I also tried to initialize an empty universal RDD
> variable, iteratively read one file at a time using foreach, then instead
> of reduce, simply combine each RDD generated by the json files, except the
> OOM Error remains.
>
> Other configurations:
>
>- set(“spark.storage.memoryFraction”, “0.1”) // no cache of RDD is used
>- set(“spark.serializer”, “org.apache.spark.serializer.KryoSerializer”)
>
> Any suggestions other than scale up/out the spark cluster?
>
> BR,
> Todd Leo
> ​
>


Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread SLiZn Liu
Yes, it only went wrong when processing a large file. I removed the
transformations on the DF and it worked just fine, but doing a simple filter
operation on the DF became the straw that broke the camel’s back.
That’s confusing.
​

On Wed, Oct 14, 2015 at 2:11 PM Deenar Toraskar 
wrote:

> Hi
>
> Why dont you check if you can just process the large file standalone and
> then do the outer loop next.
>
> sqlContext.read.json(jsonFile) .select($"some", $"fields") .withColumn(
> "new_col", some_transformations($"col")) .rdd.map( x: Row => (k, v) )
> .combineByKey()
>
> Deenar
>
> On 14 October 2015 at 05:18, SLiZn Liu  wrote:
>
>> Hey Spark Users,
>>
>> I kept getting java.lang.OutOfMemoryError: Java heap space as I read a
>> massive amount of json files, iteratively via read.json(). Even the
>> result RDD is rather small, I still get the OOM Error. The brief structure
>> of my program reads as following, in psuedo-code:
>>
>> file_path_list.map{ jsonFile: String =>
>>   sqlContext.read.json(jsonFile)
>> .select($"some", $"fields")
>> .withColumn("new_col", some_transformations($"col"))
>> .rdd.map( x: Row => (k, v) )
>> .combineByKey() // which groups a column into item lists by another 
>> column as keys
>> }.reduce( (i, j) => i.union(j) )
>> .combineByKey() // which combines results from all json files
>>
>> I confess some of the json files are Gigabytes huge, yet the combined RDD
>> is in a few Megabytes. I’m not familiar with the under-the-hood mechanism,
>> but my intuitive understanding of how the code executes is, read the file
>> once a time (where I can easily modify map to foreach when fetching from
>> file_path_list, if that’s the case), do the inner transformation on DF
>> and combine, then reduce and do the outer combine immediately, which
>> doesn’t require to hold all RDDs generated from all files in the memory.
>> Obviously, as my code raises OOM Error, I must have missed something
>> important.
>>
>> From the debug log, I can tell the OOM Error happens when reading the
>> same file, which is in a modest size of 2GB, while driver.memory is set to
>> 13GB, and the available memory size before the code execution is around
>> 8GB, on my standalone machine running as “local[8]”.
>>
>> To overcome this, I also tried to initialize an empty universal RDD
>> variable, iteratively read one file at a time using foreach, then
>> instead of reduce, simply combine each RDD generated by the json files,
>> except the OOM Error remains.
>>
>> Other configurations:
>>
>>- set(“spark.storage.memoryFraction”, “0.1”) // no cache of RDD is
>>used
>>- set(“spark.serializer”,
>>“org.apache.spark.serializer.KryoSerializer”)
>>
>> Any suggestions other than scale up/out the spark cluster?
>>
>> BR,
>> Todd Leo
>> ​
>>
>
>


OutOfMemoryError When Reading Many json Files

2015-10-13 Thread SLiZn Liu
Hey Spark Users,

I keep getting java.lang.OutOfMemoryError: Java heap space as I read a
massive number of JSON files, iteratively via read.json(). Even though the
resulting RDD is rather small, I still get the OOM Error. The brief structure
of my program reads as follows, in pseudo-code:

file_path_list.map { jsonFile: String =>
  sqlContext.read.json(jsonFile)
    .select($"some", $"fields")
    .withColumn("new_col", some_transformations($"col"))
    .rdd.map( x: Row => (k, v) )
    .combineByKey() // which groups a column into item lists by another column as keys
}.reduce( (i, j) => i.union(j) )
 .combineByKey() // which combines results from all json files

I confess some of the JSON files are gigabytes in size, yet the combined RDD
is only a few megabytes. I’m not familiar with the under-the-hood mechanism,
but my intuitive understanding of how the code executes is: read the files
one at a time (where I can easily change map to foreach when fetching from
file_path_list, if that’s the case), do the inner transformation on the DF and
combine, then reduce and do the outer combine immediately, which shouldn’t
require holding all RDDs generated from all files in memory. Obviously, since
my code raises the OOM Error, I must have missed something important.

From the debug log, I can tell the OOM Error happens when reading the same
file, which is in a modest size of 2GB, while driver.memory is set to 13GB,
and the available memory size before the code execution is around 8GB, on
my standalone machine running as “local[8]”.

To overcome this, I also tried to initialize an empty universal RDD
variable, iteratively read one file at a time using foreach, then instead
of reduce, simply combine each RDD generated by the json files, except the
OOM Error remains.

Other configurations:

   - set(“spark.storage.memoryFraction”, “0.1”) // no cache of RDD is used
   - set(“spark.serializer”, “org.apache.spark.serializer.KryoSerializer”)

Any suggestions other than scale up/out the spark cluster?

BR,
Todd Leo
​


OutOfMemoryError OOM ByteArrayOutputStream.hugeCapacity

2015-10-12 Thread Alexander Pivovarov
I have one job which fails if I enable KryoSerializer

I use spark 1.5.0 on emr-4.1.0

Settings:
spark.serializer                    org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max     1024m
spark.executor.memory               47924M
spark.yarn.executor.memoryOverhead  5324


The job works fine if I keep the default spark.serializer, BUT it fails if I
use KryoSerializer. I tried increasing the Kryo serializer buffer max to 1024m
and I am still getting the OOM error.
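
For reference, a hedged sketch of those Kryo settings in code form (MyRecord is
a made-up placeholder class, not from this job); note that no buffer setting can
help if a single serialized object approaches the 2 GB Java array limit that
ByteArrayOutputStream.hugeCapacity guards:

// Sketch only: typical Kryo configuration with explicit class registration.
case class MyRecord(id: Long, value: Double)       // placeholder application class
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "1024m") // ceiling of Kryo's per-task output buffer
  .registerKryoClasses(Array(classOf[MyRecord]))   // avoids embedding full class names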


java.lang.OutOfMemoryError
at
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at
org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
at
org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:306)
at
org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:245)
at
org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)
at
org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:189)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
at
com.esotericsoftware.kryo.serializers.DefaultSerializers$DoubleSerializer.write(DefaultSerializers.java:137)
at
com.esotericsoftware.kryo.serializers.DefaultSerializers$DoubleSerializer.write(DefaultSerializers.java:131)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:576)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)


Re: OutOfMemoryError

2015-10-09 Thread Ramkumar V
How to increase the Xmx of the workers ?

*Thanks*,



On Mon, Oct 5, 2015 at 3:48 PM, Ramkumar V  wrote:

> No. I didn't try to increase xmx.
>
> *Thanks*,
> 
>
>
> On Mon, Oct 5, 2015 at 1:36 PM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Ramkumar,
>>
>> did you try to increase Xmx of the workers ?
>>
>> Regards
>> JB
>>
>> On 10/05/2015 08:56 AM, Ramkumar V wrote:
>>
>>> Hi,
>>>
>>> When i submit java spark job in cluster mode, i'm getting following
>>> exception.
>>>
>>> *LOG TRACE :*
>>>
>>> INFO yarn.ExecutorRunnable: Setting up executor with commands:
>>> List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill
>>>   %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,
>>> '-Dspark.ui.port=0', '-Dspark.driver.port=48309',
>>> -Dspark.yarn.app.container.log.dir=>> _DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend,
>>> --driver-url, akka.tcp://sparkDriver@ip
>>> :port/user/CoarseGrainedScheduler,
>>>   --executor-id, 2, --hostname, hostname , --cores, 1, --app-id,
>>> application_1441965028669_9009, --user-class-path, file:$PWD
>>> /__app__.jar, --user-class-path, file:$PWD/json-20090211.jar, 1>,
>>> /stdout, 2>, /stderr).
>>>
>>> I have a cluster of 11 machines (9 - 64 GB memory and 2 - 32 GB memory
>>> ). my input data of size 128 GB.
>>>
>>> How to solve this exception ? is it depends on driver.memory and
>>> execuitor.memory setting ?
>>>
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: OutOfMemoryError

2015-10-09 Thread Ted Yu
You can add it in conf/spark-defaults.conf:

 # spark.executor.extraJavaOptions  -XX:+PrintGCDetails

FYI
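
For illustration, the equivalent expressed through SparkConf (the values are
placeholders); the -Xmx1024m visible in the launch command appears to come from
the 1g default of spark.executor.memory, so that is the property to raise rather
than a raw -Xmx flag:

// Sketch: raise the executor heap and surface GC details in the executor logs.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")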

On Fri, Oct 9, 2015 at 3:07 AM, Ramkumar V  wrote:

> How to increase the Xmx of the workers ?
>
> *Thanks*,
> 
>
>
> On Mon, Oct 5, 2015 at 3:48 PM, Ramkumar V 
> wrote:
>
>> No. I didn't try to increase xmx.
>>
>> *Thanks*,
>> 
>>
>>
>> On Mon, Oct 5, 2015 at 1:36 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Ramkumar,
>>>
>>> did you try to increase Xmx of the workers ?
>>>
>>> Regards
>>> JB
>>>
>>> On 10/05/2015 08:56 AM, Ramkumar V wrote:
>>>
 Hi,

 When i submit java spark job in cluster mode, i'm getting following
 exception.

 *LOG TRACE :*

 INFO yarn.ExecutorRunnable: Setting up executor with commands:
 List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill
   %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,
 '-Dspark.ui.port=0', '-Dspark.driver.port=48309',
 -Dspark.yarn.app.container.log.dir=>>> _DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend,
 --driver-url, akka.tcp://sparkDriver@ip
 :port/user/CoarseGrainedScheduler,
   --executor-id, 2, --hostname, hostname , --cores, 1, --app-id,
 application_1441965028669_9009, --user-class-path, file:$PWD
 /__app__.jar, --user-class-path, file:$PWD/json-20090211.jar, 1>,
 /stdout, 2>, /stderr).

 I have a cluster of 11 machines (9 - 64 GB memory and 2 - 32 GB memory
 ). my input data of size 128 GB.

 How to solve this exception ? is it depends on driver.memory and
 execuitor.memory setting ?


 *Thanks*,
 


>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


OutOfMemoryError

2015-10-05 Thread Ramkumar V
Hi,

When I submit a Java Spark job in cluster mode, I'm getting the following
exception.

*LOG TRACE :*

INFO yarn.ExecutorRunnable: Setting up executor with commands:
List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill %p',
-Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp, '-Dspark.ui.port=0',
'-Dspark.driver.port=48309', -Dspark.yarn.app.container.log.dir=<LOG_DIR>,
org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url,
akka.tcp://sparkDriver@ip:port/user/CoarseGrainedScheduler, --executor-id, 2,
--hostname, hostname, --cores, 1, --app-id, application_1441965028669_9009,
--user-class-path, file:$PWD/__app__.jar, --user-class-path,
file:$PWD/json-20090211.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr).

I have a cluster of 11 machines (9 with 64 GB memory and 2 with 32 GB memory).
My input data is 128 GB in size.

How do I solve this exception? Does it depend on the driver.memory and
executor.memory settings?


*Thanks*,



Re: OutOfMemoryError

2015-10-05 Thread Jean-Baptiste Onofré

Hi Ramkumar,

did you try to increase Xmx of the workers ?

Regards
JB

On 10/05/2015 08:56 AM, Ramkumar V wrote:

Hi,

When i submit java spark job in cluster mode, i'm getting following
exception.

*LOG TRACE :*

INFO yarn.ExecutorRunnable: Setting up executor with commands:
List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill
  %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,
'-Dspark.ui.port=0', '-Dspark.driver.port=48309',
-Dspark.yarn.app.container.log.dir=, org.apache.spark.executor.CoarseGrainedExecutorBackend,
--driver-url, akka.tcp://sparkDriver@ip:port/user/CoarseGrainedScheduler,
  --executor-id, 2, --hostname, hostname , --cores, 1, --app-id,
application_1441965028669_9009, --user-class-path, file:$PWD
/__app__.jar, --user-class-path, file:$PWD/json-20090211.jar, 1>,
/stdout, 2>, /stderr).

I have a cluster of 11 machines (9 - 64 GB memory and 2 - 32 GB memory
). my input data of size 128 GB.

How to solve this exception ? is it depends on driver.memory and
execuitor.memory setting ?


*Thanks*,




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: OutOfMemoryError

2015-10-05 Thread Ramkumar V
No. I didn't try to increase xmx.

*Thanks*,



On Mon, Oct 5, 2015 at 1:36 PM, Jean-Baptiste Onofré 
wrote:

> Hi Ramkumar,
>
> did you try to increase Xmx of the workers ?
>
> Regards
> JB
>
> On 10/05/2015 08:56 AM, Ramkumar V wrote:
>
>> Hi,
>>
>> When i submit java spark job in cluster mode, i'm getting following
>> exception.
>>
>> *LOG TRACE :*
>>
>> INFO yarn.ExecutorRunnable: Setting up executor with commands:
>> List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill
>>   %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,
>> '-Dspark.ui.port=0', '-Dspark.driver.port=48309',
>> -Dspark.yarn.app.container.log.dir=> _DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend,
>> --driver-url, akka.tcp://sparkDriver@ip:port/user/CoarseGrainedScheduler,
>>   --executor-id, 2, --hostname, hostname , --cores, 1, --app-id,
>> application_1441965028669_9009, --user-class-path, file:$PWD
>> /__app__.jar, --user-class-path, file:$PWD/json-20090211.jar, 1>,
>> /stdout, 2>, /stderr).
>>
>> I have a cluster of 11 machines (9 - 64 GB memory and 2 - 32 GB memory
>> ). my input data of size 128 GB.
>>
>> How to solve this exception ? is it depends on driver.memory and
>> execuitor.memory setting ?
>>
>>
>> *Thanks*,
>> 
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


What happens to this RDD? OutOfMemoryError

2015-09-04 Thread Kevin Mandich
Hi All,

I'm using PySpark to create a corpus of labeled data points. I create an
RDD called corpus, and then join to this RDD each newly-created feature RDD
as I go. My code repeats something like this for each feature:

feature = raw_data_rdd.map(...).reduceByKey(...).map(...) # create feature
RDD
corpus = corpus.join(feature).map(lambda x: (x[0], x[1][0] + (x[1][1],)) #
"append" new feature to existing corpus

The corpus RDD is a key-value tuple, where the key is the label and the
value is a tuple of the features. I repeat the above for the 6 features I'm
working with. It looks like I'm running into a memory error when performing
the join on the last feature. Here's some relevant information:

- raw_data_rdd has ~ 50 million entries, while feature and corpus have ~
450k after the map-reduce operations
- The driver and each of the 6 executor nodes have 6GB memory available
- I'm kicking off the script using the following:
pyspark --driver-memory 2G --executor-memory 2G --conf
spark.akka.frameSize=64 create_corpus.py

My question is: why would I be running out of memory when joining the
relatively small feature and corpus RDDs? Also, what happens to the "old"
corpus RDD when I join it and point corpus to the new, larger RDD? Does
it stay in memory, and could this be the reason I'm running into the
issue? If so, is there a better way of "appending" to my corpus RDD? Should
I be persisting raw_data_rdd? The full error is shown below.

Please let me know if I'm missing something obvious. Thank you!

Kevin Mandich


Exception in thread "refresh progress" Exception in thread "SparkListenerBus"
[2015-09-04 20:43:14,385] {bash_operator.py:58} INFO - Exception:
java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in
thread "SparkListenerBus"
[2015-09-04 20:43:30,999] {bash_operator.py:58} INFO - Exception in
thread "qtp268929808-35" java.lang.OutOfMemoryError: Java heap space
[2015-09-04 20:43:30,999] {bash_operator.py:58} INFO - at
java.util.concurrent.locks.AbstractQueuedSynchronizer.addWaiter(AbstractQueuedSynchronizer.java:606)
[2015-09-04 20:43:30,999] {bash_operator.py:58} INFO - at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:883)
[2015-09-04 20:43:31,000] {bash_operator.py:58} INFO - at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1221)
[2015-09-04 20:43:32,562] {bash_operator.py:58} INFO - at
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
[2015-09-04 20:43:32,562] {bash_operator.py:58} INFO - at
org.spark-project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:333)
[2015-09-04 20:43:32,563] {bash_operator.py:58} INFO - at
org.spark-project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
[2015-09-04 20:43:32,563] {bash_operator.py:58} INFO - at
org.spark-project.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
[2015-09-04 20:43:32,563] {bash_operator.py:58} INFO - at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
[2015-09-04 20:43:32,563] {bash_operator.py:58} INFO - at
java.lang.Thread.run(Thread.java:745)
[2015-09-04 20:43:32,563] {bash_operator.py:58} INFO -
java.lang.OutOfMemoryError: Java heap space
[2015-09-04 20:43:32,563] {bash_operator.py:58} INFO - Exception in
thread "qtp1514449570-77" java.lang.OutOfMemoryError: Java heap space
[2015-09-04 20:43:37,366] {bash_operator.py:58} INFO - at
java.util.concurrent.ConcurrentHashMap$KeySet.iterator(ConcurrentHashMap.java:1428)
[2015-09-04 20:43:37,366] {bash_operator.py:58} INFO - at
org.spark-project.jetty.io.nio.SelectorManager$SelectSet$1.run(SelectorManager.java:712)
[2015-09-04 20:43:37,366] {bash_operator.py:58} INFO - at
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
[2015-09-04 20:43:41,458] {bash_operator.py:58} INFO - at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
[2015-09-04 20:43:41,459] {bash_operator.py:58} INFO - at
java.lang.Thread.run(Thread.java:745)
[2015-09-04 20:55:04,411] {bash_operator.py:58} INFO - Exception in
thread "qtp1514449570-72"
[2015-09-04 20:55:04,412] {bash_operator.py:58} INFO - Exception:
java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in
thread "qtp1514449570-72"
[2015-09-04 20:58:25,671] {bash_operator.py:58} INFO - Exception in
thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java
heap space


Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Mike Trienis
Thanks for your response Yana,

I can increase the MaxPermSize parameter and it will allow me to run the
unit test a few more times before I run out of memory.

However, the primary issue is that running the same unit test in the same
JVM (multiple times) results in increased memory (each run of the unit
test) and I believe it has something to do with HiveContext not reclaiming
memory after it is finished (or I'm not shutting it down properly).

It could very well be related to sbt, however, it's not clear to me.


On Tue, Aug 25, 2015 at 1:12 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 The PermGen space error is controlled with MaxPermSize parameter. I run
 with this in my pom, I think copied pretty literally from Spark's own
 tests... I don't know what the sbt equivalent is but you should be able to
 pass it...possibly via SBT_OPTS?


  plugin
   groupIdorg.scalatest/groupId
   artifactIdscalatest-maven-plugin/artifactId
   version1.0/version
   configuration

 reportsDirectory${project.build.directory}/surefire-reports/reportsDirectory
   parallelfalse/parallel
   junitxml./junitxml
   filereportsSparkTestSuite.txt/filereports
   argLine-Xmx3g -XX:MaxPermSize=256m
 -XX:ReservedCodeCacheSize=512m/argLine
   stderr/
   systemProperties
   java.awt.headlesstrue/java.awt.headless
   spark.testing1/spark.testing
   spark.ui.enabledfalse/spark.ui.enabled

 spark.driver.allowMultipleContextstrue/spark.driver.allowMultipleContexts
   /systemProperties
   /configuration
   executions
   execution
   idtest/id
   goals
   goaltest/goal
   /goals
   /execution
   /executions
   /plugin
   /plugins


 On Tue, Aug 25, 2015 at 2:10 PM, Mike Trienis mike.trie...@orcsol.com
 wrote:

 Hello,

 I am using sbt and created a unit test where I create a `HiveContext` and
 execute some query and then return. Each time I run the unit test the JVM
 will increase it's memory usage until I get the error:

 Internal error when running tests: java.lang.OutOfMemoryError: PermGen
 space
 Exception in thread Thread-2 java.io.EOFException

 As a work-around, I can fork a new JVM each time I run the unit test,
 however, it seems like a bad solution as takes a while to run the unit
 test.

 By the way, I tried to importing the TestHiveContext:

- import org.apache.spark.sql.hive.test.TestHiveContext

 However, it suffers from the same memory issue. Has anyone else suffered
 from the same problem? Note that I am running these unit tests on my mac.

 Cheers, Mike.





Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Michael Armbrust
I'd suggest setting sbt to fork when running tests.
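
For illustration, the corresponding build.sbt settings might look like this
(sbt 0.13-era syntax, since the thread predates sbt 1.x; the memory values are
placeholders):

// Run tests in a forked JVM so HiveContext's memory is reclaimed when that JVM exits.
fork in Test := true
javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=256m")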

On Wed, Aug 26, 2015 at 10:51 AM, Mike Trienis mike.trie...@orcsol.com
wrote:

 Thanks for your response Yana,

 I can increase the MaxPermSize parameter and it will allow me to run the
 unit test a few more times before I run out of memory.

 However, the primary issue is that running the same unit test in the same
 JVM (multiple times) results in increased memory (each run of the unit
 test) and I believe it has something to do with HiveContext not reclaiming
 memory after it is finished (or I'm not shutting it down properly).

 It could very well be related to sbt, however, it's not clear to me.


 On Tue, Aug 25, 2015 at 1:12 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 The PermGen space error is controlled with MaxPermSize parameter. I run
 with this in my pom, I think copied pretty literally from Spark's own
 tests... I don't know what the sbt equivalent is but you should be able to
 pass it...possibly via SBT_OPTS?


  plugin
   groupIdorg.scalatest/groupId
   artifactIdscalatest-maven-plugin/artifactId
   version1.0/version
   configuration

 reportsDirectory${project.build.directory}/surefire-reports/reportsDirectory
   parallelfalse/parallel
   junitxml./junitxml
   filereportsSparkTestSuite.txt/filereports
   argLine-Xmx3g -XX:MaxPermSize=256m
 -XX:ReservedCodeCacheSize=512m/argLine
   stderr/
   systemProperties
   java.awt.headlesstrue/java.awt.headless
   spark.testing1/spark.testing
   spark.ui.enabledfalse/spark.ui.enabled

 spark.driver.allowMultipleContextstrue/spark.driver.allowMultipleContexts
   /systemProperties
   /configuration
   executions
   execution
   idtest/id
   goals
   goaltest/goal
   /goals
   /execution
   /executions
   /plugin
   /plugins


 On Tue, Aug 25, 2015 at 2:10 PM, Mike Trienis mike.trie...@orcsol.com
 wrote:

 Hello,

 I am using sbt and created a unit test where I create a `HiveContext`
 and execute some query and then return. Each time I run the unit test the
 JVM will increase it's memory usage until I get the error:

 Internal error when running tests: java.lang.OutOfMemoryError: PermGen
 space
 Exception in thread Thread-2 java.io.EOFException

 As a work-around, I can fork a new JVM each time I run the unit test,
 however, it seems like a bad solution as takes a while to run the unit
 test.

 By the way, I tried to importing the TestHiveContext:

- import org.apache.spark.sql.hive.test.TestHiveContext

 However, it suffers from the same memory issue. Has anyone else suffered
 from the same problem? Note that I am running these unit tests on my mac.

 Cheers, Mike.






How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Mike Trienis
Hello,

I am using sbt and created a unit test where I create a `HiveContext`, execute
some query, and then return. Each time I run the unit test the JVM increases
its memory usage until I get the error:

Internal error when running tests: java.lang.OutOfMemoryError: PermGen space
Exception in thread Thread-2 java.io.EOFException

As a work-around, I can fork a new JVM each time I run the unit test; however,
it seems like a bad solution as it takes a while to run the unit test.

By the way, I tried to importing the TestHiveContext:

   - import org.apache.spark.sql.hive.test.TestHiveContext

However, it suffers from the same memory issue. Has anyone else suffered
from the same problem? Note that I am running these unit tests on my mac.

Cheers, Mike.


Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Yana Kadiyska
The PermGen space error is controlled with MaxPermSize parameter. I run
with this in my pom, I think copied pretty literally from Spark's own
tests... I don't know what the sbt equivalent is but you should be able to
pass it...possibly via SBT_OPTS?


<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <version>1.0</version>
  <configuration>
    <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
    <parallel>false</parallel>
    <junitxml>.</junitxml>
    <filereports>SparkTestSuite.txt</filereports>
    <argLine>-Xmx3g -XX:MaxPermSize=256m -XX:ReservedCodeCacheSize=512m</argLine>
    <stderr/>
    <systemProperties>
      <java.awt.headless>true</java.awt.headless>
      <spark.testing>1</spark.testing>
      <spark.ui.enabled>false</spark.ui.enabled>
      <spark.driver.allowMultipleContexts>true</spark.driver.allowMultipleContexts>
    </systemProperties>
  </configuration>
  <executions>
    <execution>
      <id>test</id>
      <goals>
        <goal>test</goal>
      </goals>
    </execution>
  </executions>
</plugin>
</plugins>


On Tue, Aug 25, 2015 at 2:10 PM, Mike Trienis mike.trie...@orcsol.com
wrote:

 Hello,

 I am using sbt and created a unit test where I create a `HiveContext` and
 execute some query and then return. Each time I run the unit test the JVM
 will increase it's memory usage until I get the error:

 Internal error when running tests: java.lang.OutOfMemoryError: PermGen
 space
 Exception in thread Thread-2 java.io.EOFException

 As a work-around, I can fork a new JVM each time I run the unit test,
 however, it seems like a bad solution as takes a while to run the unit
 test.

 By the way, I tried to importing the TestHiveContext:

- import org.apache.spark.sql.hive.test.TestHiveContext

 However, it suffers from the same memory issue. Has anyone else suffered
 from the same problem? Note that I am running these unit tests on my mac.

 Cheers, Mike.




Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
That looks like it's during recovery from a checkpoint, so it'd be driver
memory not executor memory.

How big is the checkpoint directory that you're trying to restore from?

On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing spark.executor.memory
 e.g. from 1g to 2g but the below error still happens.

 Any recommendations? Something to do with specifying -Xmx in the submit
 job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead limit
 exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)






Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
I wonder during recovery from a checkpoint whether we can estimate the size
of the checkpoint and compare with Runtime.getRuntime().freeMemory().

If the size of the checkpoint is much bigger than free memory, log a warning, etc.
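
A rough sketch of that check, assuming a local-filesystem checkpoint directory
(the path and the one-half threshold are illustrative; an HDFS checkpoint would
use FileSystem.getContentSummary instead):

import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Total on-disk size of the checkpoint directory.
val checkpointDir = Paths.get("/tmp/spark/checkpoints")   // placeholder path
val checkpointBytes = Files.walk(checkpointDir).iterator().asScala
  .filter(p => Files.isRegularFile(p))
  .map(p => Files.size(p))
  .sum

// Heap currently available to the driver JVM.
val rt = Runtime.getRuntime
val freeHeap = rt.maxMemory - (rt.totalMemory - rt.freeMemory)

if (checkpointBytes > freeHeap / 2)
  println(s"WARN: checkpoint is $checkpointBytes bytes but only $freeHeap bytes of driver heap are free")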

Cheers

On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg dgoldenberg...@gmail.com
 wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't have
 the original checkpointing directory :(  Thanks for the clarification on
 spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be driver
 memory not executor memory.

 How big is the checkpoint directory that you're trying to restore from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing spark.executor.memory
 e.g. from 1g to 2g but the below error still happens.

 Any recommendations? Something to do with specifying -Xmx in the submit
 job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead limit
 exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)








How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the error below. We tried increasing spark.executor.memory, e.g.
from 1g to 2g, but the error still happens.

Any recommendations? Is it something to do with specifying -Xmx in the submit job
scripts?

Thanks.

Exception in thread main java.lang.OutOfMemoryError: GC overhead limit
exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
at
org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
at
org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
at
org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
at
org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)


Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Would there be a way to chunk up/batch up the contents of the checkpointing
directories as they're being processed by Spark Streaming?  Is it mandatory
to load the whole thing in one go?

On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate the
 size of the checkpoint and compare with Runtime.getRuntime().freeMemory().

 If the size of checkpoint is much bigger than free memory, log warning, etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing spark.executor.memory
 e.g. from 1g to 2g but the below error still happens.

 Any recommendations? Something to do with specifying -Xmx in the submit
 job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)









Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
You need to keep a certain number of RDDs around for checkpointing, based
on e.g. the window size. Those would all need to be loaded at once.

On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate the
 size of the checkpoint and compare with Runtime.getRuntime().freeMemory
 ().

 If the size of checkpoint is much bigger than free memory, log warning,
 etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing spark.executor.memory
 e.g. from 1g to 2g but the below error still happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)










Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
Looks like the workaround is to reduce the *window length*.

Cheers
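
For context, a minimal sketch of what reducing the window length looks like in a
DStream job; the socket source, the checkpoint path and the 2-minute window are
assumptions. The point is only that a shorter window means fewer generated RDDs
have to be kept around for checkpoint recovery:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-length-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(30))
ssc.checkpoint("/tmp/checkpoint")   // hypothetical checkpoint path

// Hypothetical word-count pairs; the interesting part is only the window arguments.
val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

// A shorter window keeps fewer RDDs alive for checkpoint recovery.
val counts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Minutes(2),    // window length (was e.g. Minutes(10))
  Seconds(30))   // slide interval

counts.print()
ssc.start()
ssc.awaitTermination()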

On Mon, Aug 10, 2015 at 10:07 AM, Cody Koeninger c...@koeninger.org wrote:

 You need to keep a certain number of rdds around for checkpointing, based
 on e.g. the window size.  Those would all need to be loaded at once.

 On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate the
 size of the checkpoint and compare with Runtime.getRuntime().freeMemory
 ().

 If the size of checkpoint is much bigger than free memory, log warning,
 etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing
 spark.executor.memory e.g. from 1g to 2g but the below error still 
 happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)











Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
"You need to keep a certain number of RDDs around for checkpointing" --
that seems like a hefty expense to pay in order to achieve fault
tolerance. Why does Spark persist whole RDDs of data? Shouldn't it be
sufficient to just persist the offsets, to know where to resume from?

Thanks.
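
The direct Kafka stream does expose the offsets of every batch, so an application
can persist just those itself; a rough sketch, assuming directStream was created
with KafkaUtils.createDirectStream and saveOffset stands in for whatever durable
store you use:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Hypothetical durable store for offsets (ZooKeeper, a database, ...).
def saveOffset(topic: String, partition: Int, untilOffset: Long): Unit =
  println(s"commit $topic/$partition up to $untilOffset")

// `directStream` is assumed to be the DStream returned by KafkaUtils.createDirectStream(...).
directStream.foreachRDD { rdd =>
  // Each RDD partition corresponds to one Kafka topic-partition offset range.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here, then record how far this batch got ...
  ranges.foreach(r => saveOffset(r.topic, r.partition, r.untilOffset))
}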

On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org wrote:

 You need to keep a certain number of rdds around for checkpointing, based
 on e.g. the window size.  Those would all need to be loaded at once.

 On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate the
 size of the checkpoint and compare with Runtime.getRuntime().freeMemory
 ().

 If the size of checkpoint is much bigger than free memory, log warning,
 etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing
 spark.executor.memory e.g. from 1g to 2g but the below error still 
 happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)











Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Well, RDDs also contain data, don't they?

The question is, what can be so hefty in the checkpointing directory that it
causes the Spark driver to run out of memory? It seems this makes
checkpointing expensive in terms of I/O and memory consumption. Two
network hops -- to the driver, then to the workers. Heavy file system usage, heavy
memory consumption... What can we do to offset some of these costs?



On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger c...@koeninger.org wrote:

 The rdd is indeed defined by mostly just the offsets / topic partitions.

 On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 You need to keep a certain number of rdds around for checkpointing --
 that seems like a hefty expense to pay in order to achieve fault
 tolerance.  Why does Spark persist whole RDD's of data?  Shouldn't it be
 sufficient to just persist the offsets, to know where to resume from?

 Thanks.


 On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org
 wrote:

 You need to keep a certain number of rdds around for checkpointing,
 based on e.g. the window size.  Those would all need to be loaded at once.

 On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate
 the size of the checkpoint and compare with Runtime.getRuntime().
 freeMemory().

 If the size of checkpoint is much bigger than free memory, log
 warning, etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the 
 clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore
 from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing
 spark.executor.memory e.g. from 1g to 2g but the below error still 
 happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at
 org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at
 org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
The rdd is indeed defined by mostly just the offsets / topic partitions.

On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
 wrote:

 You need to keep a certain number of rdds around for checkpointing --
 that seems like a hefty expense to pay in order to achieve fault
 tolerance.  Why does Spark persist whole RDD's of data?  Shouldn't it be
 sufficient to just persist the offsets, to know where to resume from?

 Thanks.


 On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org
 wrote:

 You need to keep a certain number of rdds around for checkpointing, based
 on e.g. the window size.  Those would all need to be loaded at once.

 On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate the
 size of the checkpoint and compare with Runtime.getRuntime().freeMemory
 ().

 If the size of checkpoint is much bigger than free memory, log warning,
 etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore
 from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing
 spark.executor.memory e.g. from 1g to 2g but the below error still 
 happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at
 org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)











Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
No, it's not as though a given KafkaRDD object contains an array of messages
that gets serialized with the object. Its compute method generates an
iterator of messages as needed, by connecting to Kafka.

I don't know what was so hefty in your checkpoint directory, because you
deleted it. My checkpoint directories are usually pretty reasonable in
size.

How many topic-partitions did you have, and how long was your window?
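
To illustrate the point, this is roughly what the checkpointed Kafka metadata per
batch boils down to; the topic name and offset values below are made up:

import org.apache.spark.streaming.kafka.OffsetRange

// Per batch, the Kafka-related state is a handful of (topic, partition, fromOffset,
// untilOffset) entries like these, not the messages themselves.
val ranges = Array(
  OffsetRange.create("events", 0, 1050L, 2100L),
  OffsetRange.create("events", 1, 980L, 1990L))

// A few dozen bytes per topic-partition per batch; the messages are re-fetched
// from Kafka when the KafkaRDD's partitions are actually computed.
ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset}..${r.untilOffset}"))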

On Mon, Aug 10, 2015 at 3:33 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
 wrote:

 Well, RDDs also contain data, don't they?

 The question is, what can be so hefty in the checkpointing directory to
 cause Spark driver to run out of memory?  It seems that it makes
 checkpointing expensive, in terms of I/O and memory consumption.  Two
 network hops -- to driver, then to workers.  Hefty file system usage, hefty
 memory consumption...   What can we do to offset some of these costs?



 On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger c...@koeninger.org
 wrote:

 The rdd is indeed defined by mostly just the offsets / topic partitions.

 On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 You need to keep a certain number of rdds around for checkpointing --
 that seems like a hefty expense to pay in order to achieve fault
 tolerance.  Why does Spark persist whole RDD's of data?  Shouldn't it be
 sufficient to just persist the offsets, to know where to resume from?

 Thanks.


 On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org
 wrote:

 You need to keep a certain number of rdds around for checkpointing,
 based on e.g. the window size.  Those would all need to be loaded at once.

 On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate
 the size of the checkpoint and compare with Runtime.getRuntime().
 freeMemory().

 If the size of checkpoint is much bigger than free memory, log
 warning, etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I
 don't have the original checkpointing directory :(  Thanks for the
 clarification on spark.driver.memory, I'll keep testing (at 2g things 
 seem
 OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
  wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore
 from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing
 spark.executor.memory e.g. from 1g to 2g but the below error still 
 happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at
 org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at
 org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Hello Dean & Others,
Thanks for the response.

I tried with 100, 200, 400, 600 and 1200 repartitions and with 100, 200, 400 and
800 executors. Each time, all of the join tasks complete in less than a
minute except one, and that one task runs forever. I have a huge cluster at
my disposal.

The data for each of the 1199 tasks is around 40 MB / 30k records, and for the one
never-ending task it is 1.5 GB / 98 million records. I can see that there is data
skew among the tasks. I had observed this a week earlier and have no clue how to
fix it; someone suggested that repartitioning might make things more parallel, but
the problem is still persistent.

Please suggest how to get the task to complete.
All I want to do is join two datasets (dataset1 is in a sequence file and
dataset2 is in Avro format).



Ex:
Tasks:
Index  ID    Attempt  Status   Locality Level  Executor ID / Host  Launch Time          Duration  GC Time  Shuffle Read Size / Records  Shuffle Spill (Memory)  Shuffle Spill (Disk)  Errors
0      3771  0        RUNNING  PROCESS_LOCAL   114 / host1         2015/05/04 01:27:44  7.3 min   19 s     1591.2 MB / 98931767         0.0 B                   0.0 B
1      3772  0        SUCCESS  PROCESS_LOCAL   226 / host2         2015/05/04 01:27:44  28 s      2 s      39.2 MB / 29754              0.0 B                   0.0 B
2      3773  0        SUCCESS  PROCESS_LOCAL   283 / host3         2015/05/04 01:27:44  26 s      2 s      39.0 MB / 29646              0.0 B                   0.0 B
5      3776  0        SUCCESS  PROCESS_LOCAL   320 / host4         2015/05/04 01:27:44  31 s      3 s      38.8 MB / 29512              0.0 B                   0.0 B
4      3775  0        SUCCESS  PROCESS_LOCAL   203 / host5         2015/05/04 01:27:44  41 s      3 s      38.4 MB / 29169              0.0 B                   0.0 B
3      3774  0        SUCCESS  PROCESS_LOCAL   84 / host6          2015/05/04 01:27:44  24 s      2 s      38.5 MB / 29258              0.0 B                   0.0 B
8      3779  0        SUCCESS  PROCESS_LOCAL   309 / host7         2015/05/04 01:27:44  31 s      4 s      39.5 MB / 30008              0.0 B                   0.0 B

There are 1200 tasks in total.
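
One way to confirm the suspected skew is to count records per join key on the
large side before joining; a sketch that assumes viEvents is the item-id-keyed
RDD of view events described further down in this discussion:

// `viEvents` is assumed to be an RDD[(Long, record)] keyed by item id.
val hotKeys = viEvents
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)                    // records per item id
  .sortBy(_._2, ascending = false)
  .take(10)                              // the ten hottest join keys

hotKeys.foreach { case (itemId, n) => println(s"item $itemId -> $n records") }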


On Sun, May 3, 2015 at 9:53 PM, Dean Wampler deanwamp...@gmail.com wrote:

 I don't know the full context of what you're doing, but serialization
 errors usually mean you're attempting to serialize something that can't be
 serialized, like the SparkContext. Kryo won't help there.

 The arguments to spark-submit you posted previously look good:

 2)  --num-executors 96 --driver-memory 12g --driver-java-options
 -XX:MaxPermSize=10G --executor-memory 12g --executor-cores 4

 I suspect you aren't getting the parallelism you need. For partitioning,
 if your data is in HDFS and your block size is 128MB, then you'll get ~195
 partitions anyway. If it takes 7 hours to do a join over 25GB of data, you
 have some other serious bottleneck. You should examine the web console and
 the logs to determine where all the time is going. Questions you might
 pursue:

- How long does each task take to complete?
- How many of those 195 partitions/tasks are processed at the same
time? That is, how many slots are available?  Maybe you need more nodes
if the number of slots is too low. Based on your command arguments, you
should be able to process 1/2 of them at a time, unless the cluster is 
 busy.
- Is the cluster swamped with other work?
- How much data does each task process? Is the data roughly the same
from one task to the next? If not, then you might have serious key skew?

 You may also need to research the details of how joins are implemented and
 some of the common tricks for organizing data to minimize having to shuffle
 all N by M records.



 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Sun, May 3, 2015 at 11:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Hello Deam,
 If I don;t use Kryo serializer i got Serialization error and hence am
 using it.
 If I don';t use partitionBy/reparition then the simply join never
 completed even after 7 hours and infact as next step i need to run it
 against 250G as that is my full dataset size. Someone here suggested to me
 to use repartition.

 Assuming reparition is mandatory , how do i decide whats the right number
 ? When i am using 400 i do not get NullPointerException that i talked
 about, which is strange. I never saw that exception against small random
 dataset but see it with 25G and again with 400 partitions , i do not see it.


 On Sun, May 3, 2015 at 9:15 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 IMHO, you are trying waaay to hard to optimize work on what is really a
 small data set. 25G, even 250G, is not that much data, especially if you've
 spent a month trying to get something to work that should be simple. All
 these errors are from optimization attempts.

 Kryo is great, but if it's not working reliably for some reason, then
 don't use it. Rather than force 200 partitions, let Spark try to figure out
 a good-enough number. (If you really need to force a partition count, use
 the repartition method instead, unless you're overriding the partitioner.)

 So. I recommend that you eliminate all the optimizations: Kryo,
 partitionBy, etc. Just use the simplest 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
IMHO, if your data or your algorithm is prone to data skew, I think you have
to fix this at the application level; Spark itself cannot overcome this
problem (when one key has a very large number of values). You may change your
algorithm to choose another shuffle key, or something along those lines, to avoid
shuffling on skewed keys.
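
One common way to act on that advice when the key itself cannot be changed is to
salt it, so that the hot key's records are spread over several artificial sub-keys;
a sketch that assumes viEvents (skewed) and listings (one record per item id) are
the pair RDDs shown later in this discussion, with an arbitrary salt factor:

import scala.util.Random

val saltFactor = 32  // arbitrary; tune to the degree of skew

// Skewed side: spread each key over `saltFactor` sub-keys.
val saltedEvents = viEvents.map { case (itemId, event) =>
  ((itemId, Random.nextInt(saltFactor)), event)
}

// Other side: replicate each record once per salt value so every sub-key finds a match.
val saltedListings = listings.flatMap { case (itemId, lstg) =>
  (0 until saltFactor).map(salt => ((itemId, salt), lstg))
}

// The hot key's records now land on `saltFactor` reducers instead of one.
val joined = saltedEvents.join(saltedListings)
  .map { case ((itemId, _), (event, lstg)) => (itemId, (event, lstg)) }

In practice you would usually salt only the handful of identified hot keys, so the
replication of the listings side stays cheap.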

2015-05-04 16:41 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Hello Dean  Others,
 Thanks for the response.

 I tried with 100,200, 400, 600 and 1200 repartitions with 100,200,400 and
 800 executors. Each time all the tasks of join complete in less than a
 minute except one and that one tasks runs forever. I have a huge cluster at
 my disposal.

 The data for each of 1199 tasks is around 40MB/30k records and for 1 never
 ending task is 1.5G/98million records. I see that there is data skew among
 tasks. I had observed this a week earlier and i have no clue on how to fix
 it and when someone suggested that repartition might make things more
 parallel, but the problem is still persistent.

 Please suggest on how to get the task to complete.
 All i want to do is join two datasets. (dataset1 is in sequence file and
 dataset2 is in avro format).



 Ex:
 Tasks IndexIDAttemptStatusLocality LevelExecutor ID / HostLaunch Time
 DurationGC TimeShuffle Read Size / RecordsShuffle Spill (Memory)Shuffle
 Spill (Disk)Errors  0 3771 0 RUNNING PROCESS_LOCAL 114 / host1 2015/05/04
 01:27:44 7.3 min  19 s  1591.2 MB / 98931767  0.0 B 0.0 B   1 3772 0
 SUCCESS PROCESS_LOCAL 226 / host2 2015/05/04 01:27:44 28 s  2 s  39.2 MB
 / 29754  0.0 B 0.0 B   2 3773 0 SUCCESS PROCESS_LOCAL 283 / host3 2015/05/04
 01:27:44 26 s  2 s  39.0 MB / 29646  0.0 B 0.0 B   5 3776 0 SUCCESS
 PROCESS_LOCAL 320 / host4 2015/05/04 01:27:44 31 s  3 s  38.8 MB / 29512
 0.0 B 0.0 B   4 3775 0 SUCCESS PROCESS_LOCAL 203 / host5 2015/05/04
 01:27:44 41 s  3 s  38.4 MB / 29169  0.0 B 0.0 B   3 3774 0 SUCCESS
 PROCESS_LOCAL 84 / host6 2015/05/04 01:27:44 24 s  2 s  38.5 MB / 29258
 0.0 B 0.0 B   8 3779 0 SUCCESS PROCESS_LOCAL 309 / host7 2015/05/04
 01:27:44 31 s  4 s  39.5 MB / 30008  0.0 B 0.0 B

 There are 1200 tasks in total.


 On Sun, May 3, 2015 at 9:53 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 I don't know the full context of what you're doing, but serialization
 errors usually mean you're attempting to serialize something that can't be
 serialized, like the SparkContext. Kryo won't help there.

 The arguments to spark-submit you posted previously look good:

 2)  --num-executors 96 --driver-memory 12g --driver-java-options
 -XX:MaxPermSize=10G --executor-memory 12g --executor-cores 4

 I suspect you aren't getting the parallelism you need. For partitioning,
 if your data is in HDFS and your block size is 128MB, then you'll get ~195
 partitions anyway. If it takes 7 hours to do a join over 25GB of data, you
 have some other serious bottleneck. You should examine the web console and
 the logs to determine where all the time is going. Questions you might
 pursue:

- How long does each task take to complete?
- How many of those 195 partitions/tasks are processed at the same
time? That is, how many slots are available?  Maybe you need more nodes
if the number of slots is too low. Based on your command arguments, you
should be able to process 1/2 of them at a time, unless the cluster is 
 busy.
- Is the cluster swamped with other work?
- How much data does each task process? Is the data roughly the same
from one task to the next? If not, then you might have serious key skew?

 You may also need to research the details of how joins are implemented
 and some of the common tricks for organizing data to minimize having to
 shuffle all N by M records.



 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Sun, May 3, 2015 at 11:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Hello Deam,
 If I don;t use Kryo serializer i got Serialization error and hence am
 using it.
 If I don';t use partitionBy/reparition then the simply join never
 completed even after 7 hours and infact as next step i need to run it
 against 250G as that is my full dataset size. Someone here suggested to me
 to use repartition.

 Assuming reparition is mandatory , how do i decide whats the right
 number ? When i am using 400 i do not get NullPointerException that i
 talked about, which is strange. I never saw that exception against small
 random dataset but see it with 25G and again with 400 partitions , i do not
 see it.


 On Sun, May 3, 2015 at 9:15 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 IMHO, you are trying waaay to hard to optimize work on what is really a
 small data set. 25G, even 250G, is not that much data, especially if you've
 spent a month trying to get something to work that should be simple. All
 these errors are from optimization attempts.


Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Hello Shao,
Can you talk more about the shuffle key, or point me to APIs that allow me to
change the shuffle key? I will try different keys and see how the performance changes.

What is the shuffle key by default?

On Mon, May 4, 2015 at 2:37 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

 IMHO If your data or your algorithm is prone to data skew, I think you
 have to fix this from application level, Spark itself cannot overcome this
 problem (if one key has large amount of values), you may change your
 algorithm to choose another shuffle key, somethings like this to avoid
 shuffle on skewed keys.

 2015-05-04 16:41 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Hello Dean  Others,
 Thanks for the response.

 I tried with 100,200, 400, 600 and 1200 repartitions with 100,200,400 and
 800 executors. Each time all the tasks of join complete in less than a
 minute except one and that one tasks runs forever. I have a huge cluster at
 my disposal.

 The data for each of 1199 tasks is around 40MB/30k records and for 1
 never ending task is 1.5G/98million records. I see that there is data skew
 among tasks. I had observed this a week earlier and i have no clue on how
 to fix it and when someone suggested that repartition might make things
 more parallel, but the problem is still persistent.

 Please suggest on how to get the task to complete.
 All i want to do is join two datasets. (dataset1 is in sequence file and
 dataset2 is in avro format).



 Ex:
 Tasks IndexIDAttemptStatusLocality LevelExecutor ID / HostLaunch Time
 DurationGC TimeShuffle Read Size / RecordsShuffle Spill (Memory)Shuffle
 Spill (Disk)Errors  0 3771 0 RUNNING PROCESS_LOCAL 114 / host1 2015/05/04
 01:27:44 7.3 min  19 s  1591.2 MB / 98931767  0.0 B 0.0 B   1 3772 0
 SUCCESS PROCESS_LOCAL 226 / host2 2015/05/04 01:27:44 28 s  2 s  39.2 MB
 / 29754  0.0 B 0.0 B   2 3773 0 SUCCESS PROCESS_LOCAL 283 / host3 2015/05/04
 01:27:44 26 s  2 s  39.0 MB / 29646  0.0 B 0.0 B   5 3776 0 SUCCESS
 PROCESS_LOCAL 320 / host4 2015/05/04 01:27:44 31 s  3 s  38.8 MB / 29512
 0.0 B 0.0 B   4 3775 0 SUCCESS PROCESS_LOCAL 203 / host5 2015/05/04
 01:27:44 41 s  3 s  38.4 MB / 29169  0.0 B 0.0 B   3 3774 0 SUCCESS
 PROCESS_LOCAL 84 / host6 2015/05/04 01:27:44 24 s  2 s  38.5 MB / 29258
 0.0 B 0.0 B   8 3779 0 SUCCESS PROCESS_LOCAL 309 / host7 2015/05/04
 01:27:44 31 s  4 s  39.5 MB / 30008  0.0 B 0.0 B

 There are 1200 tasks in total.


 On Sun, May 3, 2015 at 9:53 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 I don't know the full context of what you're doing, but serialization
 errors usually mean you're attempting to serialize something that can't be
 serialized, like the SparkContext. Kryo won't help there.

 The arguments to spark-submit you posted previously look good:

 2)  --num-executors 96 --driver-memory 12g --driver-java-options
 -XX:MaxPermSize=10G --executor-memory 12g --executor-cores 4

 I suspect you aren't getting the parallelism you need. For partitioning,
 if your data is in HDFS and your block size is 128MB, then you'll get ~195
 partitions anyway. If it takes 7 hours to do a join over 25GB of data, you
 have some other serious bottleneck. You should examine the web console and
 the logs to determine where all the time is going. Questions you might
 pursue:

- How long does each task take to complete?
- How many of those 195 partitions/tasks are processed at the same
time? That is, how many slots are available?  Maybe you need more nodes
if the number of slots is too low. Based on your command arguments, you
should be able to process 1/2 of them at a time, unless the cluster is 
 busy.
- Is the cluster swamped with other work?
- How much data does each task process? Is the data roughly the same
from one task to the next? If not, then you might have serious key skew?

 You may also need to research the details of how joins are implemented
 and some of the common tricks for organizing data to minimize having to
 shuffle all N by M records.



 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Sun, May 3, 2015 at 11:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Hello Deam,
 If I don;t use Kryo serializer i got Serialization error and hence am
 using it.
 If I don';t use partitionBy/reparition then the simply join never
 completed even after 7 hours and infact as next step i need to run it
 against 250G as that is my full dataset size. Someone here suggested to me
 to use repartition.

 Assuming reparition is mandatory , how do i decide whats the right
 number ? When i am using 400 i do not get NullPointerException that i
 talked about, which is strange. I never saw that exception against small
 random dataset but see it with 25G and again with 400 partitions , i do not
 see it.


 On Sun, May 3, 2015 at 9:15 PM, Dean Wampler 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
The shuffle key depends on your implementation. I'm not sure if you are
familiar with MapReduce: the mapper output is a key-value pair, and the
key is the shuffle key used for shuffling. Spark works the same way.
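
Concretely, for the pair RDDs used in this thread the shuffle key is simply the
first element of each tuple; a toy sketch assuming an existing SparkContext sc
and made-up data:

// In a pair RDD the first tuple element is the key, and it is that key which
// gets hashed to pick a reducer during the shuffle.
val listings = sc.parallelize(Seq((1L, "itemA"), (2L, "itemB")))
val views    = sc.parallelize(Seq((1L, "view-1"), (1L, "view-2"), (2L, "view-3")))

val joined = listings.join(views)   // both sides are shuffled by the Long item id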

2015-05-04 17:31 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Hello Shao,
 Can you talk more about shuffle key or point me to APIs that allow me to
 change shuffle key. I will try with different keys and see the performance.

 What is the shuffle key by default ?

 On Mon, May 4, 2015 at 2:37 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 IMHO If your data or your algorithm is prone to data skew, I think you
 have to fix this from application level, Spark itself cannot overcome this
 problem (if one key has large amount of values), you may change your
 algorithm to choose another shuffle key, somethings like this to avoid
 shuffle on skewed keys.

 2015-05-04 16:41 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Hello Dean  Others,
 Thanks for the response.

 I tried with 100,200, 400, 600 and 1200 repartitions with 100,200,400
 and 800 executors. Each time all the tasks of join complete in less than a
 minute except one and that one tasks runs forever. I have a huge cluster at
 my disposal.

 The data for each of 1199 tasks is around 40MB/30k records and for 1
 never ending task is 1.5G/98million records. I see that there is data skew
 among tasks. I had observed this a week earlier and i have no clue on how
 to fix it and when someone suggested that repartition might make things
 more parallel, but the problem is still persistent.

 Please suggest on how to get the task to complete.
 All i want to do is join two datasets. (dataset1 is in sequence file and
 dataset2 is in avro format).



 Ex:
 Tasks IndexIDAttemptStatusLocality LevelExecutor ID / HostLaunch Time
 DurationGC TimeShuffle Read Size / RecordsShuffle Spill (Memory)Shuffle
 Spill (Disk)Errors  0 3771 0 RUNNING PROCESS_LOCAL 114 / host1 2015/05/04
 01:27:44 7.3 min  19 s  1591.2 MB / 98931767  0.0 B 0.0 B   1 3772 0
 SUCCESS PROCESS_LOCAL 226 / host2 2015/05/04 01:27:44 28 s  2 s  39.2
 MB / 29754  0.0 B 0.0 B   2 3773 0 SUCCESS PROCESS_LOCAL 283 / host3 
 2015/05/04
 01:27:44 26 s  2 s  39.0 MB / 29646  0.0 B 0.0 B   5 3776 0 SUCCESS
 PROCESS_LOCAL 320 / host4 2015/05/04 01:27:44 31 s  3 s  38.8 MB /
 29512  0.0 B 0.0 B   4 3775 0 SUCCESS PROCESS_LOCAL 203 / host5 2015/05/04
 01:27:44 41 s  3 s  38.4 MB / 29169  0.0 B 0.0 B   3 3774 0 SUCCESS
 PROCESS_LOCAL 84 / host6 2015/05/04 01:27:44 24 s  2 s  38.5 MB / 29258
 0.0 B 0.0 B   8 3779 0 SUCCESS PROCESS_LOCAL 309 / host7 2015/05/04
 01:27:44 31 s  4 s  39.5 MB / 30008  0.0 B 0.0 B

 There are 1200 tasks in total.


 On Sun, May 3, 2015 at 9:53 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 I don't know the full context of what you're doing, but serialization
 errors usually mean you're attempting to serialize something that can't be
 serialized, like the SparkContext. Kryo won't help there.

 The arguments to spark-submit you posted previously look good:

 2)  --num-executors 96 --driver-memory 12g --driver-java-options
 -XX:MaxPermSize=10G --executor-memory 12g --executor-cores 4

 I suspect you aren't getting the parallelism you need. For
 partitioning, if your data is in HDFS and your block size is 128MB, then
 you'll get ~195 partitions anyway. If it takes 7 hours to do a join over
 25GB of data, you have some other serious bottleneck. You should examine
 the web console and the logs to determine where all the time is going.
 Questions you might pursue:

- How long does each task take to complete?
- How many of those 195 partitions/tasks are processed at the same
time? That is, how many slots are available?  Maybe you need more 
 nodes
if the number of slots is too low. Based on your command arguments, you
should be able to process 1/2 of them at a time, unless the cluster is 
 busy.
- Is the cluster swamped with other work?
- How much data does each task process? Is the data roughly the
same from one task to the next? If not, then you might have serious key
skew?

 You may also need to research the details of how joins are implemented
 and some of the common tricks for organizing data to minimize having to
 shuffle all N by M records.



 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Sun, May 3, 2015 at 11:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Hello Deam,
 If I don;t use Kryo serializer i got Serialization error and hence am
 using it.
 If I don';t use partitionBy/reparition then the simply join never
 completed even after 7 hours and infact as next step i need to run it
 against 250G as that is my full dataset size. Someone here suggested to me
 to use repartition.

 Assuming reparition is mandatory , how do i decide whats the right
 number ? 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
One dataset (RDD Pair)

val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) }

Second Dataset (RDD Pair)

val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) }

I want to join based on item id, which is the first element of the tuple in
both cases, so I think that is the shuffle key.

listings == Data set contains all the unique item ids that are ever listed
on the ecommerce site.

viEvents === List of items viewed by user in last day. This will always be
a subset of the total set.

So I do not understand where the data skew is. When my long-running task is
working on 1591.2 MB / 98,931,767 records, does that mean all 98 million
records contain the same item id? How can millions of users have looked at the
same item in the last day?

Or does this dataset contain records spread across many item ids?


Regards,

Deepak




On Mon, May 4, 2015 at 3:08 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

 Shuffle key is depending on your implementation, I'm not sure if you are
 familiar with MapReduce, the mapper output is a key-value pair, where the
 key is the shuffle key for shuffling, Spark is also the same.

 2015-05-04 17:31 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Hello Shao,
 Can you talk more about shuffle key or point me to APIs that allow me to
 change shuffle key. I will try with different keys and see the performance.

 What is the shuffle key by default ?

 On Mon, May 4, 2015 at 2:37 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 IMHO If your data or your algorithm is prone to data skew, I think you
 have to fix this from application level, Spark itself cannot overcome this
 problem (if one key has large amount of values), you may change your
 algorithm to choose another shuffle key, somethings like this to avoid
 shuffle on skewed keys.

 2015-05-04 16:41 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Hello Dean  Others,
 Thanks for the response.

 I tried with 100,200, 400, 600 and 1200 repartitions with 100,200,400
 and 800 executors. Each time all the tasks of join complete in less than a
 minute except one and that one tasks runs forever. I have a huge cluster at
 my disposal.

 The data for each of 1199 tasks is around 40MB/30k records and for 1
 never ending task is 1.5G/98million records. I see that there is data skew
 among tasks. I had observed this a week earlier and i have no clue on how
 to fix it and when someone suggested that repartition might make things
 more parallel, but the problem is still persistent.

 Please suggest on how to get the task to complete.
 All i want to do is join two datasets. (dataset1 is in sequence file
 and dataset2 is in avro format).



 Ex:
 Tasks IndexIDAttemptStatusLocality LevelExecutor ID / HostLaunch Time
 DurationGC TimeShuffle Read Size / RecordsShuffle Spill (Memory)Shuffle
 Spill (Disk)Errors  0 3771 0 RUNNING PROCESS_LOCAL 114 / host1 2015/05/04
 01:27:44 7.3 min  19 s  1591.2 MB / 98931767  0.0 B 0.0 B   1 3772 0
 SUCCESS PROCESS_LOCAL 226 / host2 2015/05/04 01:27:44 28 s  2 s  39.2
 MB / 29754  0.0 B 0.0 B   2 3773 0 SUCCESS PROCESS_LOCAL 283 / host3 
 2015/05/04
 01:27:44 26 s  2 s  39.0 MB / 29646  0.0 B 0.0 B   5 3776 0 SUCCESS
 PROCESS_LOCAL 320 / host4 2015/05/04 01:27:44 31 s  3 s  38.8 MB /
 29512  0.0 B 0.0 B   4 3775 0 SUCCESS PROCESS_LOCAL 203 / host5 2015/05/04
 01:27:44 41 s  3 s  38.4 MB / 29169  0.0 B 0.0 B   3 3774 0 SUCCESS
 PROCESS_LOCAL 84 / host6 2015/05/04 01:27:44 24 s  2 s  38.5 MB /
 29258  0.0 B 0.0 B   8 3779 0 SUCCESS PROCESS_LOCAL 309 / host7 2015/05/04
 01:27:44 31 s  4 s  39.5 MB / 30008  0.0 B 0.0 B

 There are 1200 tasks in total.


 On Sun, May 3, 2015 at 9:53 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 I don't know the full context of what you're doing, but serialization
 errors usually mean you're attempting to serialize something that can't be
 serialized, like the SparkContext. Kryo won't help there.

 The arguments to spark-submit you posted previously look good:

 2)  --num-executors 96 --driver-memory 12g --driver-java-options
 -XX:MaxPermSize=10G --executor-memory 12g --executor-cores 4

 I suspect you aren't getting the parallelism you need. For
 partitioning, if your data is in HDFS and your block size is 128MB, then
 you'll get ~195 partitions anyway. If it takes 7 hours to do a join over
 25GB of data, you have some other serious bottleneck. You should examine
 the web console and the logs to determine where all the time is going.
 Questions you might pursue:

- How long does each task take to complete?
- How many of those 195 partitions/tasks are processed at the same
time? That is, how many slots are available?  Maybe you need more 
 nodes
if the number of slots is too low. Based on your command arguments, you
should be able to process 1/2 of them at a time, unless the cluster is 
 busy.
- Is the cluster swamped with other work?
- How much data does each task process? Is the data roughly the
same from one task to the next? If not, then you 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Four tasks are now failing with

Index  ID    Attempt  Status  Locality       Executor ID / Host  Launch Time          Error
0      3771  0        FAILED  PROCESS_LOCAL  114 / host1         2015/05/04 01:27:44  ExecutorLostFailure (executor 114 lost)
1007   4973  1        FAILED  PROCESS_LOCAL  420 / host2         2015/05/04 02:13:14  FetchFailed(null, shuffleId=1, mapId=-1, reduceId=1007, message=

FetchFailed(null, shuffleId=1, mapId=-1, reduceId=1007, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
output location for shuffle 1
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
at 
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:177)
at 
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at 
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

)

371    4972  1        FAILED  PROCESS_LOCAL  563 / host3         2015/05/04 02:13:14  FetchFailed(null, shuffleId=1, mapId=-1, reduceId=371, message=

FetchFailed(null, shuffleId=1, mapId=-1, reduceId=371, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an
output location for shuffle 1
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
From the symptoms you mentioned, that one task's shuffle write is much larger
than all the other tasks', it looks very much like normal data skew behavior.
I am only giving advice based on your description; you need to check whether
the data is actually skewed or not.

The shuffle puts data into tasks according to the partitioner (hash
partitioning by default), so all records with the same key land in the same
task. Note, however, that one task does not hold just one key.
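
As a concrete way to check for skew, a minimal sketch (toy data, local master,
placeholder names; not code from this thread) that counts records per join key
and prints the heaviest keys:

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object SkewCheckSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("skew-check").setMaster("local[2]"))
    // Stand-in for the viEvents pair RDD: (itemId, event payload).
    val viEvents = sc.parallelize(Seq((1L, "e1"), (1L, "e2"), (1L, "e3"), (2L, "e4")))
    // Count records per join key; a key whose count dwarfs the rest is the
    // skewed key feeding the never-ending task.
    viEvents
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(20)
      .foreach { case (k, n) => println("key=" + k + " count=" + n) }
    sc.stop()
  }
}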

2015-05-04 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Attached image shows the Spark UI for the job.





 On Mon, May 4, 2015 at 3:28 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Four tasks are now failing with

 IndexIDAttemptStatus ▾Locality LevelExecutor ID / HostLaunch TimeDurationGC
 TimeShuffle Read Size / RecordsShuffle Spill (Memory)Shuffle Spill (Disk)
 Errors  0 3771 0 FAILED PROCESS_LOCAL 114 / host1 2015/05/04 01:27:44   /
   ExecutorLostFailure (executor 114 lost)  1007 4973 1 FAILED
 PROCESS_LOCAL 420 / host2 2015/05/04 02:13:14   /   FetchFailed(null,
 shuffleId=1, mapId=-1, reduceId=1007, message= +details

 FetchFailed(null, shuffleId=1, mapId=-1, reduceId=1007, message=
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
 location for shuffle 1
  at 
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
  at 
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
  at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
  at 
 org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
  at 
 org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:177)
  at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
  at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
  at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
  at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
  at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:64)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

 )

  371 4972 1 FAILED PROCESS_LOCAL 563 / host3 2015/05/04 02:13:14   /   
 FetchFailed(null,
 shuffleId=1, mapId=-1, reduceId=371, message= +details

 FetchFailed(null, shuffleId=1, mapId=-1, reduceId=371, message=
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
 location for shuffle 1
  at 
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
  at 
 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
I ran it against one file instead of 10 files, and I see one task still
running after 33 minutes; its shuffle read size is 780 MB / 50 million records.

I did a count of records for each itemId from dataset 2 [one file] (the second
dataset (RDD pair): val viEvents = viEventsRaw.map { vi =>
(vi.get(14).asInstanceOf[Long], vi) }). This is the dataset that contains the
list of items viewed by users in one day.

*Item Id    Count*
201335783004 537
111654496030 353
141640703798 287
191568402102 258
111654479898 217
231521843148 211
251931716094 201
111654493548 181
181503913062 181
121635453050 152
261798565828 140
151494617682 139
251927181728 127
231516683056 119
141640492864 117
161677270656 117
171771073616 113
111649942124 109
191516989450 97
231539161292 94
221555628408 88
131497785968 87
121632233872 84
131335379184 83
281531363490 83
131492727742 79
231174157820 79
161666914810 77
251699753072 77
161683664300 76


I was assuming that data skew would mean the top item (201335783004) had a
count of, say, 1 million, but it is only a few hundred, so why is Spark
skewing the join? What should I do so that Spark distributes the records
evenly?

In M/R we can change the Partitioner between mapper and reducer; how can I do
that in Spark for a join?
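
For what it is worth, the RDD join API does accept an explicit Partitioner,
which is the closest analogue; a minimal sketch with toy data and placeholder
names (not code from this thread):

import org.apache.spark.SparkContext._
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object JoinPartitionerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-partitioner").setMaster("local[2]"))
    val lstgItem = sc.parallelize(Seq((1L, "listing-1"), (2L, "listing-2")))
    val viEvents = sc.parallelize(Seq((1L, "view-a"), (2L, "view-b")))
    // join(other, partitioner) is the RDD-level hook comparable to choosing
    // the Partitioner between map and reduce in MapReduce.
    val joined = lstgItem.join(viEvents, new HashPartitioner(1200))
    // Caveat: any partitioner still routes ALL records of one key to one
    // task, so this alone cannot split a single hot key across tasks.
    println(joined.toDebugString)
    sc.stop()
  }
}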


Index  ID    Attempt  Status   Locality       Executor / Host  Launch Time          Duration  GC Time  Shuffle Read Size / Records  Spill (Mem)  Spill (Disk)
0      3618  0        RUNNING  PROCESS_LOCAL  4 / host1        2015/05/04 05:09:53  33 min    8.5 min  783.9 MB / 50,761,322        4.6 GB       47.5 MB
433    4051  0        SUCCESS  PROCESS_LOCAL  1 / host2        2015/05/04 05:16:27  1.1 min   20 s     116.0 MB / 4505143           1282.3 MB    10.1 MB
218    3836  0        SUCCESS  PROCESS_LOCAL  3 / host3        2015/05/04 05:13:01  53 s      11 s     76.4 MB / 2865143            879.6 MB     6.9 MB
113    3731  0        SUCCESS  PROCESS_LOCAL  2 / host4        2015/05/04 05:11:30  31 s      8 s      6.9 MB / 5187                0.0 B        0.0 B

On Mon, May 4, 2015 at 6:00 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

 From the symptoms you mentioned that one task's shuffle write is much
 larger than all the other task, it is quite similar to normal data skew
 behavior, I just give some advice based on your descriptions, I think you
 need to detect whether data is actually skewed or not.

 The shuffle will put data with same partitioner strategy (default is hash
 partitioner) into one task, so the same key data will be put into the same
 task, but one task not just has only one key.

 2015-05-04 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Attached image shows the Spark UI for the job.





 On Mon, May 4, 2015 at 3:28 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Four tasks are now failing with

 IndexIDAttemptStatus ▾Locality LevelExecutor ID / HostLaunch Time
 DurationGC TimeShuffle Read Size / RecordsShuffle Spill (Memory)Shuffle
 Spill (Disk)Errors  0 3771 0 FAILED PROCESS_LOCAL 114 / host1 2015/05/04
 01:27:44   /   ExecutorLostFailure (executor 114 lost)  1007 4973 1
 FAILED PROCESS_LOCAL 420 / host2 2015/05/04 02:13:14   /   FetchFailed(null,
 shuffleId=1, mapId=-1, reduceId=1007, message= +details

 FetchFailed(null, shuffleId=1, mapId=-1, reduceId=1007, message=
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
 location for shuffle 1
 at 
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
 at 
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at 
 org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
 at 
 org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:177)
 at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
 at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
 at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
 at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
 at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 at 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
I tried this

val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))]
= lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions
= 1200, rdd = viEvents)).map {


It fired two jobs and still i have 1 task that never completes.
Index  ID    Attempt  Status   Locality       Executor / Host  Launch Time          Duration  GC Time  Shuffle Read Size / Records  Spill (Mem)  Spill (Disk)
0      4818  0        RUNNING  PROCESS_LOCAL  5 / host1        2015/05/04 07:24:25  1.1 h     13 min   778.0 MB / 50314161          4.5 GB       47.4 MB
955    5773  0        SUCCESS  PROCESS_LOCAL  5 / host2        2015/05/04 07:47:16  2.2 min   1.5 min  106.3 MB / 4197539           0.0 B        0.0 B
1199   6017  0        SUCCESS  PROCESS_LOCAL  3 / host3        2015/05/04 07:51:51  48 s      2 s      94.2 MB / 3618335            2.8 GB       8.6 MB
216    ...



2)
I tried reversing the datasets in join

val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))]
=viEvents.join(lstgItem)

This led to the same problem of a long-running task.
3)
Next, i am trying this

val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))]
= lstgItem.join(viEvents, 1200).map {


I have exhausted all my options.


Regards,

Deepak


On Mon, May 4, 2015 at 6:24 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I ran it against one file instead of 10 files and i see one task is still
 running after 33 mins its shuffle read size is 780MB/50 mil records.

 I did a count of records for each itemId from dataset-2 [One FILE] (Second
 Dataset (RDDPair) val viEvents = viEventsRaw.map { vi = (vi.get(14
 ).asInstanceOf[Long], vi) } ). This is the dataset that contains the list
 of items viewed by user in one day.

 *Item IdCount*
 201335783004 537
 111654496030 353
 141640703798 287
 191568402102 258
 111654479898 217
 231521843148 211
 251931716094 201
 111654493548 181
 181503913062 181
 121635453050 152
 261798565828 140
 151494617682 139
 251927181728 127
 231516683056 119
 141640492864 117
 161677270656 117
 171771073616 113
 111649942124 109
 191516989450 97
 231539161292 94
 221555628408 88
 131497785968 87
 121632233872 84
 131335379184 83
 281531363490 83
 131492727742 79
 231174157820 79
 161666914810 77
 251699753072 77
 161683664300 76


 I was assuming that data-skew would be if the top item(201335783004) had
 a count of 1 million, however its only few hundreds, then why is Spark
 skewing it in join ? What should i do that Spark distributes the records
 evenly ?

 In M/R we can change the Partitioner between mapper and reducer, how can i
 do in Spark  for Join?


 IndexIDAttemptStatusLocality LevelExecutor ID / HostLaunch TimeDurationGC
 TimeShuffle Read Size / Records ▴Shuffle Spill (Memory)Shuffle Spill
 (Disk)Errors  0 3618 0 RUNNING PROCESS_LOCAL 4 / host12015/05/04 05:09:53 33
 min  8.5 min  783.9 MB / 50,761,322  4.6 GB 47.5 MB   433 4051 0 SUCCESS
 PROCESS_LOCAL 1 / host2 2015/05/04 05:16:27 1.1 min  20 s  116.0 MB /
 4505143  1282.3 MB 10.1 MB   218 3836 0 SUCCESS PROCESS_LOCAL 3 / host3 
 2015/05/04
 05:13:01 53 s  11 s  76.4 MB / 2865143  879.6 MB 6.9 MB   113 3731 0
 SUCCESS PROCESS_LOCAL 2 / host4 2015/05/04 05:11:30 31 s  8 s  6.9 MB /
 5187  0.0 B 0.0 B

 On Mon, May 4, 2015 at 6:00 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 From the symptoms you mentioned that one task's shuffle write is much
 larger than all the other task, it is quite similar to normal data skew
 behavior, I just give some advice based on your descriptions, I think you
 need to detect whether data is actually skewed or not.

 The shuffle will put data with same partitioner strategy (default is hash
 partitioner) into one task, so the same key data will be put into the same
 task, but one task not just has only one key.

 2015-05-04 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 Attached image shows the Spark UI for the job.





 On Mon, May 4, 2015 at 3:28 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Four tasks are now failing with

 IndexIDAttemptStatus ▾Locality LevelExecutor ID / HostLaunch Time
 DurationGC TimeShuffle Read Size / RecordsShuffle Spill (Memory)Shuffle
 Spill (Disk)Errors  0 3771 0 FAILED PROCESS_LOCAL 114 / host1 2015/05/04
 01:27:44   /   ExecutorLostFailure (executor 114 lost)  1007 4973 1
 FAILED PROCESS_LOCAL 420 / host2 2015/05/04 02:13:14   /   
 FetchFailed(null,
 shuffleId=1, mapId=-1, reduceId=1007, message= +details

 FetchFailed(null, shuffleId=1, mapId=-1, reduceId=1007, message=
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
 location for shuffle 1
at 
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
at 
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Data set 1 : viEvents : the event activity data for one day. I took 10 files
out of it, and the top record counts per item id are:

*Item ID     Count*
201335783004 3419
191568402102 1793
111654479898 1362
181503913062 1310
261798565828 1028
111654493548 994
231516683056 862
131497785968 746
161666914810 633
221749455474 432
201324502754 410
201334042634 402
191562605592 380
271841178238 362
161663339210 344
251615941886 313
261855748678 309
271821726658 255
111657099518 224
261868369938 218
181725710132 216
171766164072 215
221757076934 213
171763906872 212
111650132368 206
181629904282 204
261867932788 198
161668475280 194
191398227282 194





Data set 2:
ItemID Count
2217305702 1
3842604614 1
4463421160 1
4581260446 1
4632783223 1
4645316947 1
4760829454 1
4786989430 1
5530758430 1
5610056107 1
5661929425 1
5953801612 1
6141607456 1
6197204992 1
6220704442 1
6271022614 1
6282402871 1
6525123621 1
6554834772 1
6566297541 1
This data set will always have only one element for each item, as it contains
metadata information.

Given the nature of these two datasets, if there is any skew at all, it must
be in dataset 1. In dataset 1 the top 20-30 items do not have a record count
for any given itemID (the shuffle key) greater than 3000, and that is very
small.

Why am I still *not* able to join these two datasets, given that I have
unlimited capacity and repartitioning, but a 12G memory limit on each node?
Each time I get one task that runs forever and processes roughly 1.5G of data
while the others process a few MBs. Also, 1.5G of data (shuffle read size) is
very small.

Please suggest.
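
Given that the listings side has exactly one record per item id while the
events side carries the hot keys, one common workaround is key salting: spread
each hot key over several sub-keys, replicating the one-row-per-key side. This
is not something used verbatim in this thread; a minimal sketch with toy data
and placeholder names:

import scala.util.Random
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object SaltedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("salted-join").setMaster("local[2]"))
    val numSalts = 10  // how many tasks a single hot item id gets spread over

    // Stand-ins: lstgItem has exactly one record per item id (metadata),
    // viEvents can have many records for a hot item id.
    val lstgItem = sc.parallelize(Seq((1L, "listing-1"), (2L, "listing-2")))
    val viEvents = sc.parallelize(Seq((1L, "view-a"), (1L, "view-b"), (2L, "view-c")))

    // Skewed side: append a random salt to the key.
    val saltedEvents = viEvents.map { case (itemId, vi) =>
      ((itemId, Random.nextInt(numSalts)), vi)
    }
    // Metadata side: replicate each record once per salt value.
    val saltedListings = lstgItem.flatMap { case (itemId, lstg) =>
      (0 until numSalts).map(salt => ((itemId, salt), lstg))
    }
    // Join on the composite (itemId, salt) key, then drop the salt.
    val joined = saltedEvents.join(saltedListings).map {
      case ((itemId, _), (vi, lstg)) => (itemId, (lstg, vi))
    }
    joined.collect().foreach(println)
    sc.stop()
  }
}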


On Mon, May 4, 2015 at 9:08 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I tried this

 val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary,
 Long))] = lstgItem.join(viEvents, new
 org.apache.spark.RangePartitioner(partitions = 1200, rdd = viEvents)).map
 {


 It fired two jobs and still i have 1 task that never completes.
 IndexIDAttemptStatusLocality LevelExecutor ID / HostLaunch TimeDurationGC
 TimeShuffle Read Size / Records ▴Shuffle Spill (Memory)Shuffle Spill
 (Disk)Errors  0 4818 0 RUNNING PROCESS_LOCAL 5 / host1 2015/05/04 07:24:25 1.1
 h  13 min  778.0 MB / 50314161  4.5 GB 47.4 MB   955 5773 0 SUCCESS
 PROCESS_LOCAL 5 / host2 2015/05/04 07:47:16 2.2 min  1.5 min  106.3 MB /
 4197539  0.0 B 0.0 B   1199 6017 0 SUCCESS PROCESS_LOCAL 3 / host3 2015/05/04
 07:51:51 48 s  2 s  94.2 MB / 3618335  2.8 GB 8.6 MB   216



 2)
 I tried reversing the datasets in join

 val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary,
 Long))] =viEvents.join(lstgItem)

 This led to same problem of a long running task.
 3)
 Next, i am trying this

 val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary,
 Long))] = lstgItem.join(viEvents, 1200).map {


 I have exhausted all my options.


 Regards,

 Deepak


 On Mon, May 4, 2015 at 6:24 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I ran it against one file instead of 10 files and i see one task is still
 running after 33 mins its shuffle read size is 780MB/50 mil records.

 I did a count of records for each itemId from dataset-2 [One FILE] (Second
 Dataset (RDDPair) val viEvents = viEventsRaw.map { vi = (vi.get(14
 ).asInstanceOf[Long], vi) } ). This is the dataset that contains the
 list of items viewed by user in one day.

 *Item IdCount*
 201335783004 537
 111654496030 353
 141640703798 287
 191568402102 258
 111654479898 217
 231521843148 211
 251931716094 201
 111654493548 181
 181503913062 181
 121635453050 152
 261798565828 140
 151494617682 139
 251927181728 127
 231516683056 119
 141640492864 117
 161677270656 117
 171771073616 113
 111649942124 109
 191516989450 97
 231539161292 94
 221555628408 88
 131497785968 87
 121632233872 84
 131335379184 83
 281531363490 83
 131492727742 79
 231174157820 79
 161666914810 77
 251699753072 77
 161683664300 76


 I was assuming that data-skew would be if the top item(201335783004) had
 a count of 1 million, however its only few hundreds, then why is Spark
 skewing it in join ? What should i do that Spark distributes the records
 evenly ?

 In M/R we can change the Partitioner between mapper and reducer, how can
 i do in Spark  for Join?


 IndexIDAttemptStatusLocality LevelExecutor ID / HostLaunch TimeDurationGC
 TimeShuffle Read Size / Records ▴Shuffle Spill (Memory)Shuffle Spill
 (Disk)Errors  0 3618 0 RUNNING PROCESS_LOCAL 4 / host12015/05/04 05:09:53 33
 min  8.5 min  783.9 MB / 50,761,322  4.6 GB 47.5 MB   433 4051 0 SUCCESS
 PROCESS_LOCAL 1 / host2 2015/05/04 05:16:27 1.1 min  20 s  116.0 MB /
 4505143  1282.3 MB 10.1 MB   218 3836 0 SUCCESS PROCESS_LOCAL 3 / host3 
 2015/05/04
 05:13:01 53 s  11 s  76.4 MB / 2865143  879.6 MB 6.9 MB   113 3731 0
 SUCCESS PROCESS_LOCAL 2 / host4 2015/05/04 05:11:30 31 s  8 s  6.9 MB /
 5187  0.0 B 0.0 B

 On Mon, May 4, 2015 at 6:00 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 From the symptoms you mentioned that one 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
I don't know the full context of what you're doing, but serialization
errors usually mean you're attempting to serialize something that can't be
serialized, like the SparkContext. Kryo won't help there.

The arguments to spark-submit you posted previously look good:

2)  --num-executors 96 --driver-memory 12g --driver-java-options
-XX:MaxPermSize=10G --executor-memory 12g --executor-cores 4

I suspect you aren't getting the parallelism you need. For partitioning, if
your data is in HDFS and your block size is 128MB, then you'll get ~195
partitions anyway. If it takes 7 hours to do a join over 25GB of data, you
have some other serious bottleneck. You should examine the web console and
the logs to determine where all the time is going. Questions you might
pursue:

   - How long does each task take to complete?
   - How many of those 195 partitions/tasks are processed at the same time?
   That is, how many slots are available?  Maybe you need more nodes if the
   number of slots is too low. Based on your command arguments, you should be
   able to process 1/2 of them at a time, unless the cluster is busy.
   - Is the cluster swamped with other work?
   - How much data does each task process? Is the data roughly the same
   from one task to the next? If not, then you might have serious key skew?

You may also need to research the details of how joins are implemented and
some of the common tricks for organizing data to minimize having to shuffle
all N by M records.
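
One such trick, sketched with toy data and placeholder names (local master;
not code from this thread), is to pre-partition both sides with the same
partitioner and persist them, so the join itself does not reshuffle either
side:

import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoPartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("co-partitioned-join").setMaster("local[2]"))
    val part = new HashPartitioner(400)
    // Partition both sides with the same partitioner once, and persist,
    // so the subsequent join is planned as a narrow dependency (no further
    // shuffle of either side).
    val lstgItem = sc.parallelize(Seq((1L, "listing-1"), (2L, "listing-2")))
      .partitionBy(part).persist(StorageLevel.MEMORY_AND_DISK)
    val viEvents = sc.parallelize(Seq((1L, "view-a"), (2L, "view-b")))
      .partitionBy(part).persist(StorageLevel.MEMORY_AND_DISK)
    val joined = lstgItem.join(viEvents)
    println(joined.toDebugString)
    sc.stop()
  }
}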



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Sun, May 3, 2015 at 11:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Hello Deam,
 If I don;t use Kryo serializer i got Serialization error and hence am
 using it.
 If I don';t use partitionBy/reparition then the simply join never
 completed even after 7 hours and infact as next step i need to run it
 against 250G as that is my full dataset size. Someone here suggested to me
 to use repartition.

 Assuming reparition is mandatory , how do i decide whats the right number
 ? When i am using 400 i do not get NullPointerException that i talked
 about, which is strange. I never saw that exception against small random
 dataset but see it with 25G and again with 400 partitions , i do not see it.


 On Sun, May 3, 2015 at 9:15 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 IMHO, you are trying waaay to hard to optimize work on what is really a
 small data set. 25G, even 250G, is not that much data, especially if you've
 spent a month trying to get something to work that should be simple. All
 these errors are from optimization attempts.

 Kryo is great, but if it's not working reliably for some reason, then
 don't use it. Rather than force 200 partitions, let Spark try to figure out
 a good-enough number. (If you really need to force a partition count, use
 the repartition method instead, unless you're overriding the partitioner.)

 So. I recommend that you eliminate all the optimizations: Kryo,
 partitionBy, etc. Just use the simplest code you can. Make it work first.
 Then, if it really isn't fast enough, look for actual evidence of
 bottlenecks and optimize those.



 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Sun, May 3, 2015 at 10:22 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Hello Dean  Others,
 Thanks for your suggestions.
 I have two data sets and all i want to do is a simple equi join. I have
 10G limit and as my dataset_1 exceeded that it was throwing OOM error.
 Hence i switched back to use .join() API instead of map-side broadcast
 join.
 I am repartitioning the data with 100,200 and i see a
 NullPointerException now.

 When i run against 25G of each input and with .partitionBy(new
 org.apache.spark.HashPartitioner(200)) , I see NullPointerExveption


 this trace does not include a line from my code and hence i do not what
 is causing error ?
 I do have registered kryo serializer.

 val conf = new SparkConf()
   .setAppName(detail)
 *  .set(spark.serializer,
 org.apache.spark.serializer.KryoSerializer)*
   .set(spark.kryoserializer.buffer.mb,
 arguments.get(buffersize).get)
   .set(spark.kryoserializer.buffer.max.mb,
 arguments.get(maxbuffersize).get)
   .set(spark.driver.maxResultSize,
 arguments.get(maxResultSize).get)
   .set(spark.yarn.maxAppAttempts, 0)
 * 
 .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLeve*
 lMetricSum]))
 val sc = new SparkContext(conf)

 I see the exception when this task runs

 val viEvents = details.map { vi = (vi.get(14).asInstanceOf[Long], vi) }

 Its a simple mapping of input records to (itemId, record)

 I found this

 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
How big is the data you're returning to the driver with collectAsMap? You
are probably running out of memory trying to copy too much data back to it.

If you're trying to force a map-side join, Spark can do that for you in
some cases within the regular DataFrame/RDD context. See
http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning
and this talk by Michael Armbrust for example,
http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf.


dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com
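
As a sketch of the map-side join idea at the RDD level (toy data, placeholder
names; this only works when the small side truly fits in memory, which the 26G
listings set in this thread does not):

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join").setMaster("local[2]"))
    // The small side must genuinely fit on the driver and on every executor.
    val lstgItemMap = sc.parallelize(Seq((1L, "listing-1"), (2L, "listing-2"))).collectAsMap()
    val bcListings = sc.broadcast(lstgItemMap)
    val viEvents = sc.parallelize(Seq((1L, "view-a"), (1L, "view-b"), (3L, "view-c")))
    // Map-side join: each task looks keys up in the broadcast map, no shuffle.
    val joined = viEvents.flatMap { case (itemId, vi) =>
      bcListings.value.get(itemId).map(lstg => (itemId, (lstg, vi)))
    }
    joined.collect().foreach(println)
    sc.stop()
  }
}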

On Thu, Apr 30, 2015 at 12:40 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Full Exception
 *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at
 VISummaryDataProvider.scala:37) failed in 884.087 s*
 *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap
 at VISummaryDataProvider.scala:37, took 1093.418249 s*
 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]
 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage
 failure: Exception while getting task result:
 org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)])


 *Code at line 37*

 val lstgItemMap = listings.map { lstg = (lstg.getItemId().toLong, lstg) }
 .collectAsMap

 Listing data set size is 26G (10 files) and my driver memory is 12G (I
 cant go beyond it). The reason i do collectAsMap is to brodcast it and do a
 map-side join instead of regular join.


 Please suggest ?


 On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 My Spark Job is failing  and i see

 ==

 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]

 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)


 java.util.concurrent.TimeoutException: Futures timed out after [30
 seconds]


 I see 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
IMHO, you are trying waaay too hard to optimize work on what is really a
small data set. 25G, even 250G, is not that much data, especially if you've
spent a month trying to get something to work that should be simple. All
these errors are from optimization attempts.

Kryo is great, but if it's not working reliably for some reason, then don't
use it. Rather than force 200 partitions, let Spark try to figure out a
good-enough number. (If you really need to force a partition count, use the
repartition method instead, unless you're overriding the partitioner.)

So. I recommend that you eliminate all the optimizations: Kryo,
partitionBy, etc. Just use the simplest code you can. Make it work first.
Then, if it really isn't fast enough, look for actual evidence of
bottlenecks and optimize those.



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Sun, May 3, 2015 at 10:22 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Hello Dean  Others,
 Thanks for your suggestions.
 I have two data sets and all i want to do is a simple equi join. I have
 10G limit and as my dataset_1 exceeded that it was throwing OOM error.
 Hence i switched back to use .join() API instead of map-side broadcast
 join.
 I am repartitioning the data with 100,200 and i see a NullPointerException
 now.

 When i run against 25G of each input and with .partitionBy(new
 org.apache.spark.HashPartitioner(200)) , I see NullPointerExveption


 this trace does not include a line from my code and hence i do not what is
 causing error ?
 I do have registered kryo serializer.

 val conf = new SparkConf()
   .setAppName(detail)
 *  .set(spark.serializer,
 org.apache.spark.serializer.KryoSerializer)*
   .set(spark.kryoserializer.buffer.mb,
 arguments.get(buffersize).get)
   .set(spark.kryoserializer.buffer.max.mb,
 arguments.get(maxbuffersize).get)
   .set(spark.driver.maxResultSize,
 arguments.get(maxResultSize).get)
   .set(spark.yarn.maxAppAttempts, 0)
 * 
 .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLeve*
 lMetricSum]))
 val sc = new SparkContext(conf)

 I see the exception when this task runs

 val viEvents = details.map { vi = (vi.get(14).asInstanceOf[Long], vi) }

 Its a simple mapping of input records to (itemId, record)

 I found this

 http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist
 and

 http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html

 Looks like Kryo (2.21v)  changed something to stop using default
 constructors.

 (Kryo.DefaultInstantiatorStrategy) 
 kryo.getInstantiatorStrategy()).setFallbackInstantiatorStrategy(new 
 StdInstantiatorStrategy());


 Please suggest


 Trace:
 15/05/01 03:02:15 ERROR executor.Executor: Exception in task 110.1 in
 stage 2.0 (TID 774)
 com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
 Serialization trace:
 values (org.apache.avro.generic.GenericData$Record)
 datum (org.apache.avro.mapred.AvroKey)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:41)
 at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 Regards,


 Any suggestions.
 I am not able to get this thing to work over a month now, its kind of
 getting frustrating.

 On Sun, May 3, 2015 at 8:03 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 How big is the data you're returning to the driver with collectAsMap? You
 are probably running out of memory trying to copy too much data back to it.

 If you're trying to force a map-side join, Spark can do that for you in
 some cases within the regular DataFrame/RDD context. See
 http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning
 and this talk by Michael Armbrust for example,
 http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf.


 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 30, 2015 at 12:40 PM, ÐΞ€ρ@Ҝ 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread ๏̯͡๏
Hello Dean & Others,
Thanks for your suggestions.
I have two data sets, and all I want to do is a simple equi-join. I have a 10G
limit, and as my dataset_1 exceeded that, it was throwing an OOM error. Hence
I switched back to the .join() API instead of a map-side broadcast join.
I am repartitioning the data with 100 and 200 partitions, and I now see a
NullPointerException.

When I run against 25G of each input and with .partitionBy(new
org.apache.spark.HashPartitioner(200)), I see a NullPointerException.


The trace does not include a line from my code, hence I do not know what is
causing the error.
I do have the Kryo serializer registered.

val conf = new SparkConf()
  .setAppName("detail")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "0")
  .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
val sc = new SparkContext(conf)

I see the exception when this task runs:

val viEvents = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) }

It is a simple mapping of input records to (itemId, record).

I found this
http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist
and
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html

Looks like Kryo (v2.21) changed something and stopped using default
constructors.

((Kryo.DefaultInstantiatorStrategy) kryo.getInstantiatorStrategy())
    .setFallbackInstantiatorStrategy(new StdInstantiatorStrategy());


Please suggest


Trace:
15/05/01 03:02:15 ERROR executor.Executor: Exception in task 110.1 in stage
2.0 (TID 774)
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
values (org.apache.avro.generic.GenericData$Record)
datum (org.apache.avro.mapred.AvroKey)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:41)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
Regards,


Any suggestions?
I have not been able to get this to work for over a month now; it's getting
kind of frustrating.

On Sun, May 3, 2015 at 8:03 PM, Dean Wampler deanwamp...@gmail.com wrote:

 How big is the data you're returning to the driver with collectAsMap? You
 are probably running out of memory trying to copy too much data back to it.

 If you're trying to force a map-side join, Spark can do that for you in
 some cases within the regular DataFrame/RDD context. See
 http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning
 and this talk by Michael Armbrust for example,
 http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf.


 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Thu, Apr 30, 2015 at 12:40 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Full Exception
 *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at
 VISummaryDataProvider.scala:37) failed in 884.087 s*
 *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed:
 collectAsMap at VISummaryDataProvider.scala:37, took 1093.418249 s*
 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]
 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread ๏̯͡๏
Hello Dean,
If I don't use the Kryo serializer I get a serialization error, hence I am
using it.
If I don't use partitionBy/repartition then the simple join never completed
even after 7 hours, and in fact as the next step I need to run it against
250G, as that is my full dataset size. Someone here suggested that I use
repartition.

Assuming repartition is mandatory, how do I decide what the right number is?
When I am using 400 partitions I do not get the NullPointerException that I
talked about, which is strange. I never saw that exception against a small
random dataset, but I do see it with 25G; yet with 400 partitions I do not
see it.


On Sun, May 3, 2015 at 9:15 PM, Dean Wampler deanwamp...@gmail.com wrote:

 IMHO, you are trying waaay to hard to optimize work on what is really a
 small data set. 25G, even 250G, is not that much data, especially if you've
 spent a month trying to get something to work that should be simple. All
 these errors are from optimization attempts.

 Kryo is great, but if it's not working reliably for some reason, then
 don't use it. Rather than force 200 partitions, let Spark try to figure out
 a good-enough number. (If you really need to force a partition count, use
 the repartition method instead, unless you're overriding the partitioner.)

 So. I recommend that you eliminate all the optimizations: Kryo,
 partitionBy, etc. Just use the simplest code you can. Make it work first.
 Then, if it really isn't fast enough, look for actual evidence of
 bottlenecks and optimize those.



 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Sun, May 3, 2015 at 10:22 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Hello Dean  Others,
 Thanks for your suggestions.
 I have two data sets and all i want to do is a simple equi join. I have
 10G limit and as my dataset_1 exceeded that it was throwing OOM error.
 Hence i switched back to use .join() API instead of map-side broadcast
 join.
 I am repartitioning the data with 100,200 and i see a
 NullPointerException now.

 When i run against 25G of each input and with .partitionBy(new
 org.apache.spark.HashPartitioner(200)) , I see NullPointerExveption


 this trace does not include a line from my code and hence i do not what
 is causing error ?
 I do have registered kryo serializer.

 val conf = new SparkConf()
   .setAppName(detail)
 *  .set(spark.serializer,
 org.apache.spark.serializer.KryoSerializer)*
   .set(spark.kryoserializer.buffer.mb,
 arguments.get(buffersize).get)
   .set(spark.kryoserializer.buffer.max.mb,
 arguments.get(maxbuffersize).get)
   .set(spark.driver.maxResultSize,
 arguments.get(maxResultSize).get)
   .set(spark.yarn.maxAppAttempts, 0)
 * 
 .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLeve*
 lMetricSum]))
 val sc = new SparkContext(conf)

 I see the exception when this task runs

 val viEvents = details.map { vi = (vi.get(14).asInstanceOf[Long], vi) }

 Its a simple mapping of input records to (itemId, record)

 I found this

 http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist
 and

 http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html

 Looks like Kryo (2.21v)  changed something to stop using default
 constructors.

 (Kryo.DefaultInstantiatorStrategy) 
 kryo.getInstantiatorStrategy()).setFallbackInstantiatorStrategy(new 
 StdInstantiatorStrategy());


 Please suggest


 Trace:
 15/05/01 03:02:15 ERROR executor.Executor: Exception in task 110.1 in
 stage 2.0 (TID 774)
 com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
 Serialization trace:
 values (org.apache.avro.generic.GenericData$Record)
 datum (org.apache.avro.mapred.AvroKey)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:41)
 at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 Regards,


 Any suggestions?
 I have not been able to get this to work for over a month now, and it's
 getting frustrating.

 On Sun, May 3, 2015 at 8:03 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 How big is the data you're returning to the driver with collectAsMap?
 You are probably running out of memory trying to 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-02 Thread Akhil Das
You could try repartitioning your listings RDD. Also, doing a collectAsMap
basically brings all of your data to the driver; in that case you might want
to set the storage level to MEMORY_AND_DISK, though I'm not sure that will
help much on the driver.
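As a rough sketch of that suggestion (the RDD name comes from this thread; the
partition count is an arbitrary placeholder, not a recommendation):

import org.apache.spark.storage.StorageLevel

// Spread the listings RDD over more partitions and let blocks spill to disk
// rather than holding everything in memory.
val listingsRepartitioned = listings.repartition(400) // tune for your data size
listingsRepartitioned.persist(StorageLevel.MEMORY_AND_DISK)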

Thanks
Best Regards

On Thu, Apr 30, 2015 at 11:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 Full Exception
 *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at
 VISummaryDataProvider.scala:37) failed in 884.087 s*
 *15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap
 at VISummaryDataProvider.scala:37, took 1093.418249 s*
 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]
 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: Job aborted due to stage
 failure: Exception while getting task result:
 org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)])


 *Code at line 37*

 val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }
   .collectAsMap

 The listing data set size is 26G (10 files) and my driver memory is 12G (I
 can't go beyond it). The reason I do collectAsMap is to broadcast the map and
 do a map-side join instead of a regular join.
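 For context, a sketch of the two approaches side by side (viEvents and
 listings are the RDDs named elsewhere in this thread; the code is
 illustrative, not the original):

 // Map-side (broadcast) join: collectAsMap materializes the entire 26G
 // listings data set on the 12G driver before it can be broadcast, which is
 // why this OOMs. It is only viable when the collected side is small.
 val lstgItemMap = sc.broadcast(
   listings.map { lstg => (lstg.getItemId().toLong, lstg) }.collectAsMap())
 val joined = viEvents.flatMap { case (itemId, vi) =>
   lstgItemMap.value.get(itemId).map(lstg => (itemId, (vi, lstg)))
 }

 // Regular shuffle join: both sides stay distributed and nothing is funneled
 // through the driver.
 // val joined = viEvents.join(listings.map(l => (l.getItemId().toLong, l)))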


 Please suggest.


 On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 My Spark job is failing and I see:

 ==

 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]

 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)


 java.util.concurrent.TimeoutException: Futures timed out after [30
 seconds]


 I see multiple of these

 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]

 And finally I see this:
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at
 org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:95)
 at
 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread Akhil Das
You could try increasing your heap space explicitly, e.g. export
_JAVA_OPTIONS=-Xmx10g. It's not the correct approach, but give it a try.

Thanks
Best Regards
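The more conventional way to raise heap sizes is through Spark's own memory
settings rather than a global JVM flag; a sketch (the 10g figure is Akhil's
suggestion, not a recommendation):

import org.apache.spark.SparkConf

// Prefer Spark's memory settings over _JAVA_OPTIONS. In client mode,
// spark.driver.memory must be set before the driver JVM starts (e.g. via
// spark-submit --driver-memory 10g), so setting it in SparkConf only takes
// effect in cluster mode.
val conf = new SparkConf()
  .set("spark.executor.memory", "10g")
  .set("spark.driver.memory", "10g")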

On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I have a Spark app that completes in 45 mins for 5 files (5*750MB
 each) and it takes 16 executors to do so.

 I wanted to run it against 10 files of each input type (10*3 files, as
 there are three inputs that are transformed) [Input1 = 10*750 MB,
 Input2 = 10*2.5GB, Input3 = 10*1.5G], hence I used 32 executors.

 I see multiple
 5/04/28 09:23:31 WARN executor.Executor: Issue communicating with driver
 in heartbeater
 org.apache.spark.SparkException: Error sending message [message =
 Heartbeat(22,[Lscala.Tuple2;@2e4c404a,BlockManagerId(22,
 phxaishdc9dn1048.stratus.phx.ebay.com, 39505))]
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
 ... 1 more


 When I searched deeper, I found an OOM error.
 15/04/28 09:10:15 INFO storage.BlockManagerMasterActor: Removing block
 manager BlockManagerId(17, phxdpehdc9dn2643.stratus.phx.ebay.com, 36819)
 15/04/28 09:11:26 WARN storage.BlockManagerMasterActor: Removing
 BlockManager BlockManagerId(9, phxaishdc9dn1783.stratus.phx.ebay.com,
 48304) with no recent heart beats: 121200ms exceeds 12ms
 15/04/28 09:11:26 INFO storage.BlockManagerMasterActor: Removing block
 manager BlockManagerId(9, phxaishdc9dn1783.stratus.phx.ebay.com, 48304)
 15/04/28 09:11:26 ERROR util.Utils: Uncaught exception in thread
 task-result-getter-3
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2245)
 at java.util.Arrays.copyOf(Arrays.java:2219)
 at java.util.ArrayList.grow(ArrayList.java:242)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
 at java.util.ArrayList.add(ArrayList.java:440)
 at
 com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:33)
 at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:766)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
 at
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2245)
 at java.util.Arrays.copyOf(Arrays.java:2219)
 at java.util.ArrayList.grow(ArrayList.java:242)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
 at java.util.ArrayList.add(ArrayList.java:440)
 at
 com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:33)
 at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:766)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
 at
 

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread ๏̯͡๏
Did not work. Same problem.



On Thu, Apr 30, 2015 at 1:28 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 You could try increasing your heap space explicitly, e.g. export
 _JAVA_OPTIONS=-Xmx10g. It's not the correct approach, but give it a try.

 Thanks
 Best Regards

 On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 I have a Spark app that completes in 45 mins for 5 files (5*750MB
 each) and it takes 16 executors to do so.

 I wanted to run it against 10 files of each input type (10*3 files, as
 there are three inputs that are transformed) [Input1 = 10*750 MB,
 Input2 = 10*2.5GB, Input3 = 10*1.5G], hence I used 32 executors.

 I see multiple
 5/04/28 09:23:31 WARN executor.Executor: Issue communicating with driver
 in heartbeater
 org.apache.spark.SparkException: Error sending message [message =
 Heartbeat(22,[Lscala.Tuple2;@2e4c404a,BlockManagerId(22,
 phxaishdc9dn1048.stratus.phx.ebay.com, 39505))]
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
 ... 1 more


 When I searched deeper, I found an OOM error.
 15/04/28 09:10:15 INFO storage.BlockManagerMasterActor: Removing block
 manager BlockManagerId(17, phxdpehdc9dn2643.stratus.phx.ebay.com, 36819)
 15/04/28 09:11:26 WARN storage.BlockManagerMasterActor: Removing
 BlockManager BlockManagerId(9, phxaishdc9dn1783.stratus.phx.ebay.com,
 48304) with no recent heart beats: 121200ms exceeds 12ms
 15/04/28 09:11:26 INFO storage.BlockManagerMasterActor: Removing block
 manager BlockManagerId(9, phxaishdc9dn1783.stratus.phx.ebay.com, 48304)
 15/04/28 09:11:26 ERROR util.Utils: Uncaught exception in thread
 task-result-getter-3
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2245)
 at java.util.Arrays.copyOf(Arrays.java:2219)
 at java.util.ArrayList.grow(ArrayList.java:242)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
 at java.util.ArrayList.add(ArrayList.java:440)
 at
 com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:33)
 at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:766)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 at
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
 at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 at
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
 at
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread task-result-getter-3 java.lang.OutOfMemoryError:
 Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2245)
 at java.util.Arrays.copyOf(Arrays.java:2219)
 at java.util.ArrayList.grow(ArrayList.java:242)
 at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
 at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
 at java.util.ArrayList.add(ArrayList.java:440)
 at
 com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:33)
 at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:766)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread ๏̯͡๏
Full Exception
*15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at
VISummaryDataProvider.scala:37) failed in 884.087 s*
*15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap
at VISummaryDataProvider.scala:37, took 1093.418249 s*
15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw exception:
Job aborted due to stage failure: Exception while getting task result:
org.apache.spark.SparkException: Error sending message [message =
GetLocations(taskresult_112)]
org.apache.spark.SparkException: Job aborted due to stage failure:
Exception while getting task result: org.apache.spark.SparkException: Error
sending message [message = GetLocations(taskresult_112)]
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception: Job aborted due to stage
failure: Exception while getting task result:
org.apache.spark.SparkException: Error sending message [message =
GetLocations(taskresult_112)])


*Code at line 37*

val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }
  .collectAsMap

The listing data set size is 26G (10 files) and my driver memory is 12G (I
can't go beyond it). The reason I do collectAsMap is to broadcast the map and
do a map-side join instead of a regular join.


Please suggest.


On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 My Spark job is failing and I see:

 ==

 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
 exception: Job aborted due to stage failure: Exception while getting task
 result: org.apache.spark.SparkException: Error sending message [message =
 GetLocations(taskresult_112)]

 org.apache.spark.SparkException: Job aborted due to stage failure:
 Exception while getting task result: org.apache.spark.SparkException: Error
 sending message [message = GetLocations(taskresult_112)]

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)

 at scala.Option.foreach(Option.scala:236)

 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)


 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]


 I see multiple of these

 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]

 And finally I see this:
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at
 org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:95)
 at
 org.apache.spark.network.shuffle.RetryingBlockFetcher$RetryingBlockFetchListener.onBlockFetchSuccess(RetryingBlockFetcher.java:206)
 at
 org.apache.spark.network.shuffle.OneForOneBlockFetcher$ChunkCallback.onSuccess(OneForOneBlockFetcher.java:72)
 at
 org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:124)
 at
 org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93)
 at
 

Spark - Timeout Issues - OutOfMemoryError

2015-04-28 Thread ๏̯͡๏
I have a Spark app that completes in 45 mins for 5 files (5*750MB each)
and it takes 16 executors to do so.

I wanted to run it against 10 files of each input type (10*3 files, as there
are three inputs that are transformed) [Input1 = 10*750 MB,
Input2 = 10*2.5GB, Input3 = 10*1.5G], hence I used 32 executors.

I see multiple
5/04/28 09:23:31 WARN executor.Executor: Issue communicating with driver in
heartbeater
org.apache.spark.SparkException: Error sending message [message =
Heartbeat(22,[Lscala.Tuple2;@2e4c404a,BlockManagerId(22,
phxaishdc9dn1048.stratus.phx.ebay.com, 39505))]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
... 1 more


When I searched deeper, I found an OOM error.
15/04/28 09:10:15 INFO storage.BlockManagerMasterActor: Removing block
manager BlockManagerId(17, phxdpehdc9dn2643.stratus.phx.ebay.com, 36819)
15/04/28 09:11:26 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(9, phxaishdc9dn1783.stratus.phx.ebay.com,
48304) with no recent heart beats: 121200ms exceeds 12ms
15/04/28 09:11:26 INFO storage.BlockManagerMasterActor: Removing block
manager BlockManagerId(9, phxaishdc9dn1783.stratus.phx.ebay.com, 48304)
15/04/28 09:11:26 ERROR util.Utils: Uncaught exception in thread
task-result-getter-3
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2245)
at java.util.Arrays.copyOf(Arrays.java:2219)
at java.util.ArrayList.grow(ArrayList.java:242)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
at java.util.ArrayList.add(ArrayList.java:440)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:33)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:766)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at
org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
at
org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Exception in thread task-result-getter-3 java.lang.OutOfMemoryError: Java
heap space
at java.util.Arrays.copyOf(Arrays.java:2245)
at java.util.Arrays.copyOf(Arrays.java:2219)
at java.util.ArrayList.grow(ArrayList.java:242)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
at java.util.ArrayList.add(ArrayList.java:440)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.nextReadId(MapReferenceResolver.java:33)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:766)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:727)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at
