Dear all, 

Unfortunately I haven't received any response on the users forum, so I
decided to post this question here.
We are running into failing jobs when the amount of data gets large. For
example, an application works perfectly with relatively small data, but
when the data roughly doubles, the application fails.

A simple local test was prepared for this question at
https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
It generates two sets of key-value pairs, joins them, selects the distinct
values and finally counts them.

import org.apache.spark.{SparkConf, SparkContext}

object Spill {
  // 10 keys, 200 values per key: 2000 (Int, Int) pairs
  def generate = {
    for {
      j <- 1 to 10
      i <- 1 to 200
    } yield (j, i)
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName(getClass.getSimpleName)
    conf.set("spark.shuffle.spill", "true")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    println(generate)

    val dataA = sc.parallelize(generate)
    val dataB = sc.parallelize(generate)
    val dst = dataA.join(dataB).distinct().count()
    println(dst)
  }
}

We compiled it locally and ran it 3 times with different memory settings
(an example spark-submit invocation is shown after the three cases):
1) --executor-memory 10M --driver-memory 10M --num-executors 1
--executor-cores 1
It fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded" at
.....
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
 

2) --executor-memory 20M --driver-memory 20M --num-executors 1
--executor-cores 1
It works OK 

3) --executor-memory 10M --driver-memory 10M --num-executors 1
--executor-cores 1, but with less data: i goes from 200 down to 100. This
halves the input data and reduces the joined data by a factor of 4:

  def generate = { 
    for{ 
      j <- 1 to 10 
      i <- 1 to 100   // previous value was 200 
    } yield(j, i) 
  } 
This code works OK. 
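
For reference, the runs above were launched with spark-submit along these
lines; the --class value and the jar name below are just placeholders, and
only the memory flags changed between the three runs:

spark-submit \
  --class Spill \
  --executor-memory 10M \
  --driver-memory 10M \
  --num-executors 1 \
  --executor-cores 1 \
  spill-test.jar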

We don't understand why 10M is not enough for such a simple operation on
roughly 32,000 bytes of ints (2 * 10 * 200 * 2 * 4). 10M of RAM does work
if we halve the data volume (2000 records of (Int, Int) in total).
Why doesn't spilling to disk cover this case?
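
To make the numbers above concrete, here is the back-of-the-envelope
estimate we mean (a rough sketch only: it counts raw Int payload and
ignores JVM object, tuple and serialization overhead; the object name
SizeEstimate is just for illustration):

object SizeEstimate {
  def main(args: Array[String]): Unit = {
    val rdds          = 2     // dataA and dataB
    val keys          = 10    // j <- 1 to 10
    val valuesPerKey  = 200   // i <- 1 to 200
    val intsPerRecord = 2     // each record is an (Int, Int) pair
    val bytesPerInt   = 4

    // Raw input payload: 2 * 10 * 200 * 2 * 4 = 32,000 bytes
    val inputBytes = rdds * keys * valuesPerKey * intsPerRecord * bytesPerInt
    println(s"raw input payload: $inputBytes bytes")

    // Joined records: for each of the 10 keys, every value on the A side
    // pairs with every value on the B side, which is why halving i
    // (200 -> 100) shrinks the joined data by a factor of 4.
    val joinedRecords = keys * valuesPerKey * valuesPerKey  // 400,000
    println(s"joined records: $joinedRecords")
  }
}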


