10M is tiny compared to all of the overhead of running a fairly complex
Scala-based app in a JVM. I think you may be bumping up against practical
minimum heap sizes, and that the limit you are hitting is not really the
data size. I don't think it really scales down this far.
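
If you want to see how much of that is just baseline, a rough probe along
these lines can be illustrative (it only reads the JVM's own heap counters,
so treat the number as a ballpark, not an exact accounting):

import org.apache.spark.{SparkConf, SparkContext}

object BaselineHeap {
  def main(args: Array[String]): Unit = {
    // Start a local SparkContext that does no work at all, then inspect the heap.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("BaselineHeap"))
    System.gc() // best-effort hint so used heap roughly reflects live objects
    val rt = Runtime.getRuntime
    val usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
    println(s"Heap in use with an idle SparkContext: ~$usedMb MB")
    sc.stop()
  }
}
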
On Nov 22, 2014 2:24 PM, "rzykov" <rzy...@gmail.com> wrote:

> Dear all,
>
> Unfortunately I have not got any response on the users forum, so I decided
> to post this question here.
> We have run into failed jobs once the amount of data gets large. For
> example, an application works perfectly with relatively small data, but
> when the data grows by a factor of 2, the application fails.
>
> A simple local test was prepared for this question at
> https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
> It generates 2 sets of key-value pairs, joins them, selects distinct values
> and finally counts the data.
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> object Spill {
>   // 10 keys x 200 values per key => 2000 (Int, Int) pairs
>   def generate = {
>     for {
>       j <- 1 to 10
>       i <- 1 to 200
>     } yield (j, i)
>   }
>
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName(getClass.getSimpleName)
>     conf.set("spark.shuffle.spill", "true")
>     conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     val sc = new SparkContext(conf)
>     println(generate)
>
>     val dataA = sc.parallelize(generate)
>     val dataB = sc.parallelize(generate)
>     // join on the key, deduplicate and count the joined pairs
>     val dst = dataA.join(dataB).distinct().count()
>     println(dst)
>   }
> }
>
> We compiled it locally and ran it 3 times with different memory settings:
> 1) --executor-memory 10M --driver-memory 10M --num-executors 1
> --executor-cores 1
> It fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded" at
> .....
>
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
>
> 2) --executor-memory 20M --driver-memory 20M --num-executors 1
> --executor-cores 1
> It works OK
>
> 3) --executor-memory 10M --driver-memory 10M --num-executors 1
> --executor-cores 1, but with less data: i now ranges from 1 to 100 instead
> of 200. This halves the input data and shrinks the joined data by a factor
> of 4:
>
>   def generate = {
>     for {
>       j <- 1 to 10
>       i <- 1 to 100   // previous value was 200
>     } yield (j, i)
>   }
> This code works OK.
>
> We don't understand why 10M is not enough for such a simple operation on
> approximately 32000 bytes of ints (2 * 10 * 200 * 2 * 4). 10M of RAM does
> work if we halve the data volume (2000 records of (int, int)). Why doesn't
> spilling to disk cover this case?
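>
> For scale, here is a rough back-of-envelope sketch of what the join above
> produces (the per-record heap figure is only a very rough estimate for a
> 64-bit JVM, not a measurement):
>
> // Pure arithmetic from generate(): 10 keys, 200 values per key on each side
> val inputRecords  = 2 * 10 * 200                // 4000 (Int, Int) pairs in total
> val rawInputBytes = inputRecords * 2 * 4        // ~32 KB of raw int payload
> val joinedRecords = 10 * 200 * 200              // 400,000 (Int, (Int, Int)) tuples
> // On the heap each joined record is nested Tuple2 objects holding boxed
> // Integers, very roughly ~100 bytes per record rather than 12 bytes of payload:
> val joinedHeapEst = joinedRecords.toLong * 100  // on the order of tens of MB
> println(s"raw input ~ $rawInputBytes B, joined heap estimate ~ $joinedHeapEst B")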
>
>
>
