10M is tiny compared to all of the overhead of running a complex Scala-based app in a JVM. I think you may be bumping up against practical minimum heap sizes, and you may find the problem is not really the data size; I don't think Spark really scales down this far. The two sketches below the quoted message put rough numbers on this.

On Nov 22, 2014 2:24 PM, "rzykov" <rzy...@gmail.com> wrote:
> Dear all,
>
> Unfortunately I've not got any response in the users forum. That's why I
> decided to publish this question here.
>
> We encountered failing jobs with large amounts of data. For example, an
> application works perfectly with relatively small data, but when the data
> doubles the application fails.
>
> A simple local test was prepared for this question at
> https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
> It generates two sets of key-value pairs, joins them, selects the distinct
> values and finally counts them.
>
> import org.apache.spark.{SparkConf, SparkContext}
>
> object Spill {
>   def generate = {
>     for {
>       j <- 1 to 10
>       i <- 1 to 200
>     } yield (j, i)
>   }
>
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName(getClass.getSimpleName)
>     conf.set("spark.shuffle.spill", "true")
>     conf.set("spark.serializer",
>       "org.apache.spark.serializer.KryoSerializer")
>     val sc = new SparkContext(conf)
>     println(generate)
>
>     val dataA = sc.parallelize(generate)
>     val dataB = sc.parallelize(generate)
>     val dst = dataA.join(dataB).distinct().count()
>     println(dst)
>   }
> }
>
> We compiled it locally and ran it 3 times with different memory settings:
>
> 1) --executor-memory 10M --driver-memory 10M --num-executors 1
> --executor-cores 1
> It fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded" at
> .....
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
>
> 2) --executor-memory 20M --driver-memory 20M --num-executors 1
> --executor-cores 1
> It works OK.
>
> 3) --executor-memory 10M --driver-memory 10M --num-executors 1
> --executor-cores 1, but with less data: i now runs from 1 to 100 instead
> of 200, which halves the input data and reduces the joined data fourfold:
>
> def generate = {
>   for {
>     j <- 1 to 10
>     i <- 1 to 100 // previous value was 200
>   } yield (j, i)
> }
>
> This code works OK.
>
> We don't understand why 10M is not enough for such a simple operation on
> approximately 32000 bytes of ints (2 * 10 * 200 * 2 * 4). 10M of RAM does
> work if we halve the data volume (2000 records of (int, int)). Why doesn't
> spilling to disk cover this case?
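First sketch: the 32000-byte estimate counts only the input pairs. join() emits the
full cross product per key: every key j has 200 values on each side, so the joined RDD
holds 10 * 200 * 200 = 400,000 (Int, (Int, Int)) records before distinct() even runs,
and as boxed JVM tuples those cost tens of megabytes, not kilobytes. The 64 bytes per
record below is an assumed rough cost for a boxed tuple, not a measured value:

// Why the joined data dwarfs the 32 KB input estimate.
object JoinBlowup {
  def main(args: Array[String]): Unit = {
    val keys = 10
    val valuesPerKey = 200                         // 100 in the passing run
    // join() pairs every left value with every right value for each key
    val joined = keys * valuesPerKey * valuesPerKey
    println(s"joined records: $joined")            // 400000 (100000 when i <= 100)
    val bytesPerRecord = 64L                       // assumed boxed-tuple overhead
    println(s"rough heap cost: ${joined * bytesPerRecord / (1024 * 1024)} MB") // ~24 MB
  }
}

Note that dropping i to 100 cuts the joined set fourfold (100,000 records), which matches
the sensitivity you observed exactly: it is the join output, not the input, whose size matters.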
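Second sketch: the map in your stack trace only gets a slice of the heap to work with.
Assuming the Spark 1.x defaults spark.shuffle.memoryFraction = 0.2 and
spark.shuffle.safetyFraction = 0.8 (these defaults are my assumption; check the docs for
your version), a 10M executor leaves ExternalAppendOnlyMap roughly 1.6 MB before it spills:

// Approximate in-memory budget for ExternalAppendOnlyMap before spilling,
// under the assumed Spark 1.x defaults noted above.
object ShuffleBudget {
  def main(args: Array[String]): Unit = {
    val executorMemoryBytes = 10L * 1024 * 1024   // --executor-memory 10M
    val memoryFraction = 0.2                      // spark.shuffle.memoryFraction (assumed default)
    val safetyFraction = 0.8                      // spark.shuffle.safetyFraction (assumed default)
    val budget = (executorMemoryBytes * memoryFraction * safetyFraction).toLong
    println(s"shuffle budget: ${budget / 1024} KB")  // ~1638 KB of a 10240 KB heap
  }
}

Everything else (Spark internals, network buffers, loaded classes) competes for the rest of
that 10M heap, so "GC overhead limit exceeded" means the collector is thrashing over a heap
dominated by fixed overhead. Spilling only bounds the map's own entries; it can't help with that.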