The split is roughly 30 million items into 2 million partitions. It becomes tractable because, after I perform the cartesian on the split data and operate on it, I don't keep the full results. I keep only a tiny fraction of the generated dataset, which makes the overall dataset tractable (I neglected to mention this in the first email).
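
A minimal sketch of the batch-and-filter idea described above, written in plain Python rather than Spark so that it runs without a cluster. The names `batched`, `cross_and_filter`, and the `keep` predicate are hypothetical, and the small ranges are stand-ins for the two RDDs. The only point it illustrates is that each batch's cross product is filtered before the next batch is generated, so the full product is never held at once.

```python
from itertools import islice, product


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def cross_and_filter(left, right, batch_size, keep):
    """Cross each batch of `left` with all of `right`; retain only pairs passing `keep`."""
    kept = []
    for batch in batched(left, batch_size):
        # At most batch_size * len(right) pairs are generated per step,
        # and they are filtered lazily as they are produced.
        kept.extend(pair for pair in product(batch, right) if keep(pair))
    return kept


# Stand-in data (assumption): small ranges instead of very large RDDs.
left = range(10)
right = range(10)

# Hypothetical filter that keeps a small fraction of the 100 possible pairs.
result = cross_and_filter(left, right, batch_size=3, keep=lambda p: p[0] + p[1] == 9)
print(len(result), "pairs kept out of", len(left) * len(right))
```

The batch size here plays the role of the split size in the thread: larger batches mean fewer rounds, at the cost of more intermediate pairs per round.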
The way the code is structured, I have forced linear execution up to this point, so when the split executes it is the only thing happening. In terms of memory, I have assigned 23 GB of memory and 17 GB of heap.

On Oct 30, 2014 3:32 AM, "Sean Owen" <so...@cloudera.com> wrote:

> Can you be more specific about numbers?
> I am not sure that splitting helps so much in the end, in that it has
> the same effect as executing a smaller number at a time of the large
> number of tasks that the full cartesian join would generate.
> The full join is probably intractable no matter what in this case?
> The OOM is not necessarily directly related. It depends on where it
> happened, what else you are doing, how much memory you gave, etc.
>
> On Thu, Oct 30, 2014 at 3:29 AM, Ganelin, Ilya
> <ilya.gane...@capitalone.com> wrote:
> > Hey all – not writing to necessarily get a fix but more to get an
> > understanding of what’s going on internally here.
> >
> > I wish to take a cross-product of two very large RDDs (using cartesian),
> > the product of which is well in excess of what can be stored on disk.
> > Clearly that is intractable, thus my solution is to do things in batches -
> > essentially I can take the cross product of a small piece of the first
> > data set with the entirety of the other. To do this, I calculate how many
> > items can fit into 1 gig of memory. Next, I use RDD.randomSplit() to
> > partition the first data set. The issue is that I am trying to partition
> > an RDD of several million items into several million partitions. This
> > throws the following error:
> >
> > I would like to understand the internals of what’s going on here so that
> > I can adjust my approach accordingly. Thanks in advance.
> >
> > 14/10/29 22:17:44 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem [sparkDriver]
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >     at com.google.protobuf_spark.ByteString.toByteArray(ByteString.java:213)
> >     at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:24)
> >     at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:55)
> >     at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:55)
> >     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:73)
> >     at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:764)
> >     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> >     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> >     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> >     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> >     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> >     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> >     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
> >     at java.util.Arrays.copyOfRange(Arrays.java:2694)
> >     at java.lang.String.<init>(String.java:203)
> >     at java.lang.String.substring(String.java:1913)
> >     at java.lang.String.subSequence(String.java:1946)
> >     at java.util.regex.Matcher.getSubSequence(Matcher.java:1245)
> >     at java.util.regex.Matcher.group(Matcher.java:490)
> >     at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2675)
> >     at java.util.Formatter.parse(Formatter.java:2528)
> >     at java.util.Formatter.format(Formatter.java:2469)
> >     at java.util.Formatter.format(Formatter.java:2423)
> >     at java.lang.String.format(String.java:2790)
> >     at scala.collection.immutable.StringLike$class.format(StringLike.scala:266)
> >     at scala.collection.immutable.StringOps.format(StringOps.scala:31)
> >     at org.apache.spark.util.Utils$.getCallSite(Utils.scala:944)
> >     at org.apache.spark.rdd.RDD.<init>(RDD.scala:1227)
> >     at org.apache.spark.rdd.RDD.<init>(RDD.scala:83)
> >     at org.apache.spark.rdd.PartitionwiseSampledRDD.<init>(PartitionwiseSampledRDD.scala:47)
> >     at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:378)
> >     at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:377)
> >     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> >     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> >     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> >     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> >     at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> >     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> >     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> >     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> >     at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> >     at org.apache.spark.rdd.RDD.randomSplit(RDD.scala:379)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
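
The sizing step described in the quoted message (how many items fit in 1 GB, and how many splits that implies) can be sketched as plain arithmetic. This is only an illustration: `plan_batches` is a hypothetical helper, and the 1,000 bytes-per-item figure is an assumption, since the thread does not state an item size. The second trace passes through `PartitionwiseSampledRDD.<init>` inside `randomSplit`, which suggests each requested split costs driver-side work, so the number of splits matters independently of the data size.

```python
def plan_batches(total_items, bytes_per_item, budget_bytes):
    """Return (items_per_batch, num_batches) for a given memory budget."""
    items_per_batch = max(1, budget_bytes // bytes_per_item)
    # Integer ceiling division avoids floating-point rounding on large counts.
    num_batches = -(-total_items // items_per_batch)
    return items_per_batch, num_batches


TOTAL_ITEMS = 30_000_000       # figure from the thread
BUDGET_BYTES = 1 << 30         # "1 gig of memory" from the thread
ASSUMED_ITEM_BYTES = 1_000     # assumption: not given in the thread

items_per_batch, num_batches = plan_batches(TOTAL_ITEMS, ASSUMED_ITEM_BYTES, BUDGET_BYTES)
print(items_per_batch, "items per batch,", num_batches, "batches")

# For comparison, the split reported in the thread: 2 million splits of 30 million items.
items_per_reported_split = TOTAL_ITEMS // 2_000_000
print(items_per_reported_split, "items per split at 2 million splits")
```

Under the assumed item size the budget yields a few dozen batches, whereas 2 million splits works out to only 15 items each; the gap between those two numbers is what the sketch is meant to make visible.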