Re: GC Issues with randomSplit on large dataset

Ilya Ganelin Thu, 30 Oct 2014 05:25:17 -0700

The split is something like 30 million into 2 milion partitions. The reason
that it becomes tractable is that after I perform the Cartesian on the
split data and operate on it I don't keep the full results - I actually
only keep a tiny fraction of that generated dataset - making the overall
dataset tractable ( I neglected to mention this in the first email).


The way the code is structured I have forced linear execution until this
point so at the time of execution of the split it is the only thing
happening. In terms of memory I have assigned 23gb of memory and 17gb of
heap.
On Oct 30, 2014 3:32 AM, "Sean Owen" <so...@cloudera.com> wrote:

> Can you be more specific about numbers?
> I am not sure that splitting helps so much in the end, in that it has
> the same effect as executing a smaller number at a time of the large
> number of tasks that the full cartesian join would generate.
> The full join is probably intractable no matter what in this case?
> The OOM is not necessarily directly related. It depends on where it
> happened, what else you are doing, how much memory you gave, etc.
>
> On Thu, Oct 30, 2014 at 3:29 AM, Ganelin, Ilya
> <ilya.gane...@capitalone.com> wrote:
> > Hey all – not writing to necessarily get a fix but more to get an
> > understanding of what’s going on internally here.
> >
> > I wish to take a cross-product of two very large RDDs (using cartesian),
> the
> > product of which is well in excess of what can be stored on disk .
> Clearly
> > that is intractable, thus my solution is to do things in batches -
> > essentially I can take the cross product of a small piece of the first
> data
> > set with the entirety of the other. To do this, I calculate how many
> items
> > can fit into 1 gig of memory. Next, I use RDD.random Split() to partition
> > the first data set. The issue is that I am trying to partition an RDD of
> > several million items into several million partitions. This throws the
> > following error:
> >
> > I would like to understand the internals of what’s going on here so that
> I
> > can adjust my approach accordingly. Thanks in advance.
> >
> >
> > 14/10/29 22:17:44 ERROR ActorSystemImpl: Uncaught fatal error from thread
> > [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem
> > [sparkDriver]
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at com.google.protobuf_spark.ByteString.toByteArray(ByteString.java:213)
> > at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:24)
> > at
> >
> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:55)
> > at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:55)
> > at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:73)
> > at
> >
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:764)
> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> > at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> > at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> > at
> >
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > at
> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > at
> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> > exceeded
> > at java.util.Arrays.copyOfRange(Arrays.java:2694)
> > at java.lang.String.<init>(String.java:203)
> > at java.lang.String.substring(String.java:1913)
> > at java.lang.String.subSequence(String.java:1946)
> > at java.util.regex.Matcher.getSubSequence(Matcher.java:1245)
> > at java.util.regex.Matcher.group(Matcher.java:490)
> > at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2675)
> > at java.util.Formatter.parse(Formatter.java:2528)
> > at java.util.Formatter.format(Formatter.java:2469)
> > at java.util.Formatter.format(Formatter.java:2423)
> > at java.lang.String.format(String.java:2790)
> > at
> scala.collection.immutable.StringLike$class.format(StringLike.scala:266)
> > at scala.collection.immutable.StringOps.format(StringOps.scala:31)
> > at org.apache.spark.util.Utils$.getCallSite(Utils.scala:944)
> > at org.apache.spark.rdd.RDD.<init>(RDD.scala:1227)
> > at org.apache.spark.rdd.RDD.<init>(RDD.scala:83)
> > at
> >
> org.apache.spark.rdd.PartitionwiseSampledRDD.<init>(PartitionwiseSampledRDD.scala:47)
> > at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:378)
> > at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:377)
> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> > at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> > at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> > at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> > at
> >
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> > at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> > at org.apache.spark.rdd.RDD.randomSplit(RDD.scala:379)
> >
> >
> >
> > ________________________________
> >
> > The information contained in this e-mail is confidential and/or
> proprietary
> > to Capital One and/or its affiliates. The information transmitted
> herewith
> > is intended only for use by the individual or entity to which it is
> > addressed.  If the reader of this message is not the intended recipient,
> you
> > are hereby notified that any review, retransmission, dissemination,
> > distribution, copying or other use of, or taking of any action in
> reliance
> > upon this information is strictly prohibited. If you have received this
> > communication in error, please contact the sender and delete the material
> > from your computer.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: GC Issues with randomSplit on large dataset

Reply via email to