GC overhead limit exceeded is usually a sign of either an inadequate heap size
(not the case here) or the application producing garbage (temporary objects)
faster than the garbage collector can reclaim it, so GC ends up consuming most
of the CPU cycles. 17G of Java heap is quite large for many applications and
is above the "safe and recommended" range.
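
Not from the thread, but as a minimal sketch of how one might confirm that
diagnosis, assuming the 17g executor heap mentioned above; the GC logging
flags are standard HotSpot options from the Java 8 era (Java 9+ uses
-Xlog:gc instead):

    import org.apache.spark.{SparkConf, SparkContext}

    // Surface GC activity in the executor logs to see whether GC really is
    // eating most CPU cycles. App name and heap size are placeholders.
    val conf = new SparkConf()
      .setAppName("cartesian-gc-debug")
      .set("spark.executor.memory", "17g")
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)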
The split is something like 30 million into 2 million partitions. The reason
it becomes tractable is that after I perform the cartesian on the split data
and operate on it, I don't keep the full results - I actually keep only a
tiny fraction of the generated dataset - making the overall dataset
manageable.
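
As a rough sketch of that split-then-cartesian approach (my reading of the
description, not the poster's actual code; rddA, rddB, numChunks, the keep
predicate, and the output path are all placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("split-cartesian"))
    val rddA: RDD[Long] = sc.range(0L, 30000000L)  // stand-in for the ~30M-element side
    val rddB: RDD[Long] = sc.range(0L, 30000000L)

    // Stand-in for whatever condition selects the tiny fraction that is kept.
    def keep(pair: (Long, Long)): Boolean = (pair._1 + pair._2) % 1000000 == 0

    val numChunks = 100
    (0 until numChunks).foreach { i =>
      val chunk = rddA.filter(x => x % numChunks == i)  // deterministic split of rddA
      chunk.cartesian(rddB)
        .filter(keep)                                   // discard almost everything early
        .saveAsTextFile(s"out/cartesian-chunk-$i")      // hypothetical output path
    }

One Spark job runs per chunk, so only one slice of the cross product is ever
in flight at a time.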
Can you be more specific about numbers?
I am not sure that splitting helps so much in the end; it has the same effect
as executing, a few at a time, the large number of tasks that the full
cartesian join would generate.
The full join is probably intractable no matter what in this case.
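
To put rough numbers on "intractable" (my own back-of-the-envelope
arithmetic, assuming both sides are on the order of 30 million elements):

    // Scale of the full cartesian join, before any filtering:
    val n = 30e6                                 // ~30 million elements per side (assumed)
    val pairs = n * n                            // 9.0e14 candidate pairs
    val bytesPerPair = 100.0                     // assumed average serialized pair size
    val petabytes = pairs * bytesPerPair / 1e15  // ~90 PB if fully materialized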
Hey all – not necessarily writing to get a fix, but more to get an
understanding of what's going on internally here.
I wish to take a cross-product of two very large RDDs (using cartesian), the
product of which is well in excess of what can be stored on disk. Clearly that
is intractable, thus my plan is to split one of the RDDs into smaller pieces,
take the cartesian piecewise, and keep only a tiny fraction of each result.
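
For concreteness, a minimal sketch of the naive version in question (sizes
and names are placeholders, not the poster's code):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("naive-cartesian"))
    val rddA = sc.range(0L, 30000000L)
    val rddB = sc.range(0L, 30000000L)
    // |A| * |B| pairs if realized - far beyond available disk at this scale.
    val crossed = rddA.cartesian(rddB)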