Ok, so it looks like the max number of active tasks reaches 30. I'm not
setting anything explicitly; this is a clean environment with a clean Spark
code checkout. I'll dig further to see why so many tasks are active.
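To make the effect concrete, here is the 1/N cap as I understand it from
the 1.5 ShuffleMemoryManager (a minimal Scala sketch, not the actual code;
the real manager is more involved and blocks/retries rather than failing
immediately):

    // Sketch of the per-task cap: each active task may claim at most
    // maxMemory / numActiveTasks, even if the other active tasks have
    // not allocated anything yet.
    def perTaskCap(maxExecutionMemory: Long, activeTasks: Int): Long =
      maxExecutionMemory / activeTasks

    // With the numbers from this thread: 515396075 bytes of execution
    // memory (16% of a 3GB heap) split across 32 active tasks is a cap
    // of ~15MB per task, so three 4MB page reservations exhaust it.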
Cheers,

On 15 September 2015 at 07:22, Reynold Xin <r...@databricks.com> wrote:

> Yea, I think this is where the heuristic is failing -- it uses 8 cores to
> approximate the number of active tasks, but the tests are somehow using 32
> (maybe because something explicitly sets it to that, or you set it
> yourself? I'm not sure which).
>
> On Mon, Sep 14, 2015 at 11:06 PM, Pete Robbins <robbin...@gmail.com> wrote:
>
>> Reynold, thanks for replying.
>>
>> getPageSize parameters: maxMemory=515396075, numCores=0
>> Calculated values: cores=8, default=4194304
>>
>> So am I getting a large page size because I only have 8 cores?
>>
>> On 15 September 2015 at 00:40, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Pete - can you do me a favor?
>>>
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>>
>>> Print the parameters that are passed into the getPageSize function, and
>>> check their values.
>>>
>>> On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Is this on latest master / branch-1.5?
>>>>
>>>> Out of the box we reserve only 16% (0.2 * 0.8) of the memory for
>>>> execution (e.g. aggregate, join) / shuffle sorting. With a 3GB heap,
>>>> that's 480MB. So each task gets 480MB / 32 = 15MB, and each operator
>>>> reserves at least one page for execution. If your page size is 4MB, it
>>>> only takes 3 operators to use up a task's memory.
>>>>
>>>> The thing is, the page size is determined dynamically -- and in your
>>>> case it should be smaller than 4MB.
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>>>
>>>> Maybe there is a place in the Maven tests where we explicitly set the
>>>> page size (spark.buffer.pageSize) to 4MB? If so, we need to find it
>>>> and remove it.
>>>>
>>>> On Mon, Sep 14, 2015 at 4:16 AM, Pete Robbins <robbin...@gmail.com> wrote:
>>>>
>>>>> I keep hitting errors running the tests on 1.5, such as:
>>>>>
>>>>> - join31 *** FAILED ***
>>>>>   Failed to execute query using catalyst:
>>>>>   Error: Job aborted due to stage failure: Task 9 in stage 3653.0
>>>>>   failed 1 times, most recent failure: Lost task 9.0 in stage 3653.0
>>>>>   (TID 123363, localhost): java.io.IOException: Unable to acquire
>>>>>   4194304 bytes of memory
>>>>>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>>>>>
>>>>> This is using the command:
>>>>> build/mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver test
>>>>>
>>>>> I don't see these errors in any of the AMPLab Jenkins builds. Do those
>>>>> builds have any configuration/environment that I may be missing? My
>>>>> build is running with whatever defaults are in the top-level pom.xml,
>>>>> e.g. -Xmx3G.
>>>>>
>>>>> I can make these tests pass by setting spark.shuffle.memoryFraction=0.6
>>>>> in the HiveCompatibilitySuite rather than the default 0.2 value.
>>>>>
>>>>> Trying to analyze what is going on with the test, it seems related to
>>>>> the number of active tasks, which rises to 32, so the
>>>>> ShuffleMemoryManager allows less memory per task even though most of
>>>>> those tasks do not have any memory allocated to them.
>>>>>
>>>>> Has anyone seen issues like this before?
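PS: for anyone following the getPageSize discussion above, this is my
reading of the heuristic applied to the values I printed (the constants are
assumptions based on the 1.5 code, not verbatim):

    // Approximation of the 1.5-era default page-size heuristic. The
    // safetyFactor and the 1MB/64MB clamp are my assumptions from
    // reading the code, so treat this as a sketch.
    def approxDefaultPageSize(maxMemory: Long, numCores: Int): Long = {
      val minPageSize = 1L * 1024 * 1024        // 1MB
      val maxPageSize = 64L * minPageSize       // 64MB
      val cores =
        if (numCores > 0) numCores else Runtime.getRuntime.availableProcessors()
      val safetyFactor = 16
      // Round maxMemory / cores / safetyFactor up to the next power of two.
      val size = java.lang.Long.highestOneBit(maxMemory / cores / safetyFactor * 2 - 1)
      math.min(maxPageSize, math.max(minPageSize, size))
    }

    // approxDefaultPageSize(515396075L, 0) on my 8-core box:
    // 515396075 / 8 / 16 ~= 4026531, next power of two = 4194304 -- the
    // same 4MB default the debug output shows, and the same 4194304 bytes
    // the failing tasks are unable to acquire.

Which suggests the 4MB page is the computed default when only 8 cores are
visible, rather than an explicit spark.buffer.pageSize setting somewhere in
the build.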