This is the culprit: https://issues.apache.org/jira/browse/SPARK-8406
"2. Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism The major reason for this is that, the original parallelism of 2 is too low to reproduce the data loss issue. Also, higher concurrency may potentially caught more concurrency bugs during testing phase. (It did help us spotted SPARK-8501.)" Specific change: http://git-wip-us.apache.org/repos/asf/spark/blob/0818fdec/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala ---------------------------------------------------------------------- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala index f901bd8..ea325cc 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala @@ -49,7 +49,7 @@ import scala.collection.JavaConversions._ object TestHive extends TestHiveContext( new SparkContext( - System.getProperty("spark.sql.test.master", "local[2]"), + System.getProperty("spark.sql.test.master", "local[32]"), "TestSQLContext", new SparkConf() .set("spark.sql.test", "") Setting that to local[8] to match my cores the HiveCompatibilitySuite passes (and runs so much faster!) so maybe that should be changed to limit threads to num cores? Cheers, On 15 September 2015 at 08:50, Pete Robbins <robbin...@gmail.com> wrote: > Ok so it looks like the max number of active tasks reaches 30. I'm not > setting anything as it is a clean environment with clean spark code > checkout. I'll dig further to see why so many tasks are active. > > Cheers, > > On 15 September 2015 at 07:22, Reynold Xin <r...@databricks.com> wrote: > >> Yea I think this is where the heuristics is failing -- it uses 8 cores to >> approximate the number of active tasks, but the tests somehow is using 32 >> (maybe because it explicitly sets it to that, or you set it yourself? 
>> I'm not sure which one.)
>>
>> On Mon, Sep 14, 2015 at 11:06 PM, Pete Robbins <robbin...@gmail.com>
>> wrote:
>>
>>> Reynold, thanks for replying.
>>>
>>> getPageSize parameters: maxMemory=515396075, numCores=0
>>> Calculated values: cores=8, default=4194304
>>>
>>> So am I getting a large page size as I only have 8 cores?
>>>
>>> On 15 September 2015 at 00:40, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Pete - can you do me a favor?
>>>>
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>>>
>>>> Print the parameters that are passed into the getPageSize function,
>>>> and check their values.
>>>>
>>>> On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Is this on latest master / branch-1.5?
>>>>>
>>>>> Out of the box we reserve only 16% (0.2 * 0.8) of the memory for
>>>>> execution (e.g. aggregate, join) / shuffle sorting. With a 3GB heap,
>>>>> that's 480MB. So each task gets 480MB / 32 = 15MB, and each operator
>>>>> reserves at least one page for execution. If your page size is 4MB,
>>>>> it only takes 3 operators to use up its memory.
>>>>>
>>>>> The thing is, page size is dynamically determined -- and in your case
>>>>> it should be smaller than 4MB.
>>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>>>>
>>>>> Maybe there is a place in the maven tests where we explicitly set the
>>>>> page size (spark.buffer.pageSize) to 4MB? If yes, we need to find it
>>>>> and just remove it.
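[Editor's note] The numbers Pete printed (maxMemory=515396075, cores=8, default=4194304) are consistent with a heuristic of roughly maxMemory / cores / 16, rounded up to a power of two and clamped. The sketch below is reconstructed from those numbers, not copied from ShuffleMemoryManager; the constants `minPageSize`, `maxPageSize`, and `safetyFactor` are assumptions.

```scala
// Reconstruction of the page-size heuristic under discussion (assumed
// constants, not the actual Spark source).
object PageSizeSketch {
  // Smallest power of two >= n.
  def nextPowerOf2(n: Long): Long =
    if ((n & (n - 1)) == 0) n else java.lang.Long.highestOneBit(n) << 1

  def defaultPageSize(maxMemory: Long, numCores: Int): Long = {
    val minPageSize = 1L * 1024 * 1024  // assumed 1MB floor
    val maxPageSize = 64L * minPageSize // assumed 64MB ceiling
    // numCores = 0 (as in Pete's output) falls back to the machine's cores.
    val cores =
      if (numCores > 0) numCores else Runtime.getRuntime.availableProcessors()
    val safetyFactor = 16               // assumed divisor
    val size = nextPowerOf2(maxMemory / cores / safetyFactor)
    math.min(maxPageSize, math.max(minPageSize, size))
  }
}
```

Plugging in Pete's values: 515396075 / 8 / 16 ≈ 4026532, and the next power of two is 4194304 -- exactly the 4MB default he reported. With 32 cores the same formula would give a 1MB page, which is why the heuristic's core count (8) versus the actual active-task count (32) matters.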
>>>>>
>>>>> On Mon, Sep 14, 2015 at 4:16 AM, Pete Robbins <robbin...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I keep hitting errors running the tests on 1.5, such as:
>>>>>>
>>>>>> - join31 *** FAILED ***
>>>>>>   Failed to execute query using catalyst:
>>>>>>   Error: Job aborted due to stage failure: Task 9 in stage 3653.0
>>>>>> failed 1 times, most recent failure: Lost task 9.0 in stage 3653.0
>>>>>> (TID 123363, localhost): java.io.IOException: Unable to acquire
>>>>>> 4194304 bytes of memory
>>>>>>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>>>>>>
>>>>>> This is using the command:
>>>>>> build/mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver test
>>>>>>
>>>>>> I don't see these errors in any of the AMPLab Jenkins builds. Do
>>>>>> those builds have any configuration/environment that I may be
>>>>>> missing? My build is running with whatever defaults are in the top
>>>>>> level pom.xml, e.g. -Xmx3G.
>>>>>>
>>>>>> I can make these tests pass by setting
>>>>>> spark.shuffle.memoryFraction=0.6 in the HiveCompatibilitySuite
>>>>>> rather than the default 0.2 value.
>>>>>>
>>>>>> Trying to analyze what is going on with the test, it is related to
>>>>>> the number of active tasks, which seems to rise to 32, so the
>>>>>> ShuffleMemoryManager allows less memory per task even though most of
>>>>>> those tasks do not have any memory allocated to them.
>>>>>>
>>>>>> Has anyone seen issues like this before?
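[Editor's note] Reynold's back-of-envelope numbers tie the whole thread together, and the arithmetic can be checked directly. This is my own illustration of his fractions, not Spark code; note that 3GB * 0.2 * 0.8 = 515396075 bytes, exactly the maxMemory Pete printed.

```scala
// Per-task execution-memory budget under the fractions quoted above:
// heap * spark.shuffle.memoryFraction * safetyFraction, split across
// the active tasks.
object MemoryBudget {
  def perTaskBytes(heapBytes: Long,
                   memoryFraction: Double,
                   safetyFraction: Double,
                   activeTasks: Int): Long =
    (heapBytes * memoryFraction * safetyFraction).toLong / activeTasks

  // How many fixed-size pages fit in that per-task budget.
  def pagesPerTask(perTask: Long, pageSize: Long): Long = perTask / pageSize
}
```

With a 3GB heap, fraction 0.2, safety 0.8 and 32 active tasks, each task gets about 15MB -- only 3 pages of 4MB before `acquireNewPage` throws the "Unable to acquire 4194304 bytes of memory" error in Pete's log. Raising memoryFraction to 0.6, or shrinking the page size, both enlarge that budget, which is why either workaround makes the suite pass.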