This is the culprit: https://issues.apache.org/jira/browse/SPARK-8406
"2. Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism The major reason for this is that, the original parallelism of 2 is too low to reproduce the data loss issue. Also, higher concurrency may potentially caught more concurrency bugs during testing phase. (It did help us spotted SPARK-8501.)" Specific change: http://git-wip-us.apache.org/repos/asf/spark/blob/0818fdec/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala ---------------------------------------------------------------------- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala index f901bd8..ea325cc 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala @@ -49,7 +49,7 @@ import scala.collection.JavaConversions._ object TestHive extends TestHiveContext( new SparkContext( - System.getProperty("spark.sql.test.master", "local[2]"), + System.getProperty("spark.sql.test.master", "local[32]"), "TestSQLContext", new SparkConf() .set("spark.sql.test", "") Setting that to local[8] to match my cores the HiveCompatibilitySuite passes (and runs so much faster!) so maybe that should be changed to limit threads to num cores? Cheers, On 15 September 2015 at 08:50, Pete Robbins <robbin...@gmail.com> wrote: > Ok so it looks like the max number of active tasks reaches 30. I'm not > setting anything as it is a clean environment with clean spark code > checkout. I'll dig further to see why so many tasks are active. > > Cheers, > > On 15 September 2015 at 07:22, Reynold Xin <r...@databricks.com> wrote: > >> Yea I think this is where the heuristics is failing -- it uses 8 cores to >> approximate the number of active tasks, but the tests somehow is using 32 >> (maybe because it explicitly sets it to that, or you set it yourself? 
>> I'm not sure which one.)
>>
>> On Mon, Sep 14, 2015 at 11:06 PM, Pete Robbins <robbin...@gmail.com>
>> wrote:
>>
>>> Reynold, thanks for replying.
>>>
>>> getPageSize parameters: maxMemory=515396075, numCores=0
>>> Calculated values: cores=8, default=4194304
>>>
>>> So am I getting a large page size as I only have 8 cores?
>>>
>>> On 15 September 2015 at 00:40, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Pete - can you do me a favor?
>>>>
>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>>>
>>>> Print the parameters that are passed into the getPageSize function,
>>>> and check their values.
>>>>
>>>> On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Is this on latest master / branch-1.5?
>>>>>
>>>>> Out of the box we reserve only 16% (0.2 * 0.8) of the memory for
>>>>> execution (e.g. aggregate, join) / shuffle sorting. With a 3GB heap,
>>>>> that's 480MB. So each task gets 480MB / 32 = 15MB, and each operator
>>>>> reserves at least one page for execution. If your page size is 4MB,
>>>>> it only takes 3 operators to use up its memory.
>>>>>
>>>>> The thing is, page size is dynamically determined -- and in your case
>>>>> it should be smaller than 4MB.
>>>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174
>>>>>
>>>>> Maybe there is a place in the maven tests where we explicitly set the
>>>>> page size (spark.buffer.pageSize) to 4MB? If yes, we need to find it
>>>>> and just remove it.
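[Editor's note] The numbers Pete printed (maxMemory=515396075, cores=8, default=4194304) are consistent with a heuristic of roughly maxMemory / cores / 16, rounded up to a power of two and clamped. The sketch below is reconstructed from those numbers, not copied from ShuffleMemoryManager; the constants `minPageSize`, `maxPageSize`, and `safetyFactor` are assumptions.

```scala
// Reconstruction of the page-size heuristic under discussion (assumed
// constants, not the actual Spark source).
object PageSizeSketch {
  // Smallest power of two >= n.
  def nextPowerOf2(n: Long): Long =
    if ((n & (n - 1)) == 0) n else java.lang.Long.highestOneBit(n) << 1

  def defaultPageSize(maxMemory: Long, numCores: Int): Long = {
    val minPageSize = 1L * 1024 * 1024  // assumed 1MB floor
    val maxPageSize = 64L * minPageSize // assumed 64MB ceiling
    // numCores = 0 (as in Pete's output) falls back to the machine's cores.
    val cores =
      if (numCores > 0) numCores else Runtime.getRuntime.availableProcessors()
    val safetyFactor = 16               // assumed divisor
    val size = nextPowerOf2(maxMemory / cores / safetyFactor)
    math.min(maxPageSize, math.max(minPageSize, size))
  }
}
```

Plugging in Pete's values: 515396075 / 8 / 16 ≈ 4026532, and the next power of two is 4194304 -- exactly the 4MB default he reported. With 32 cores the same formula would give a 1MB page, which is why the heuristic's core count (8) versus the actual active-task count (32) matters.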
>>>>>
>>>>> On Mon, Sep 14, 2015 at 4:16 AM, Pete Robbins <robbin...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I keep hitting errors running the tests on 1.5, such as:
>>>>>>
>>>>>> - join31 *** FAILED ***
>>>>>>   Failed to execute query using catalyst:
>>>>>>   Error: Job aborted due to stage failure: Task 9 in stage 3653.0
>>>>>> failed 1 times, most recent failure: Lost task 9.0 in stage 3653.0
>>>>>> (TID 123363, localhost): java.io.IOException: Unable to acquire
>>>>>> 4194304 bytes of memory
>>>>>>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>>>>>>
>>>>>> This is using the command:
>>>>>> build/mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver test
>>>>>>
>>>>>> I don't see these errors in any of the AMPLab Jenkins builds. Do
>>>>>> those builds have any configuration/environment that I may be
>>>>>> missing? My build is running with whatever defaults are in the top
>>>>>> level pom.xml, e.g. -Xmx3G.
>>>>>>
>>>>>> I can make these tests pass by setting
>>>>>> spark.shuffle.memoryFraction=0.6 in the HiveCompatibilitySuite
>>>>>> rather than the default 0.2 value.
>>>>>>
>>>>>> Trying to analyze what is going on with the test, it is related to
>>>>>> the number of active tasks, which seems to rise to 32, so the
>>>>>> ShuffleMemoryManager allows less memory per task even though most of
>>>>>> those tasks do not have any memory allocated to them.
>>>>>>
>>>>>> Has anyone seen issues like this before?
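[Editor's note] Reynold's back-of-envelope numbers tie the whole thread together, and the arithmetic can be checked directly. This is my own illustration of his fractions, not Spark code; note that 3GB * 0.2 * 0.8 = 515396075 bytes, exactly the maxMemory Pete printed.

```scala
// Per-task execution-memory budget under the fractions quoted above:
// heap * spark.shuffle.memoryFraction * safetyFraction, split across
// the active tasks.
object MemoryBudget {
  def perTaskBytes(heapBytes: Long,
                   memoryFraction: Double,
                   safetyFraction: Double,
                   activeTasks: Int): Long =
    (heapBytes * memoryFraction * safetyFraction).toLong / activeTasks

  // How many fixed-size pages fit in that per-task budget.
  def pagesPerTask(perTask: Long, pageSize: Long): Long = perTask / pageSize
}
```

With a 3GB heap, fraction 0.2, safety 0.8 and 32 active tasks, each task gets about 15MB -- only 3 pages of 4MB before `acquireNewPage` throws the "Unable to acquire 4194304 bytes of memory" error in Pete's log. Raising memoryFraction to 0.6, or shrinking the page size, both enlarge that budget, which is why either workaround makes the suite pass.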