drow blonde messi created SPARK-16766:
-----------------------------------------

             Summary: TakeOrderedAndProjectExec easily cause OOM
                 Key: SPARK-16766
                 URL: https://issues.apache.org/jira/browse/SPARK-16766
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0, 1.6.2
            Reporter: drow blonde messi
            Priority: Critical


I found that a very simple SQL statement can easily cause an OOM.


Like this:
"insert into xyz2 select * from xyz order by x limit 900000000;"



The problem is obvious: TakeOrderedAndProjectExec always allocates a huge Object 
array (array size equal to the limit count) whenever executeCollect or 
doExecute is called.
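A back-of-envelope sketch (not Spark code) of why that allocation alone is fatal: with the limit of 900,000,000 from the statement above, the reference array needs several GiB before a single row is stored (assuming 8-byte references, i.e. a 64-bit JVM without compressed oops):

```java
public class LimitAllocSketch {
    public static void main(String[] args) {
        // Limit taken from the example SQL statement above.
        long limit = 900_000_000L;
        // Assumption: 8 bytes per reference (64-bit JVM, no compressed oops).
        // The Object[] alone costs ~6.7 GiB before any row data is stored.
        long arrayBytes = limit * 8L;
        System.out.println(arrayBytes);              // 7200000000
        System.out.println(arrayBytes / (1L << 30)); // 6 (GiB, rounded down)
    }
}
```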


In Spark 1.6, terminal and non-terminal TakeOrderedAndProject work the same way: 
both call RDD.takeOrdered(limit), which builds a huge BoundedPriorityQueue for 
every partition.
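For reference, the idea behind a bounded priority queue can be sketched as a size-capped max-heap (illustrative code, not Spark's actual BoundedPriorityQueue): elements beyond the cap are evicted, but the capacity is the limit itself, so every partition still holds up to `limit` entries.

```java
import java.util.PriorityQueue;

// Sketch of a size-bounded top-k via a max-heap (names are illustrative).
public class BoundedTopK {
    public static PriorityQueue<Integer> topK(int[] input, int k) {
        // Max-heap: the largest of the k smallest sits at the head,
        // so eviction is a cheap poll().
        PriorityQueue<Integer> heap = new PriorityQueue<>((a, b) -> b - a);
        for (int v : input) {
            if (heap.size() < k) {
                heap.offer(v);
            } else if (v < heap.peek()) {
                heap.poll();
                heap.offer(v);
            }
        }
        return heap; // up to k elements retained -- O(k) memory per partition
    }

    public static void main(String[] args) {
        // Keeps the 2 smallest of {5, 1, 4, 2, 3}, i.e. 1 and 2.
        System.out.println(topK(new int[]{5, 1, 4, 2, 3}, 2));
    }
}
```

With limit = 900,000,000, O(k) per partition is exactly the problem: the bound does not help when k itself is enormous.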

In Spark 2.0, the non-terminal TakeOrderedAndProject switched to 
org.apache.spark.util.collection.Utils.takeOrdered, but the problem still 
exists: the expression ordering.leastOf(input.asJava, num).iterator.asScala 
calls the leastOf method of com.google.common.collect.Ordering, and a large 
Object array is still produced:

    int bufferCap = k * 2;
    @SuppressWarnings("unchecked") // we'll only put E's in
    E[] buffer = (E[]) new Object[bufferCap];
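So with k = 900,000,000 from the example above, Guava's buffer is twice the limit: 1.8 billion references, roughly 13 GiB on a 64-bit JVM with 8-byte references, allocated up front before any comparison happens (a back-of-envelope sketch, not Spark code):

```java
public class GuavaBufferMath {
    public static void main(String[] args) {
        long k = 900_000_000L;       // limit from the example SQL statement
        long bufferCap = k * 2;      // Guava leastOf's buffer sizing, as quoted above
        // Assumption: 8 bytes per reference (64-bit JVM, no compressed oops).
        long bytes = bufferCap * 8L;
        System.out.println(bufferCap);          // 1800000000
        System.out.println(bytes / (1L << 30)); // 13 (GiB, rounded down)
    }
}
```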

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
