GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/19831

    [SPARK-22489][SQL] Wrong Hive table statistics may trigger OOM if enables join reorder in CBO

    ## What changes were proposed in this pull request?
    
    How to reproduce:
    ```bash
    bin/spark-shell --conf spark.sql.cbo.enabled=true --conf spark.sql.cbo.joinReorder.enabled=true
    ```
    ```scala
    import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
    
    spark.sql("CREATE TABLE small (c1 bigint) TBLPROPERTIES ('numRows'='3', 'rawDataSize'='600', 'totalSize'='800')")
    // Big table with wrong statistics, numRows=0
    spark.sql("CREATE TABLE big (c1 bigint) TBLPROPERTIES ('numRows'='0', 'rawDataSize'='60000000000', 'totalSize'='8000000000000')")
    
    val plan = spark.sql("select * from small t1 join big t2 on (t1.c1 = t2.c1)").queryExecution.executedPlan
    val buildSide = plan.children.head.asInstanceOf[BroadcastHashJoinExec].buildSide
    
    println(buildSide)
    ```
    
    The result is `BuildRight`, but the right side is the big table.
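    
    One way to see why the wrong side can be picked is the simplified model below. It is only a sketch, not Spark's code: `estimatedSize` and the 16-byte `rowWidth` are hypothetical, and it assumes that with CBO enabled the size estimate is derived from the row count rather than from `totalSize`.
    
    ```scala
    // Hypothetical, simplified size estimate; not Spark's actual implementation.
    // Assumption: with CBO on and a row count present, size = rows * rowWidth,
    // clamped to at least 1 byte; otherwise the recorded totalSize is used.
    def estimatedSize(
        totalSize: BigInt,
        rowCount: Option[BigInt],
        rowWidth: Int,
        cboEnabled: Boolean): BigInt =
      rowCount match {
        case Some(rows) if cboEnabled => (rows * rowWidth).max(1)
        case _ => totalSize
      }
    
    // Statistics taken from the reproduction above; rowWidth = 16 is an assumed value.
    val smallEstimate = estimatedSize(BigInt(800), Some(BigInt(3)), rowWidth = 16, cboEnabled = true)
    val bigEstimate = estimatedSize(BigInt("8000000000000"), Some(BigInt(0)), rowWidth = 16, cboEnabled = true)
    ```
    
    Under these assumptions `bigEstimate` is smaller than `smallEstimate`, so a planner that broadcasts the side with the smaller estimate would choose `big`.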
    
    
    For the `big` table, `totalSize` and `rawDataSize` are > 0 while `rowCount` = 0, so at least one of these statistics is wrong.
    
https://github.com/apache/spark/blob/ed7352f2191308965a1b2abb6cd075a90b7f7bb7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L432-L434
    
    This PR ensures that when `totalSize` or `rawDataSize` is > 0, `rowCount` must also be > 0.
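    
    A minimal sketch of such a check, assuming that a zero row count next to a positive data size is simply treated as unknown. `HiveTableStats` and `sanitizeStats` are hypothetical names used for illustration and are not the code in `HiveClientImpl`; preferring `totalSize` over `rawDataSize` is also an assumption.
    
    ```scala
    // Hypothetical stand-in for the statistics read from Hive table properties.
    final case class HiveTableStats(sizeInBytes: BigInt, rowCount: Option[BigInt])
    
    // Illustrative helper: keep the row count only when it is positive.
    def sanitizeStats(
        totalSize: Option[BigInt],
        rawDataSize: Option[BigInt],
        numRows: Option[BigInt]): Option[HiveTableStats] = {
      // numRows = 0 alongside a positive size is not trusted, so it is dropped.
      val trustedRowCount = numRows.filter(_ > 0)
      totalSize.filter(_ > 0)
        .orElse(rawDataSize.filter(_ > 0))
        .map(size => HiveTableStats(size, trustedRowCount))
    }
    
    // The `big` table from the reproduction: positive sizes, numRows = 0.
    val bigStats = sanitizeStats(
      Some(BigInt("8000000000000")), Some(BigInt("60000000000")), Some(BigInt(0)))
    ```
    
    Here `bigStats` keeps the recorded size but carries no row count, so a size estimate can no longer be derived from the bogus `numRows = 0`.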
    
    ## How was this patch tested?
    
    unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-22626

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19831.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19831
    
----
commit ed7352f2191308965a1b2abb6cd075a90b7f7bb7
Author: Yuming Wang <[email protected]>
Date:   2017-11-28T08:56:16Z

    if dataSize > 0, rowCount should bigger than 0.

commit b16f88ef971040e682fafe28f0ff06877814e3df
Author: Yuming Wang <[email protected]>
Date:   2017-11-28T10:33:46Z

    add test

----

