GitHub user jzhuge opened a pull request:

    https://github.com/apache/spark/pull/21911

    [SPARK-24940][SQL] Coalesce Hint for SQL Queries

    ## What changes were proposed in this pull request?
    
    Many Spark SQL users at my company have asked for a way to control the 
number of output files produced by a Spark SQL query. They prefer not to use the 
functions repartition(n) or coalesce(n, shuffle), which require them to write and 
deploy Scala/Java/Python code. We propose adding the following Hive-style 
coalesce hints to Spark SQL:
    ```
    /*+ COALESCE(numPartitions[, shuffle]) */
    /*+ REPARTITION(numPartitions[, shuffle]) */
    ```
    Multiple hints are allowed. Each hint inserts its own node into the logical 
plan, and the optimizer collapses them and picks the winner.
    ```
    INSERT INTO s /*+ REPARTITION(100), COALESCE(500, true), COALESCE(10) */ 
SELECT * FROM t
    
    == Logical Plan ==
    'InsertIntoTable 'UnresolvedRelation `s`, false, false
    +- 'UnresolvedHint REPARTITION, [100]
       +- 'UnresolvedHint COALESCE, [500, true]
          +- 'UnresolvedHint COALESCE, [10]
             +- 'Project [*]
                +- 'UnresolvedRelation `t`
    
    == Optimized Logical Plan ==
    InsertIntoHadoopFsRelationCommand ...
    +- Repartition 100, true
       +- HiveTableRelation ...
    ```
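
To make "the optimizer picks the winner" concrete, here is a minimal Python sketch of how a stack of repartition nodes might collapse. It is a simplified model, not code from this patch: the `Repartition` class and the functions `hint_to_node` and `collapse` are hypothetical names, the default of `shuffle=false` for `COALESCE(n)` is an assumption, and the collapse logic assumes behavior along the lines of Catalyst's `CollapseRepartition` rule (an outer shuffling node replaces its repartition child; an outer non-shuffling coalesce over a shuffling child is dropped if it would not reduce the partition count).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Repartition:
    """Simplified stand-in for a logical Repartition node (hypothetical model)."""
    num_partitions: int
    shuffle: bool


def hint_to_node(name: str, args: Sequence) -> Repartition:
    """Map a hint such as COALESCE(500, true) to a node.

    Assumed defaults: REPARTITION shuffles, COALESCE does not.
    """
    defaults = {"REPARTITION": True, "COALESCE": False}
    key = name.upper()
    if key not in defaults:
        raise ValueError(f"unsupported hint: {name}")
    shuffle = bool(args[1]) if len(args) > 1 else defaults[key]
    return Repartition(int(args[0]), shuffle)


def _add_parent(parent: Repartition, kept: List[Repartition]) -> List[Repartition]:
    """Place `parent` on top of `kept` (outermost first), collapsing where possible."""
    if not kept:
        return [parent]
    child = kept[0]
    if not parent.shuffle and child.shuffle:
        # A non-shuffling coalesce cannot increase the partition count,
        # so it is a no-op unless it asks for fewer partitions than the child.
        if parent.num_partitions >= child.num_partitions:
            return kept
        return [parent] + kept
    # In every other case the parent makes the child redundant.
    return _add_parent(parent, kept[1:])


def collapse(nodes: List[Repartition]) -> List[Repartition]:
    """Collapse nodes listed outermost first; returns the surviving nodes."""
    kept: List[Repartition] = []
    for node in reversed(nodes):  # process bottom-up, innermost node first
        kept = _add_parent(node, kept)
    return kept


# The hints from the example above, outermost first.
hints = [("REPARTITION", (100,)), ("COALESCE", (500, True)), ("COALESCE", (10,))]
plan = collapse([hint_to_node(name, args) for name, args in hints])
print(plan)
```

Under these assumptions the three hints collapse to a single shuffling node with 100 partitions, which agrees with the `Repartition 100, true` node in the optimized plan above.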
    
    Coalesce hints only apply to INSERT, while Broadcast hints only apply to 
SELECT. Unfortunately, hints attached to the wrong command are silently 
ignored. I haven't found a minimal approach to improve this error checking. 
One option is to add more hint syntax definitions to `SqlBase.g4`; another is to 
enhance the generic hint framework. Is either desirable? Any suggestions are 
welcome.
    
    ## How was this patch tested?
    
    All unit tests. Manual tests using explain.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jzhuge/spark SPARK-24940

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21911.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21911
    
----
commit 4baa2c43b2338ceb68c434a9e854bc0915cf8611
Author: John Zhuge <jzhuge@...>
Date:   2018-07-28T01:46:42Z

    [SPARK-24940][SQL] Coalesce Hint for SQL Queries

----

