[ 
https://issues.apache.org/jira/browse/SPARK-24940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-24940:
-------------------------------
    Description: 
Many Spark SQL users in my company have asked for a way to control the number 
of output files in Spark SQL. The users prefer not to use function 
repartition(n) or coalesce(n) that require them to write and deploy 
Scala/Java/Python code.
   
 The DataFrame API has repartition/coalesce for a long time. However, we do not 
have an equivalent functionality in SQL queries. We propose adding the 
following Hive-style Coalesce and Repartition Hint to Spark SQL.
{noformat}
INSERT ... SELECT /*+ COALESCE(numPartitions) */ ...
INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...
{noformat}
Hint names are case insensitive.

Coalesce Hint reduces the number of partitions. It only merges partitions thus 
minimizes the data movement.

Repartition Hint can either increase or decrease the number of partitions. It 
performs full shuffle of data and ensures data is equally distributed.

Repartition adds a new stage, so it does not affect the parallelism of the 
existing stage. In contrast, Coalesce does affect the parallelism of the 
existing stage since it does not add a new stage.

Multiple Inserts Queries and Named Subqueries are also supported.

  was:
Many Spark SQL users in my company have asked for a way to control the number 
of output files in Spark SQL. The users prefer not to use function 
repartition\(n\) or coalesce(n, shuffle) that require them to write and deploy 
Scala/Java/Python code.
  
 There are use cases to either reduce or increase the number.
  
 The DataFrame API has repartition/coalesce for a long time. However, we do not 
have an equivalent functionality in SQL queries. We propose adding the 
following Hive-style Coalesce hint to Spark SQL.
{noformat}
/*+ COALESCE(n, shuffle) */
/*+ REPARTITION(n) */
{noformat}
REPARTITION\(n\) is equal to COALESCE(n, shuffle=true).


> Coalesce and Repartition Hint for SQL Queries
> ---------------------------------------------
>
>                 Key: SPARK-24940
>                 URL: https://issues.apache.org/jira/browse/SPARK-24940
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: John Zhuge
>            Assignee: John Zhuge
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Many Spark SQL users in my company have asked for a way to control the number 
> of output files in Spark SQL. The users prefer not to use function 
> repartition(n) or coalesce(n) that require them to write and deploy 
> Scala/Java/Python code.
>    
>  The DataFrame API has repartition/coalesce for a long time. However, we do 
> not have an equivalent functionality in SQL queries. We propose adding the 
> following Hive-style Coalesce and Repartition Hint to Spark SQL.
> {noformat}
> INSERT ... SELECT /*+ COALESCE(numPartitions) */ ...
> INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...
> {noformat}
> Hint names are case insensitive.
> Coalesce Hint reduces the number of partitions. It only merges partitions 
> thus minimizes the data movement.
> Repartition Hint can either increase or decrease the number of partitions. It 
> performs full shuffle of data and ensures data is equally distributed.
> Repartition adds a new stage, so it does not affect the parallelism of the 
> existing stage. In contrast, Coalesce does affect the parallelism of the 
> existing stage since it does not add a new stage.
> Multiple Inserts Queries and Named Subqueries are also supported.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to