[ 
https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098210#comment-17098210
 ] 

Hyukjin Kwon commented on SPARK-31588:
--------------------------------------

There are many other workarounds already. Can you show a set of more concrete 
examples?
I don't think it's a good idea to use the same repartition or coalesce with a 
hard-coded partition number for every job by default.

> merge small files may need more common setting
> ----------------------------------------------
>
>                 Key: SPARK-31588
>                 URL: https://issues.apache.org/jira/browse/SPARK-31588
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>         Environment: spark:2.4.5
> hdp:2.7
>            Reporter: philipse
>            Priority: Major
>
> Hi ,
> SparkSql now allows us to use repartition or coalesce hints to manually control 
> small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we need to decide whether to use 
> COALESCE or REPARTITION. Can we try a more general way that removes this 
> decision by setting a target file size, as Hive does?
> *Good points:*
> 1) we will also know the new number of partitions
> 2) with an ON-OFF parameter provided, users can disable it if needed
> 3) the parameter can be set at the cluster level instead of on the user side, 
> making it easier to control small files
> 4) greatly reduces the pressure on the namenode
>  
> *Not good points:*
> 1) It will add a new task to calculate the target number of files by collecting 
> statistics on the output files.
>  
> I don't know whether this is planned for the future.
>  
> Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
