[
https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098210#comment-17098210
]
Hyukjin Kwon commented on SPARK-31588:
--------------------------------------
There are many other workarounds already. Can you show a set of more concrete
examples?
I don't think it's a good idea to apply the same repartition or coalesce with a
hard-coded partition number to every job by default.
> merge small files may need more common setting
> ----------------------------------------------
>
> Key: SPARK-31588
> URL: https://issues.apache.org/jira/browse/SPARK-31588
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.5
> Environment: spark:2.4.5
> hdp:2.7
> Reporter: philipse
> Priority: Major
>
> Hi,
> Spark SQL now allows us to use repartition or coalesce hints to manually
> control small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we need to decide whether to use
> COALESCE or REPARTITION. Could we try a more general way that removes this
> decision by setting a target file size, as Hive does?
> *Good points:*
> 1) We also get the new partition number computed automatically.
> 2) With an ON-OFF parameter provided, users can disable it if needed.
> 3) The parameter can be set at the cluster level instead of on the user side,
> which makes it much easier to control small files.
> 4) It greatly reduces the pressure on the NameNode.
>
> *Not good points:*
> 1) It adds a new task that calculates the target partition number from
> statistics on the output files.
>
> I don't know whether this has been planned for the future.
>
> Thanks
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]