[ https://issues.apache.org/jira/browse/SPARK-31588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102414#comment-17102414 ]
philipse commented on SPARK-31588:
----------------------------------

Yes, the block size can be controlled in HDFS. I mean we could just take the block size as one of the conditions: if we can control the target size in Spark, we can control the real data layout in HDFS, instead of using repartition as a hard limit.

> merge small files may need a more common setting
> ------------------------------------------------
>
>                 Key: SPARK-31588
>                 URL: https://issues.apache.org/jira/browse/SPARK-31588
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>        Environment: spark:2.4.5
>                     hdp:2.7
>            Reporter: philipse
>            Priority: Major
>
> Hi,
> Spark SQL now allows us to use repartition or coalesce hints to manually control small files, like the following:
> /*+ REPARTITION(1) */
> /*+ COALESCE(1) */
> But this can only be tuned case by case: we need to decide whether to use COALESCE or REPARTITION. Could we try a more common way that avoids that decision, by setting a target size as Hive does?
>
> *Good points:*
> 1) We also get the new partition number automatically.
> 2) With an ON/OFF parameter provided, users can disable it if needed.
> 3) The parameter can be set at cluster level instead of on the user side, making it much easier to control small files.
> 4) It greatly reduces the pressure on the NameNode.
>
> *Not good points:*
> 1) It adds a new task to calculate the target file count from statistics of the output files.
>
> I don't know whether we have planned this for the future.
>
> Thanks

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
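A minimal sketch of the arithmetic the proposal implies (the helper name and parameters are hypothetical, not an existing Spark API): derive the number of output partitions from the estimated total output size and a configurable target file size (for example, the HDFS block size), instead of hard-coding REPARTITION(1):

```python
import math

def target_partitions(total_output_bytes: int, target_file_size: int) -> int:
    """Hypothetical helper: choose a partition count so each written
    file ends up close to target_file_size (e.g. the HDFS block size)."""
    if total_output_bytes <= 0:
        return 1
    return max(1, math.ceil(total_output_bytes / target_file_size))

# e.g. 10 GiB of output with a 128 MiB target size yields 80 files
# rather than one huge file or thousands of tiny ones
print(target_partitions(10 * 1024**3, 128 * 1024**2))
```

With a cluster-level target-size setting, this count could be computed from output statistics automatically, which is the extra task the "Not good points" section refers to.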