[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644141#comment-17644141
 ] 

Zhe Dong commented on SPARK-41386:
----------------------------------

 
{noformat}
    if (mapStats.isEmpty ||
      mapStats.get.bytesByPartitionId.forall(b =>
        b <= advisorySize && b >= advisorySize * smallPartitionFactor)) {
      return shuffle
    }

--------------------------------------------------------------------------------
      if (bytes > targetSize) {
        ...
      } else if (bytes < targetSize * smallPartitionFactor) {
        CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
      } else {
        return shuffle // dummy
      }
{noformat}
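
The early-return condition in the first snippet can be tried outside Spark. The following is only a sketch: it uses a plain Seq[Long] in place of the map output statistics, and the object and method names (RebalanceCheck, allWithinRange) are hypothetical, not part of Spark. It uses a named lambda parameter so that each partition size is tested against both bounds.

```scala
object RebalanceCheck {
  // Hypothetical helper, not a Spark API: returns true when every partition
  // size lies in [advisorySize * smallPartitionFactor, advisorySize].
  def allWithinRange(
      bytesByPartitionId: Seq[Long],
      advisorySize: Long,
      smallPartitionFactor: Double): Boolean = {
    val lowerBound = advisorySize * smallPartitionFactor
    bytesByPartitionId.forall(b => b <= advisorySize && b >= lowerBound)
  }
}
```

With a 20 MiB advisory size and a factor of 0.5, a partition of 3 MiB makes the check return false, so under this proposal the shuffle would not be returned unchanged.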
 

 

> There are some small files when using rebalance(column)
> -------------------------------------------------------
>
>                 Key: SPARK-41386
>                 URL: https://issues.apache.org/jira/browse/SPARK-41386
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Zhe Dong
>            Priority: Minor
>
> *Problem (REBALANCE(column))*:
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m")
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5")
> {noformat}
> So we expect every file to be at least 20m * 0.5 = 10m.
> In fact, we got some smaller files, as in the following listing:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00000-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00001-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00002-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00003-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 .../part-00004-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 .../part-00005-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> {noformat}
> 9.1 M and 3.0 M are smaller than 10 M. We have to handle these small files in
> another way.
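
The arithmetic in the report can be reproduced with a few lines of Scala. This is only an illustration: the object and value names (SmallFileCheck, minExpectedMb, belowThreshold) are made up for this sketch, and the sizes are copied from the file listing in megabytes and compared directly against the threshold, as the report does.

```scala
object SmallFileCheck {
  // Values taken from the reported session config: "20m" and "0.5".
  val advisoryMb: Double = 20.0
  val smallPartitionFactor: Double = 0.5

  // Minimum size the reporter expects: 20 * 0.5 = 10 MB.
  val minExpectedMb: Double = advisoryMb * smallPartitionFactor

  // File sizes in MB as shown in the listing (part-00000 .. part-00005).
  val fileSizesMb: Seq[Double] = Seq(12.1, 12.1, 12.1, 12.1, 9.1, 3.0)

  // Files that fall below the expected minimum.
  val belowThreshold: Seq[Double] = fileSizesMb.filter(_ < minExpectedMb)
}
```

Here belowThreshold holds the two sizes called out in the report, 9.1 and 3.0.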


