Re: [I] [VL] Gluten write more shuffle data than vanilla Spark [incubator-gluten]

via GitHub Thu, 27 Feb 2025 20:01:53 -0800


boneanxs commented on issue #8833:
URL: 
https://github.com/apache/incubator-gluten/issues/8833#issuecomment-2689663025


   The output consists of two columns, and the partition number is set to 100.
   
   ```
   Input [3]: [hash_partition_key#1821, strategies#193, attributes#197]
   Arguments: hashpartitioning(cast(get_json_object(attributes#197, $.SellerID) as bigint), 100, None), ENSURE_REQUIREMENTS, [strategies#193, attributes#197], [id=#1768], [shuffle_writer_type=hash]
   ```
   After switching the shuffle writer to the sort type, it looks normal now.
   
   Besides, I see that these two thresholds default to very high values. Is the 
intent to avoid sort-based shuffle (perhaps because hash shuffle is usually faster)?
   
   ```scala
     val COLUMNAR_SHUFFLE_SORT_PARTITIONS_THRESHOLD =
       buildConf("spark.gluten.sql.columnar.shuffle.sort.partitions.threshold")
         .internal()
         .doc("The threshold to determine whether to use sort-based columnar shuffle. Sort-based " +
           "shuffle will be used if the number of partitions is greater than this threshold.")
         .intConf
         .createWithDefault(100000)
   
     val COLUMNAR_SHUFFLE_SORT_COLUMNS_THRESHOLD =
       buildConf("spark.gluten.sql.columnar.shuffle.sort.columns.threshold")
         .internal()
         .doc("The threshold to determine whether to use sort-based columnar shuffle. Sort-based " +
           "shuffle will be used if the number of columns is greater than this threshold.")
         .intConf
         .createWithDefault(100000)
   ```
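   
   As I read the two doc strings, the selection rule could be sketched roughly 
as below. This is only an illustration, not Gluten's actual code: the object 
`ShuffleWriterChoice`, the function `chooseWriter`, and the choice to combine 
the two checks with OR are all my assumptions. Only the default value 100000 
and the "greater than" wording come from the config entries quoted above.
   
   ```scala
   // Hypothetical sketch of the rule described by the doc strings; not Gluten's implementation.
   object ShuffleWriterChoice {
     // Defaults copied from the two config entries quoted above.
     val DefaultPartitionsThreshold = 100000
     val DefaultColumnsThreshold = 100000
   
     // Returns "sort" when either count is greater than its threshold, otherwise "hash".
     // Combining the two checks with OR is an assumption.
     def chooseWriter(
         numPartitions: Int,
         numColumns: Int,
         partitionsThreshold: Int = DefaultPartitionsThreshold,
         columnsThreshold: Int = DefaultColumnsThreshold): String =
       if (numPartitions > partitionsThreshold || numColumns > columnsThreshold) "sort"
       else "hash"
   
     def main(args: Array[String]): Unit = {
       // The plan above: 100 partitions, 2 output columns, default thresholds.
       println(chooseWriter(numPartitions = 100, numColumns = 2)) // hash
       // With the partitions threshold lowered below 100, the same plan picks sort.
       println(chooseWriter(100, 2, partitionsThreshold = 99)) // sort
     }
   }
   ```
   
   Under this reading, a job with 100 partitions and 2 columns never reaches 
sort-based shuffle with the defaults, which matches the `shuffle_writer_type=hash` 
in the plan.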
   
   Are there any best practices for choosing between hash-based and sort-based 
shuffle when using Gluten? If this selection could be automated, future Gluten 
upgrades would be simpler. Currently some jobs perform poorly because of this.
   