c21 commented on pull request #29566:
URL: https://github.com/apache/spark/pull/29566#issuecomment-682418037


   > So any improvement in either memory usage, GC, or CPU time by switching to open hashset ?
   
   @agrawaldevesh - here is a report for one internal production query that does a FULL OUTER JOIN between one large table and one small table. I couldn't share the exact query text and data, but the query shape is:
   
   ```sql
   INSERT OVERWRITE TABLE output_table
   PARTITION (...)
   SELECT ...
   FROM large_table a
   FULL OUTER JOIN small_table b
   ON a.col_x = b.col_y
   ```
   
   Input metrics:
   * large_table (ORC format): uncompressed input size: 54 TB
   * small_table (ORC format): uncompressed input size: 85 GB
   
   Execution metrics:
   
   Number of tasks per stage:
   * stage 0: 40547 (read the large table)
   * stage 1: 158 (read the small table)
   * stage 2: 5063 (SHJ and insert into the output table)
   
   Total shuffle bytes across executors: 15.1 TB
   
   Query type | Aggregated executors CPU time (ms) | Aggregated executors GC time (ms)
   ------------ | ------------- | -------------
   use Java `HashSet` | 3.48 B | 124.0 M
   use Spark `OpenHashSet` | 3.22 B | 91.2 M
   
   TL;DR: switching to `OpenHashSet` gives roughly a 7% reduction in aggregated CPU time and a 27% reduction in GC time for this query.
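   
   For context on where the win comes from, here is a minimal, hypothetical sketch (not the actual code in this PR) of tracking matched build-side row indices with a boxing `java.util.HashSet[java.lang.Long]` versus Spark's primitive-specialized `OpenHashSet[Long]`. The package and names below are made up for illustration; since `OpenHashSet` is `private[spark]`, the sketch declares a package under `org.apache.spark` so it compiles against spark-core.
   
   ```scala
   package org.apache.spark.sketch   // hypothetical package, only to access the private[spark] class
   
   import org.apache.spark.util.collection.OpenHashSet
   
   object MatchedRowTracking {
   
     // Boxing variant: each index is wrapped into a java.lang.Long object, so the
     // set holds millions of small heap objects that the GC has to trace.
     def trackWithJavaHashSet(rowIndices: Iterator[Long]): java.util.HashSet[java.lang.Long] = {
       val matched = new java.util.HashSet[java.lang.Long]()
       rowIndices.foreach(i => matched.add(i))   // autoboxes every index
       matched
     }
   
     // Primitive variant: OpenHashSet is specialized on Long and keeps the values
     // in a flat array with open addressing, so no per-element objects are allocated.
     def trackWithOpenHashSet(rowIndices: Iterator[Long]): OpenHashSet[Long] = {
       val matched = new OpenHashSet[Long]()
       rowIndices.foreach(matched.add)           // stores the primitive directly
       matched
     }
   
     def main(args: Array[String]): Unit = {
       val matched = trackWithOpenHashSet((0L until 1000000L).iterator)
       // In a full outer SHJ, build-side rows whose index is absent from the set
       // are emitted with nulls on the stream side after the probe phase.
       println(s"tracked ${matched.size} matched build-side row indices")
     }
   }
   ```
   
   With tens of billions of probe-side rows, avoiding the per-element boxing and the extra object graph is consistent with the CPU and GC savings in the table above.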

