c21 commented on pull request #29566: URL: https://github.com/apache/spark/pull/29566#issuecomment-682418037
> So any improvement in either memory usage, GC, or CPU time by switching to open hashset ?

@agrawaldevesh - here is a report for one internal production query that does a FULL OUTER JOIN between one large table and one small table. I can't share the exact query text and data, but the query shape is:

```sql
INSERT OVERWRITE TABLE output_table PARTITION (...)
SELECT ...
FROM large_table a
FULL OUTER JOIN small_table b
ON a.col_x = b.col_y
```

Input metrics:

* large_table (ORC format): uncompressed input size 54 TB
* small_table (ORC format): uncompressed input size 85 GB

Execution metrics:

* Number of tasks per stage:
  * stage 0: 40547 (read large table)
  * stage 1: 158 (read small table)
  * stage 2: 5063 (SHJ and insert into output table)
* Total shuffle bytes across executors: 15.1 TB

Query type | Aggregated executors CPU time (ms) | Aggregated executors GC time (ms)
------------ | ------------- | -------------
use java `HashSet` | 3.48 B | 124.0 M
use spark `OpenHashSet` | 3.22 B | 91.2 M

TL;DR: switching to `OpenHashSet` gives roughly a 7% reduction in aggregated CPU time and a 27% reduction in GC time.
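
For readers wondering what the swap looks like in code, here is a minimal sketch of tracking matched build-side rows with Spark's `OpenHashSet` instead of `java.util.HashSet`. This is illustrative only, not the PR's actual implementation: `OpenHashSet` is `private[spark]`, so the sketch assumes it is compiled inside a Spark package, and the helper names `markMatched`/`isMatched` are made up for the example.

```scala
import org.apache.spark.util.collection.OpenHashSet

// A java.util.HashSet[java.lang.Long] boxes every row index into an object,
// which inflates heap usage and GC work. OpenHashSet is specialized for
// primitive Long keys and stores them in a flat open-addressed array.
val matchedBuildRows = new OpenHashSet[Long](64)  // initial capacity chosen arbitrarily here

// During the probe phase of the full outer join: remember which build-side
// rows found at least one match.
def markMatched(buildRowIndex: Long): Unit = matchedBuildRows.add(buildRowIndex)

// After the probe phase: build-side rows that never matched are emitted
// null-padded on the stream side to satisfy FULL OUTER semantics.
def isMatched(buildRowIndex: Long): Boolean = matchedBuildRows.contains(buildRowIndex)
```

The avoided per-key boxing and object churn is the most plausible source of the CPU and GC reductions reported in the table above.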
