SparksFyz commented on pull request #34953:
URL: https://github.com/apache/spark/pull/34953#issuecomment-999541472
We encountered a problem when executing SQL that contains multiple count distinct
expressions: the EXPAND operator generates a huge amount of data, leading to
running out of disk space during shuffle, especially when combined with
GROUPING SETS (which can generate another EXPAND operator and multiply the
expansion further; shuffle write size exceeded 100 TB in some of our cases).
This PR contains two optimizations to reduce the data expansion:
1. Merge columns of the same data type into one column.
2. Resolve conditions such as `case when` or `filter` and merge them into a
Long-typed bit vector column; more than 64 conditions will create another one.
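To see where the blowup comes from, here is a rough back-of-the-envelope model (a simplified sketch with hypothetical names, not taken from the Spark planner): with `d` distinct-aggregate groups the first Expand multiplies rows by roughly `d + 1` (one projection per distinct group plus one for the non-distinct aggregates), and a GROUPING SETS clause with `g` sets contributes another factor of `g`.

```python
# Rough model of the row multiplication caused by stacked Expand operators.
# The numbers are illustrative; the real plan depends on the query shape.

def expand_factor(num_distinct_groups: int, num_grouping_sets: int = 1) -> int:
    """Rows emitted per input row: one projection per distinct group,
    plus one for the non-distinct aggregates, times the grouping sets."""
    return (num_distinct_groups + 1) * num_grouping_sets

# 5 count-distinct expressions over 8 grouping sets
# -> every input row becomes 48 rows to shuffle
print(expand_factor(5, 8))
```

This is why the shuffle write size can grow by orders of magnitude even for a modest input table.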
Two cases help illustrate the optimizations by comparing the projections of the
Expand operator:
Op1: Merge Columns. Columns c1 and c2 have the same type, for example String:
```sql
select
dim
,sum(c1) as m1
,count(distinct c1) as m2
,count(distinct c2) as m3
from table
group by dim
```

PS: Merging columns reduces the overhead of `null` values; in our tests it
reduced data size by 5% - 10%.
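A toy sketch of what the column merge does to the Expand projections for this query (the column layout and `gid` values here are hypothetical, for illustration only): since the two distinct groups are already distinguished by `gid`, same-typed distinct columns can share one output slot, dropping one column of null padding per merged pair.

```python
# Toy illustration of Expand projections for
#   sum(c1), count(distinct c1), count(distinct c2)
# Layout: (dim, gid, <column slots>); names/gids are hypothetical.

def expand_before(dim, c1, c2):
    # one column per distinct expression -> heavy null padding
    return [
        (dim, 0, c1, None, None),   # gid=0: regular aggregate sum(c1)
        (dim, 1, None, c1, None),   # gid=1: distinct group for c1
        (dim, 2, None, None, c2),   # gid=2: distinct group for c2
    ]

def expand_after(dim, c1, c2):
    # c1 and c2 have the same type, so both distinct groups share
    # one column; gid alone tells them apart
    return [
        (dim, 0, c1, None),
        (dim, 1, None, c1),
        (dim, 2, None, c2),
    ]

row = ("d", "a", "b")
print(len(expand_before(*row)[0]), len(expand_after(*row)[0]))  # 5 4
```

Fewer columns means fewer serialized nulls per expanded row, which is where the 5% - 10% saving comes from.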
Op2: BitVector
```sql
select
dim
,sum(c1) as m1
,count(distinct case when c1 > 1 then c2 end) as m2
,count(distinct case when c1 > 2 then c2 end) as m3
from table
group by dim
```

PS: This optimization reduces both columns and rows. In addition, d_value
and c2_value can be projected to null when the bit vector equals 0. This
optimization usually reduces data size by more than 50% in our tests.
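A toy sketch of the bit-vector idea for this query (names and layout are hypothetical, not the actual Spark plan): both distinct aggregates are over `c2` and differ only in their filter, so a single Expand row can carry `c2` plus a Long whose bit `i` records whether condition `i` held for this input row, instead of one row (and one `case when` column) per condition.

```python
# Toy sketch of the bit-vector rewrite for
#   count(distinct case when c1 > 1 then c2) as m2
#   count(distinct case when c1 > 2 then c2) as m3
# One expand row replaces one row per condition; layout is hypothetical.

def conditions(c1):
    return [c1 > 1, c1 > 2]          # the filters of m2 and m3

def pack_bits(flags):
    bits = 0
    for i, f in enumerate(flags):
        if f:
            bits |= 1 << i           # >64 conditions would need another Long
    return bits

def expand_with_bitvector(dim, c1, c2):
    bits = pack_bits(conditions(c1))
    # when no condition holds, c2 can be projected to null as well
    return (dim, 1, c2 if bits else None, bits)

print(expand_with_bitvector("d", 3, "x"))   # both conditions hold, bits=0b11
print(expand_with_bitvector("d", 0, "x"))   # neither holds, c2 dropped
```

The aggregate side then counts distinct `c2` values restricted to rows where the corresponding bit is set.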
We have tested some typical Spark jobs from our production environment that
contain multiple `count distinct` expressions. Job stats are below:
Case 1: A simple case exercising only column merging
Before the PR:

After the PR:

Case 2: A slightly more complex SQL which contains more dimensions and more
count distinct metrics:
Before the PR:

After the PR:

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]