[GitHub] [incubator-doris] HappenLee opened a new issue #3438: [Proposal] Vectorization query optimization for Doris

GitBox Wed, 29 Apr 2020 09:07:22 -0700


HappenLee opened a new issue #3438:
URL: https://github.com/apache/incubator-doris/issues/3438



   #### Motivation
   At present, the underlying storage in Doris is columnar, but data must go 
through a row/column conversion before it is handed to the query layer for 
execution. This implementation may cause performance problems:
   
   * 1. Row/column conversion overhead.
   * 2. CPU performance cannot improve further without vectorized execution.
   
   So we want to move the query layer of Doris to vectorized execution, so 
that data is not only stored by columns but also processed by vectors (parts of 
columns). This allows high CPU efficiency and can benefit query performance.
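
To make "processed by vectors (parts of columns)" concrete, here is a minimal sketch in Python. It only illustrates the data layout and control flow, not the CPU-level gains, and it is not Doris code: the `Row` type, the function names, and the sample values are all hypothetical.

```python
from collections import namedtuple

# Hypothetical two-column slice of the customer table (illustration only).
Row = namedtuple("Row", ["c_phone", "c_mktsegment"])


def max_phone_by_segment_rows(rows):
    """Row-at-a-time aggregation: each iteration touches a whole row."""
    result = {}
    for row in rows:
        current = result.get(row.c_mktsegment)
        if current is None or row.c_phone > current:
            result[row.c_mktsegment] = row.c_phone
    return result


def rows_to_columns(rows):
    """Row-to-column conversion: one list per column."""
    return {
        "c_phone": [r.c_phone for r in rows],
        "c_mktsegment": [r.c_mktsegment for r in rows],
    }


def max_phone_by_segment_columns(columns, batch_size=1024):
    """Column-at-a-time aggregation over fixed-size vectors."""
    phones = columns["c_phone"]
    segments = columns["c_mktsegment"]
    result = {}
    for start in range(0, len(phones), batch_size):
        # A "vector" here is a contiguous part of each column.
        phone_vec = phones[start:start + batch_size]
        segment_vec = segments[start:start + batch_size]
        for segment, phone in zip(segment_vec, phone_vec):
            current = result.get(segment)
            if current is None or phone > current:
                result[segment] = phone
    return result


rows = [
    Row("10-100-0001", "BUILDING"),
    Row("25-200-0002", "MACHINERY"),
    Row("31-300-0003", "BUILDING"),
]
by_rows = max_phone_by_segment_rows(rows)
by_columns = max_phone_by_segment_columns(rows_to_columns(rows), batch_size=2)
```

Both paths return the same result. The difference is that the columnar path only reads the two columns it needs, in contiguous batches, which is the property a vectorized engine relies on.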
   
   Here I implemented a simple POC to verify whether there is a performance 
improvement.
   
   ###### Test environment:
   * **Data set**
   ```Star Schema Benchmark```
       
   *  **Data generation**
     ```
      git clone git@github.com:vadimtk/ssb-dbgen.git
      cd ssb-dbgen
      make
    ```
   Download the **SSBM** code from GitHub and compile it. After it compiles 
successfully, run the following command to generate 30 million rows of customer 
data:
   ```
    ./dbgen -s 1000 -T c
   ```
   *  **Build Table and Data Import**
   Use the following statement to create a test table, then import the data 
file **customer.tbl** into Doris. The data size is about 3.2 GB.
   ```
   CREATE TABLE `customer` (
     `C_CUSTKEY` int(11) NULL COMMENT "",
     `C_NAME` varchar(255) NOT NULL COMMENT "",
     `C_ADDRESS` varchar(255) NOT NULL COMMENT "",
     `C_CITY` varchar(255) NOT NULL COMMENT "",
     `C_NATION` varchar(255) NOT NULL COMMENT "",
     `C_REGION` varchar(255) NOT NULL COMMENT "",
     `C_PHONE` varchar(255) NOT NULL COMMENT "",
     `C_MKTSEGMENT` varchar(255) NOT NULL COMMENT ""
   ) ENGINE=OLAP
   DUPLICATE KEY(`C_CUSTKEY`, `C_NAME`, `C_ADDRESS`)
   COMMENT "OLAP"
   DISTRIBUTED BY HASH(`C_CUSTKEY`) BUCKETS 10
   PROPERTIES (
   "storage_type" = "COLUMN"
   ); 
   ```
   *  **Environment**
   ```
   GNU/Linux CentOS 6.3 (Final) build 2.6.32_1-19-0-0
   
   Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    2 physical CPU package(s)
    24 physical CPU core(s)
    48 logical CPU(s)
   Identifier: Intel64 Family 6 Model 79 Stepping 1
   ProcessorID: F1 06 04 00 FF FB EB BF
   Context Switches/Interrupts: 12174692729137 / 297015608902
   
   
   Memory: 119.5 GiB/125.9 GiB
   ```
   
   A single FE and a single BE run on the same server.
   
   ###### Test:
   
     * Modify the logic of Doris' query layer so that aggregation 
calculations run in a vectorized way over columnar data. Record the time spent 
converting rows to columns:
   
   
![Add a timing counter in NewPartitionedAggregationNode and print it in the destructor](https://upload-images.jianshu.io/upload_images/8552201-87de31d5ffa0a4d7.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
      
   Calculate the time lost to the row-to-column conversion:
   
![](https://upload-images.jianshu.io/upload_images/8552201-0853796248b9ada4.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
   
   * **Results**
    ```select max(C_PHONE) from customer group by C_MKTSEGMENT;```
   
     Statistic | Origin | Convert to column | Origin (multi-thread) | Convert to column (multi-thread)
     :-:|:-:|:-:|:-:|:-:
     Time | 4.19 Sec | 4.57 Sec - 2.17 Sec (Convert Time) | 0.67 Sec | 0.69 Sec
     Context-Switches | 31,737 | 32,468 | 40,463 | 30,699
     Migrations | 506 | 662 | 4,920 | 3,265
     Instructions | 48,890,013,173 | 47,963,367,976 | 49,111,783,565 | 48,113,904,685
     IPC | 1.57 | 1.42 | 1.40 | 1.37
     Branches | 9,201,175,036 | 9,124,545,231 | 9,248,803,634 | 9,154,186,301
     Branches-Miss % | 0.90% | 1.02% | 0.91% | 1.02%
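
The single-thread "Time" cells can be worked through as plain arithmetic. The sketch below assumes that the "Convert to column" cell means total time minus conversion time; the issue does not state this explicitly, so treat the derived numbers as an interpretation rather than a measured result.

```python
# Timings in seconds, copied from the single-thread columns of the table above.
origin_time = 4.19        # original row-based execution
converted_total = 4.57    # vectorized run, including row-to-column conversion
convert_time = 2.17       # time spent only on row-to-column conversion

# Assumed reading: aggregation time once the conversion cost is excluded.
net_vectorized = converted_total - convert_time

# Fraction of the vectorized run spent converting rows to columns.
conversion_share = convert_time / converted_total

# Ratio of the original time to the conversion-free vectorized time.
speedup_without_conversion = origin_time / net_vectorized

print(f"net vectorized time: {net_vectorized:.2f} s")
print(f"conversion share:    {conversion_share:.1%}")
print(f"potential speedup:   {speedup_without_conversion:.2f}x")
```

Under that reading, the conversion takes nearly half of the vectorized run, so the gain depends on the scan node producing columnar batches directly and skipping the conversion.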
   
   #### Implementation
   Doris already has a ```VectorizedRowBatch``` implementation, so we can 
optimize the exec nodes one at a time.
   
   1. Start from ```olap_scan_node```: run vectorized queries against it and 
observe whether the expected performance improvement appears.
   2. Each ```exec_node``` needs a method that converts a 
```VectorizedRowBatch``` to a ```RowBatch```, retaining compatibility with the 
original execution logic.
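
The compatibility idea in step 2 can be sketched as a small adapter. This is an illustration in Python, not the Doris API: `ColumnBatch`, `to_rows`, and `legacy_count_by_segment` are hypothetical names that stand in for a vectorized batch, its conversion method, and an operator that still expects rows.

```python
class ColumnBatch:
    """Hypothetical stand-in for a vectorized batch: one list per column."""

    def __init__(self, column_names, columns):
        if len(column_names) != len(columns):
            raise ValueError("one name is required per column")
        if len({len(col) for col in columns}) > 1:
            raise ValueError("all columns must have the same length")
        self.column_names = list(column_names)
        self.columns = [list(col) for col in columns]

    def num_rows(self):
        return len(self.columns[0]) if self.columns else 0

    def to_rows(self):
        """Transpose the columns into row tuples for row-based operators."""
        return list(zip(*self.columns))


def legacy_count_by_segment(rows, segment_index):
    """Hypothetical row-based operator that has not been vectorized yet."""
    counts = {}
    for row in rows:
        key = row[segment_index]
        counts[key] = counts.get(key, 0) + 1
    return counts


batch = ColumnBatch(
    ["C_CUSTKEY", "C_MKTSEGMENT"],
    [[1, 2, 3], ["BUILDING", "MACHINERY", "BUILDING"]],
)
converted_rows = batch.to_rows()
segment_counts = legacy_count_by_segment(converted_rows, segment_index=1)
```

With an adapter like this, a vectorized scan can feed nodes that have not been migrated yet, so each node can be converted and benchmarked independently.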
   
   
   
     
    
   


