GitHub user kiszk opened a pull request:
https://github.com/apache/spark/pull/11636
[SPARK-13805][SQL] Generate code that gets a value in each column from ColumnVector when ColumnarBatch is used
## What changes were proposed in this pull request?
This PR generates code that gets a value of each column directly from a
```ColumnVector``` instead of creating an ```InternalRow``` when a
```ColumnarBatch``` is accessed. It improves the benchmark program below by up
to 15%.
This PR consists of two parts (see the sketch after this list):
1. Get a ```ColumnVector``` by using the ```ColumnarBatch.column()``` method
2. Get a value of each column by using
```rdd_col${COLIDX}.getInt(ROWIDX)``` instead of ```rdd_row.getInt(COLIDX)```
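A minimal hand-written sketch of the two access patterns, using the
```ColumnarBatch```/```ColumnVector``` methods that appear in the generated
code quoted below (```numRows```, ```getRow```, ```column```, ```getIsNull```,
```getInt```); the two helper functions themselves are hypothetical:
````
import org.apache.spark.sql.execution.vectorized.ColumnarBatch

// Row-based access (original): an InternalRow is materialized for every row.
def sumRowWise(batch: ColumnarBatch): Long = {
  var sum = 0L
  var i = 0
  while (i < batch.numRows()) {
    val row = batch.getRow(i)
    if (!row.isNullAt(0)) sum += row.getInt(0)
    i += 1
  }
  sum
}

// Column-based access (this PR): values are read straight from the
// ColumnVector by row index; no per-row InternalRow is created.
// (The vector is hoisted out of the loop here for clarity.)
def sumColumnWise(batch: ColumnarBatch): Long = {
  val col0 = batch.column(0)
  var sum = 0L
  var i = 0
  while (i < batch.numRows()) {
    if (!col0.getIsNull(i)) sum += col0.getInt(i)
    i += 1
  }
  sum
}
````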
Here is a motivating example:
````
sqlContext.conf.setConfString(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
sqlContext.conf.setConfString(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
val values = 10
withTempPath { dir =>
  withTempTable("t1", "tempTable") {
    sqlContext.range(values).registerTempTable("t1")
    sqlContext.sql("select id % 2 as p, cast(id as INT) as id from t1")
      .write.partitionBy("p").parquet(dir.getCanonicalPath)
    sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable")
    sqlContext.sql("select sum(p) from tempTable").collect
  }
}
````
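The two ```SQLConf``` flags ensure that both the Parquet vectorized reader and
whole-stage code generation are active, so the scan produces
```ColumnarBatch```es that the generated loop shown next consumes.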
The original code:
````
...
/* 072 */     while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */       InternalRow rdd_row = rdd_batch.getRow(rdd_batchIdx++);
/* 074 */       /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */       /* input[0, int] */
/* 076 */       boolean rdd_isNull = rdd_row.isNullAt(0);
/* 077 */       int rdd_value = rdd_isNull ? -1 : (rdd_row.getInt(0));
...
````
The code generated by this PR:
````
/* 072 */     while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */       org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col0 = rdd_batch.column(0);
/* 074 */       /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */       /* input[0, int] */
/* 076 */       boolean rdd_isNull = rdd_col0.getIsNull(rdd_batchIdx);
/* 077 */       int rdd_value = rdd_isNull ? -1 : (rdd_col0.getInt(rdd_batchIdx));
...
/* 128 */       rdd_batchIdx++;
/* 129 */     }
/* 130 */     if (shouldStop()) return;
````
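The ```rdd_col${COLIDX}``` naming above hints at how the generator emits this
pattern. A hypothetical, much-simplified sketch of such string-template code
generation (the real change lives in Spark's whole-stage codegen;
```genColumnAccess``` and its parameters are made up for illustration):
````
// Hypothetical helper: emit Java source that reads column `colIdx` of the
// current batch at row `rdd_batchIdx`, mirroring the generated code above.
def genColumnAccess(colIdx: Int, javaType: String, getter: String): String = {
  val col = s"rdd_col$colIdx"
  s"""
     |org.apache.spark.sql.execution.vectorized.ColumnVector $col = rdd_batch.column($colIdx);
     |boolean ${col}_isNull = $col.getIsNull(rdd_batchIdx);
     |$javaType ${col}_value = ${col}_isNull ? -1 : ($col.$getter(rdd_batchIdx));
   """.stripMargin.trim
}

// Example: genColumnAccess(0, "int", "getInt") yields roughly the three
// statements at /* 073 */, /* 076 */, and /* 077 */ above.
````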
Performance
Without this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Partitioned Table:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Read data column                          434 /  488         36.3          27.6       1.0X
Read partition column                     302 /  346         52.1          19.2       1.4X
Read both columns                         588 /  643         26.8          37.4       0.7X
````
With this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Partitioned Table:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
Read data column                          392 /  516         40.1          24.9       1.0X
Read partition column                     256 /  318         61.4          16.3       1.5X
Read both columns                         523 /  539         30.1          33.3       0.7X
````
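The largest gain is on the partition-column read: (302 - 256) / 302 ≈ 15%,
which is the "up to 15%" improvement cited above; the data-column and
both-column reads improve by roughly 10-11%.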
## How was this patch tested?
Tested with the existing test suites and the benchmark above.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kiszk/spark SPARK-13805
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11636.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11636
----
commit 0db679f9bdc410cd7483faa366557cc5e5925889
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-03-10T19:15:05Z
Generate code that get a value in each column from ColumnVector when
ColumnarVector is used
----