GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/10809

    [SPARK-12879][SQL] improve the unsafe row writing framework

    As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should document it better and make it easier to use.
    
    This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, instead of always re-pointing the row to the buffer at the end, we only need to update the row size; and if all fields are of primitive type, we can even skip the row size update. With this abstraction, the technique can be applied in more places easily.
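    
    The snippet below is a minimal, self-contained sketch of that idea. It does not use Spark's actual `BufferHolder`/`UnsafeRowWriter` API; the `SimpleRowWriter` class and its methods are made up for illustration. The point is that the backing buffer is reused across records, so per-record work is limited to resetting the cursor and, only when variable-length data is written, refreshing the row size.
    ```java
    import java.util.Arrays;
    
    // Hypothetical, simplified stand-in for BufferHolder + UnsafeRowWriter:
    // a null-bits word followed by N 8-byte field slots, plus a variable-length
    // region that grows at the end of the same buffer.
    public class RowWriterSketch {
    
      static final class SimpleRowWriter {
        private byte[] buffer;        // backing buffer, reused across records
        private int cursor;           // next free byte for variable-length data
        private final int fixedSize;  // 8 bytes of null bits + 8 bytes per field
    
        SimpleRowWriter(int numFields, int initialCapacity) {
          this.fixedSize = 8 + numFields * 8;
          this.buffer = new byte[Math.max(initialCapacity, fixedSize)];
        }
    
        // Called once per record; cheap because the buffer is not reallocated.
        void reset() {
          cursor = fixedSize;
          Arrays.fill(buffer, 0, 8, (byte) 0);  // clear only the null bits
        }
    
        void writeLong(int ordinal, long value) {
          int offset = 8 + ordinal * 8;
          for (int i = 0; i < 8; i++) {
            buffer[offset + i] = (byte) (value >>> (i * 8));
          }
        }
    
        void writeBytes(int ordinal, byte[] bytes) {
          grow(bytes.length);
          System.arraycopy(bytes, 0, buffer, cursor, bytes.length);
          // store (offset, length) in the fixed-width slot, like UnsafeRow does
          writeLong(ordinal, ((long) cursor << 32) | bytes.length);
          cursor += bytes.length;
        }
    
        private void grow(int needed) {
          if (cursor + needed > buffer.length) {
            // only after a reallocation would the row have to be re-pointed
            buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, cursor + needed));
          }
        }
    
        int totalSize() { return cursor; }
      }
    
      public static void main(String[] args) {
        SimpleRowWriter writer = new SimpleRowWriter(2, 64);
    
        // All-primitive schema: reset + write per record, no size update needed,
        // because totalSize() never moves away from fixedSize.
        writer.reset();
        writer.writeLong(0, 42L);
        writer.writeLong(1, 7L);
    
        // With variable-length data, the size must be refreshed after each record.
        writer.reset();
        writer.writeLong(0, 42L);
        writer.writeBytes(1, "hello".getBytes());
        System.out.println("row size = " + writer.totalSize());
      }
    }
    ```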
    
    A local benchmark shows that `UnsafeProjection` is up to 1.7x faster after this PR:
    **old version**
    ```
    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -------------------------------------------------------------------------------
    single long                             2616.04           102.61         1.00 X
    single nullable long                    3032.54            88.52         0.86 X
    primitive types                         9121.05            29.43         0.29 X
    nullable primitive types               12410.60            21.63         0.21 X
    ```
    
    **new version**
    ```
    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -------------------------------------------------------------------------------
    single long                             1533.34           175.07         1.00 X
    single nullable long                    2306.73           116.37         0.66 X
    primitive types                         8403.93            31.94         0.18 X
    nullable primitive types               12448.39            21.56         0.12 X
    ```
    
    For a single non-nullable int (the best case) we get about a 1.7x speedup, and even for the nullable case we still get about 1.3x. For the other cases the gain is smaller, because the saved operations account for only a small fraction of the total work. The benchmark code is included in this PR.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark unsafe-projection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10809.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10809
    
----
commit 397871117179c31c5a634c96165e8cf934316291
Author: Wenchen Fan <[email protected]>
Date:   2016-01-17T23:16:06Z

    improve the unsafe row writing framework

----


