GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/10809
[SPARK-12879][SQL] improve the unsafe row writing framework
As the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`)
is used in more and more places (`UnsafeProjection`,
`UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should
document it better and make it easier to use.
This PR abstracts the technique used in `UnsafeRowParquetRecordReader`:
avoid unnecessary operations as much as possible. For example, instead of always
pointing the row to the buffer at the end, we only need to update the size of the row.
If all fields are of primitive type, we can even skip the row size update.
We can then apply this technique to more places easily.
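To illustrate the idea for the all-primitive case, here is a minimal toy model. It is not Spark code: `Row`, `write_rows_old` and `write_rows_new` are hypothetical names, every field is assumed to be an 8-byte long, and the null bitset and buffer growth are left out. It only shows that a fixed-width row can be pointed at the buffer once instead of after every write.

```python
import struct


class Row:
    """Toy stand-in for a row view over a byte buffer (not Spark's UnsafeRow)."""

    def __init__(self):
        self.buffer = None
        self.size = 0
        self.point_calls = 0  # counts how often the row is re-pointed

    def point_to(self, buffer, size):
        self.buffer = buffer
        self.size = size
        self.point_calls += 1

    def get_long(self, i):
        # Each field occupies one 8-byte little-endian slot.
        return struct.unpack_from("<q", self.buffer, 8 * i)[0]


def write_rows_old(rows, num_fields):
    """Old style: point the row to the buffer after every row is written."""
    row = Row()
    buffer = bytearray(8 * num_fields)
    first_fields = []
    for values in rows:
        for i, v in enumerate(values):
            struct.pack_into("<q", buffer, 8 * i, v)
        row.point_to(buffer, len(buffer))
        first_fields.append(row.get_long(0))
    return first_fields, row.point_calls


def write_rows_new(rows, num_fields):
    """New style: point once up front; a fixed-width row never changes size."""
    row = Row()
    buffer = bytearray(8 * num_fields)
    row.point_to(buffer, len(buffer))
    first_fields = []
    for values in rows:
        for i, v in enumerate(values):
            struct.pack_into("<q", buffer, 8 * i, v)
        first_fields.append(row.get_long(0))
    return first_fields, row.point_calls
```

Both functions read back the same values, but the second one touches the row's pointer and size only once. With variable-length fields the size would still need updating per row, which this sketch does not model.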
A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this
PR:
**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:          Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                      2616.04           102.61         1.00 X
single nullable long             3032.54            88.52         0.86 X
primitive types                  9121.05            29.43         0.29 X
nullable primitive types        12410.60            21.63         0.21 X
```
**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:          Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                      1533.34           175.07         1.00 X
single nullable long             2306.73           116.37         0.66 X
primitive types                  8403.93            31.94         0.18 X
nullable primitive types        12448.39            21.56         0.12 X
```
For a single non-nullable long (the best case), we get about a 1.7x
speedup. Even when it is nullable, we still get about a 1.3x speedup. For the other
cases the gain is small, because the saved operations account for only a small
proportion of the whole process. The benchmark code is included in this PR.
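The speedup figures can be rechecked from the two tables above by dividing each new average rate by the corresponding old one. The snippet below is only that arithmetic; the dictionary names are made up for illustration and the numbers are copied from the benchmark output.

```python
# Avg Rate (M/s) per benchmark case, copied from the tables above.
old_rate = {
    "single long": 102.61,
    "single nullable long": 88.52,
    "primitive types": 29.43,
    "nullable primitive types": 21.63,
}
new_rate = {
    "single long": 175.07,
    "single nullable long": 116.37,
    "primitive types": 31.94,
    "nullable primitive types": 21.56,
}

# Speedup of the new version over the old one for each case.
speedup = {name: new_rate[name] / old_rate[name] for name in old_rate}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

The first two cases come out at roughly 1.7x and 1.3x, while the last two stay close to 1x.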
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark unsafe-projection
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10809.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10809
----
commit 397871117179c31c5a634c96165e8cf934316291
Author: Wenchen Fan <[email protected]>
Date: 2016-01-17T23:16:06Z
improve the unsafe row writing framework
----