GitHub user ueshin opened a pull request:
https://github.com/apache/spark/pull/18989
[SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type.
## What changes were proposed in this pull request?
As mentioned at
https://github.com/apache/spark/pull/18680#issuecomment-316820409, once we have
more `ColumnVector` implementations, there may (or may not) be a significant
performance penalty, because additional implementations can prevent the JIT from
inlining the accessors or can force virtual dispatch at the call sites.
On the read path, one of the major code paths is the one generated by
`ColumnBatchScan`. Currently the generated code refers to the abstract
`ColumnVector` type, so the penalty grows as more implementations are loaded.
However, the concrete type is often known from the usage site, e.g. the
vectorized Parquet reader uses `OnHeapColumnVector`. We can therefore reference
the concrete type directly in the generated code to avoid the penalty.
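To illustrate the idea, here is a minimal sketch of the difference between an
abstract-typed and a concrete-typed call site. This uses simplified stand-in
classes, not Spark's actual `ColumnVector` API or the code that
`ColumnBatchScan` generates; `ScanSketch`, `sumAbstract`, and `sumConcrete` are
hypothetical names for illustration only.

```scala
// Simplified stand-ins for the real column vector classes.
abstract class ColumnVector {
  def getInt(rowId: Int): Int
}

final class OnHeapColumnVector(data: Array[Int]) extends ColumnVector {
  def getInt(rowId: Int): Int = data(rowId)
}

final class OffHeapColumnVector(data: Array[Int]) extends ColumnVector {
  def getInt(rowId: Int): Int = data(rowId)
}

object ScanSketch {
  // Abstract-typed parameter: once several implementations are loaded,
  // the getInt call site can become polymorphic and harder to inline.
  def sumAbstract(col: ColumnVector, numRows: Int): Long = {
    var i = 0
    var sum = 0L
    while (i < numRows) {
      sum += col.getInt(i)
      i += 1
    }
    sum
  }

  // Concrete-typed parameter: the call site stays monomorphic, so the JIT
  // can devirtualize and inline getInt regardless of how many other
  // ColumnVector subclasses exist.
  def sumConcrete(col: OnHeapColumnVector, numRows: Int): Long = {
    var i = 0
    var sum = 0L
    while (i < numRows) {
      sum += col.getInt(i)
      i += 1
    }
    sum
  }
}
```

With only one implementation loaded the JIT can usually devirtualize both
versions; the concern addressed by this PR is the case where multiple
implementations exist and the scan loop's call site becomes polymorphic.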
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ueshin/apache-spark issues/SPARK-21781
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18989.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18989
----
commit 6f19db753a4cba36f92bd225eb63b60080158054
Author: Takuya UESHIN <[email protected]>
Date: 2017-08-18T04:53:00Z
Modify DataSourceScanExec to use concrete ColumnVector type.
----