GitHub user heary-cao opened a pull request:
https://github.com/apache/spark/pull/20415
[SPARK-23247][SQL]combines Unsafe operations and statistics operations in
Scan Data Source
## What changes were proposed in this pull request?
Currently, when scanning a data source, the execution plan first applies the unsafe
projection to each row and then traverses the data a second time to count the
rows. This second pass is unnecessary. This PR combines the two operations,
counting rows while performing the unsafe projection.
Before this change:
```
val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map(proj)
}
val numOutputRows = longMetric("numOutputRows")
unsafeRow.map { r =>
  numOutputRows += 1
  r
}
```
After this change:
```
val numOutputRows = longMetric("numOutputRows")
rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map { r =>
    numOutputRows += 1
    proj(r)
  }
}
```
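The fused pattern can be sketched with plain Scala iterators, outside Spark. In this minimal sketch, `project` and the mutable counter are illustrative stand-ins for `UnsafeProjection` and the `numOutputRows` SQL metric; they are not Spark APIs:

```
// Sketch: fuse a per-row transformation with row counting in a single
// pass over an iterator, instead of mapping twice over the data.
object FusedScanSketch {
  def main(args: Array[String]): Unit = {
    val rows = Iterator(1, 2, 3, 4)      // stand-in for the input rows
    var numOutputRows = 0L               // stand-in for longMetric("numOutputRows")
    val project = (r: Int) => r * 10     // stand-in for proj(r)

    // Iterator.map is lazy: the counter is bumped as each row is
    // consumed, mirroring how Spark's per-partition iterators work.
    val out = rows.map { r =>
      numOutputRows += 1
      project(r)
    }.toList

    println(out)            // List(10, 20, 30, 40)
    println(numOutputRows)  // 4
  }
}
```

Because the counting happens inside the same `map` as the projection, the data is traversed once rather than twice, which is the performance argument the PR makes.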
## How was this patch tested?
Existing test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/heary-cao/spark DataSourceScanExec
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20415.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20415
----
commit e3e09d98072bd39328a4e7d4de1ddd38594c6232
Author: caoxuewen <cao.xuewen@...>
Date: 2018-01-27T06:27:37Z
combines Unsafe operations and statistics operations in Scan Data Source
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]