GitHub user heary-cao opened a pull request:
https://github.com/apache/spark/pull/18969
[SPARK-21520][SQL][FOLLOW-UP] Fix a special case for non-deterministic
projects in optimizer
## What changes were proposed in this pull request?
This is a follow-up to #18892 that adds another fix.
The optimizer already has a lot of special handling for non-deterministic
projects and filters, but it is not sufficient. This patch adds a new special
case for non-deterministic projects: the optimizer should read only the fields
the user actually needs when a project is non-deterministic.
For example, when the project list contains a non-deterministic function
(rand), the optimizer generates the following executedPlan:
```
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as
bigint))], output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) *
10000.0)) AS k#403L]
+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610,
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617,
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625,
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation
XXX_database, XXX_table
```
HiveTableScan reads all the fields from the table, but we only need
"d004". This hurts the performance of the task.
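The intended pruning can be illustrated with a toy model. This is only a sketch and not Spark's Catalyst code: the `Expr` class and the helpers `attr`, `rand`, `call`, `references`, `is_deterministic` and `prune_scan` are hypothetical names made up for this example, and the column list is shortened. It shows that the set of columns a project needs depends only on the attributes it references, even when one of its expressions is non-deterministic.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expr:
    """Toy expression node (hypothetical, not a Catalyst class)."""
    name: str
    children: Tuple["Expr", ...] = ()
    is_attribute: bool = False
    deterministic: bool = True


def attr(name):
    # A reference to a column of the underlying table.
    return Expr(name, is_attribute=True)


def rand(seed):
    # A non-deterministic leaf that references no column.
    return Expr(f"rand({seed})", deterministic=False)


def call(op, *children):
    # Any other operator (alias, floor, multiply, literal, ...).
    return Expr(op, tuple(children))


def references(expr):
    """Collect the names of all columns an expression reads."""
    if expr.is_attribute:
        return {expr.name}
    refs = set()
    for child in expr.children:
        refs |= references(child)
    return refs


def is_deterministic(expr):
    """An expression is deterministic only if its whole subtree is."""
    return expr.deterministic and all(is_deterministic(c) for c in expr.children)


def prune_scan(scan_columns, project_list):
    """Keep only the scan columns that the project list references."""
    needed = set()
    for expr in project_list:
        needed |= references(expr)
    return [c for c in scan_columns if c in needed]


# Shortened stand-in for the table's column list in the plan above.
scan_columns = ["c030", "d004", "d005", "d025", "c002"]

# Mirrors: Project [d004 AS id, FLOOR(rand(...) * 10000.0) AS k]
project_list = [
    call("alias:id", attr("d004")),
    call("alias:k",
         call("floor",
              call("multiply", rand(8828525941469309371), call("literal:10000.0")))),
]

pruned = prune_scan(scan_columns, project_list)
print(pruned)
```

In this model the second project expression is non-deterministic, yet it reads no column, so the scan only needs "d004".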
## How was this patch tested?
Should be covered by existing test cases, plus newly added test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/heary-cao/spark followup-non-deterministic
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18969.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18969
----
commit e84425f16c868844f442ff5b7cd8aa7695a94038
Author: caoxuewen <[email protected]>
Date: 2017-08-17T07:12:43Z
fix a special case for non-deterministic projects in optimizer
----