GitHub user heary-cao reopened a pull request:
https://github.com/apache/spark/pull/18918
[SPARK-21707][SQL]Improvement a special case for non-deterministic filters
in optimizer
## What changes were proposed in this pull request?
Currently, the optimizer does a lot of special handling for non-deterministic
projects and filters, but it is not good enough. This patch adds a new special
case for non-deterministic filters, so that only the fields the user actually
needs are read when a filter is non-deterministic.
For example, when the filter condition is non-deterministic (e.g. it contains
a non-deterministic function such as rand), the HiveTableScans strategy
generates:
```
HiveTableScans plan:Aggregate [k#2L], [k#2L, k#2L, sum(cast(id#1 as bigint)) AS sum(id)#395L]
+- Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
   +- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
      +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
+- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
   +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
+- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:MetastoreRelation XXX_database, XXX_table

HiveTableScans result plan:HiveTableScan [c030#204L, d004#205, d005#206, d025#207, c002#208, d023#209, d024#210, c005#211L, c008#212, c009#213, c010#214, d021#215, d022#216, c017#217, c018#218, c019#219, c020#220, c021#221, c022#222, c023#223, c024#224, c025#225, c026#226, c027#227, ... 169 more fields], MetastoreRelation XXX_database, XXX_table
```
So HiveTableScan reads all the fields from the table, even though only
'd004' and 'c010' are needed. This hurts the performance of the task.
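
To make the effect concrete, here is a minimal toy model in Python of column pruning through a filter. It is only a sketch and is not Spark's optimizer code: the classes `Relation`, `Filter`, `Project`, the function `scan_columns`, and the flag `prune_nondeterministic` are all hypothetical names invented for illustration, and the four-column table is a shortened stand-in for the real one. With the flag off, the model mimics the behavior described above (a non-deterministic filter stops pruning, so the scan reads every column); with the flag on, it mimics what this patch aims for.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Relation:
    """Leaf node: a table with an ordered list of columns (hypothetical)."""
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Filter:
    """Filter node: records which columns its condition reads (hypothetical)."""
    references: FrozenSet[str]
    deterministic: bool
    child: object


@dataclass(frozen=True)
class Project:
    """Project node: records which columns its expressions read (hypothetical)."""
    references: FrozenSet[str]
    child: object


def scan_columns(plan, required: Optional[FrozenSet[str]] = None,
                 prune_nondeterministic: bool = True) -> Tuple[str, ...]:
    """Return the columns the table scan at the bottom of `plan` must read.

    `required` is the set of columns needed by the operators above;
    None means "everything".
    """
    if isinstance(plan, Relation):
        if required is None:
            return plan.columns
        return tuple(c for c in plan.columns if c in required)
    if isinstance(plan, Project):
        # A projection only needs the columns its expressions reference.
        return scan_columns(plan.child, plan.references, prune_nondeterministic)
    if isinstance(plan, Filter):
        if not plan.deterministic and not prune_nondeterministic:
            # Models the behavior described above: pruning gives up here.
            return scan_columns(plan.child, None, prune_nondeterministic)
        # Otherwise keep the parent's columns plus those the condition reads.
        needed = None if required is None else required | plan.references
        return scan_columns(plan.child, needed, prune_nondeterministic)
    raise TypeError(f"unsupported plan node: {plan!r}")


# Shortened stand-in for the table in the example plan.
TABLE = Relation(("c030", "d004", "d005", "c010"))
PLAN = Project(
    frozenset({"d004", "c010"}),
    Filter(frozenset({"d004"}), deterministic=False, child=TABLE),
)

print(scan_columns(PLAN, prune_nondeterministic=False))
print(scan_columns(PLAN, prune_nondeterministic=True))
```

The first call reports all four columns and the second reports only `d004` and `c010`, which is the difference between the result plan shown above and the intended one.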
## How was this patch tested?
This should be covered by existing test cases, and new test cases are added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/heary-cao/spark filters_non_deterministic
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18918.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18918
----
commit 97a32709f40c573bada4c46df0d00aad14425ee2
Author: caoxuewen <[email protected]>
Date: 2017-08-11T09:56:55Z
Improvement a special case for non-deterministic filters in optimizer
----