GitHub user heary-cao opened a pull request:

    https://github.com/apache/spark/pull/18969

    [SPARK-21520][SQL][FOLLOW-UP] fix a special case for non-deterministic 
projects in optimizer

    ## What changes were proposed in this pull request?
    
    This is a follow-up of #18892 with a further fix.
    The optimizer already applies a lot of special handling to 
non-deterministic projects and filters, but it is not sufficient. This patch 
adds a new special case for non-deterministic projects, so that the optimizer 
reads only the fields the query actually needs when a project is 
non-deterministic.
    For example, when the fields of a project contain a non-deterministic 
function (rand), the optimizer generates the following executedPlan:
    ```
    *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
    +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
10000.0)) AS k#403L]
       +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
    ```
    HiveTableScan reads every field of the table, although only `d004` is 
needed. This hurts task performance.
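
The intended effect can be sketched with a toy plan model. This is only an illustration in plain Python, not Spark's optimizer code: the `Expr`, `Scan`, and `Project` classes and the `prune_scan_columns` function are hypothetical names, and the table is shortened to five of the columns shown in the plan above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Expr:
    name: str                    # output alias of the expression
    references: FrozenSet[str]   # input columns the expression reads
    deterministic: bool = True   # False for expressions such as rand()


@dataclass(frozen=True)
class Scan:
    columns: Tuple[str, ...]     # columns the scan reads from the table


@dataclass(frozen=True)
class Project:
    exprs: Tuple[Expr, ...]
    child: Scan


def prune_scan_columns(project: Project) -> Project:
    """Narrow the child scan to the columns the project list references.

    The project list itself is left untouched, so a non-deterministic
    expression is neither moved nor duplicated by this step.
    """
    needed = frozenset().union(*(e.references for e in project.exprs))
    kept = tuple(c for c in project.child.columns if c in needed)
    return Project(project.exprs, Scan(kept))


# Hypothetical, shortened version of the table from the plan above.
table_columns = ("c030", "d004", "d005", "d025", "c002")
plan = Project(
    exprs=(
        Expr("id", frozenset({"d004"})),
        # Stands in for FLOOR(rand(seed) * 10000.0): reads no input column.
        Expr("k", frozenset(), deterministic=False),
    ),
    child=Scan(table_columns),
)

pruned = prune_scan_columns(plan)
print(pruned.child.columns)
```

In this model the rand-based expression references no input column, so the scan can be narrowed to `d004` while the project list stays exactly as written.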
    
    
    ## How was this patch tested?
    Covered by existing test cases, plus newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/heary-cao/spark followup-non-deterministic

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18969.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18969
    
----
commit e84425f16c868844f442ff5b7cd8aa7695a94038
Author: caoxuewen <[email protected]>
Date:   2017-08-17T07:12:43Z

    fix a special case for non-deterministic projects in optimizer

----

