[GitHub] spark pull request #15135: [pyspark][group]pyspark GroupedData can't apply a...

citoubest Sat, 17 Sep 2016 23:02:01 -0700

GitHub user citoubest opened a pull request:

    https://github.com/apache/spark/pull/15135


    [pyspark][group]pyspark GroupedData can't apply agg functions on all left 
numeric columns.

    ## What changes were proposed in this pull request?
    With a PySpark DataFrame, the agg method supports only two forms: one 
takes a dict mapping column names to aggregate function names, and the other 
takes aggregate expressions built from pyspark.sql.functions and applied to 
specific column names. Both forms require naming every column explicitly. So 
if I want to apply several aggregate functions to all the remaining numeric 
columns, I have to list every function-column combination. For example, 
suppose the df has three columns, province, age, and income, and I want to 
group by province and calculate the min, max, and average of age and income. 
Before this change I have two approaches: 
df.groupby('province').agg({'age':'max','age':'min','age':'avg','income':'max','income':'min','income':'avg'})
    
df.groupby('province').agg(F.min('age'),F.max('age'),F.avg('age'),F.min('income'),F.max('income'),F.avg('income')),
 which are both redundant.
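
One caveat about the dict form quoted above: a Python dict literal keeps only the last value given for a repeated key, so the mapping that reaches agg holds a single function per column. The sketch below shows this in plain Python, with no Spark required; the variable name spec is only illustrative.

```python
# A dict literal with repeated keys silently keeps the last value per key.
spec = {'age': 'max', 'age': 'min', 'age': 'avg',
        'income': 'max', 'income': 'min', 'income': 'avg'}

print(spec)  # {'age': 'avg', 'income': 'avg'}
```

So the dict form, as written, would request only the average of each column rather than all three statistics.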
    
    With this change, we can simply replace that code with 
df.groupby('province').agg('max','min','avg')
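
For comparison, the same result can probably be reached without the patch by generating the expressions instead of typing them out. This is a minimal sketch, not the implementation in this pull request: the helper names agg_pairs and build_exprs are hypothetical, and the alias format "min_age" is an arbitrary choice. The functions module is passed in as F so the pairing logic can be checked without a Spark installation.

```python
from itertools import product


def agg_pairs(columns, funcs):
    """Return every (function name, column name) combination, grouped by column."""
    return [(f, c) for c, f in product(columns, funcs)]


def build_exprs(F, columns, funcs):
    """Build one aliased aggregate expression per pair.

    F is expected to be pyspark.sql.functions (or any object exposing
    callables with the given function names).
    """
    return [getattr(F, f)(c).alias(f"{f}_{c}") for f, c in agg_pairs(columns, funcs)]


pairs = agg_pairs(["age", "income"], ["min", "max", "avg"])
print(pairs)

# Usage with PySpark, assuming an existing DataFrame `df`:
# from pyspark.sql import functions as F
# df.groupby("province").agg(*build_exprs(F, ["age", "income"], ["min", "max", "avg"]))
```

Unlike the proposed agg('max','min','avg'), this sketch still needs the list of numeric columns spelled out or derived from the DataFrame schema by the caller.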
    
    ## How was this patch tested?
    manual tests


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/citoubest/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15135
    
----
commit 67e75a201bd85058e2b5a843b061e00cb3856087
Author: citoubest <[email protected]>
Date:   2016-09-18T05:37:40Z

    pyspark dataframe agg not support multiple functions for all columns, 
change group.agg to support df.groupby(name).agg(max,min) like pandas

commit 7407bc84b0376c78a7816de349a607cecac1d6f4
Author: citoubest <[email protected]>
Date:   2016-09-18T05:39:25Z

    add comment for last change

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
