Github user aray commented on the issue:

    https://github.com/apache/spark/pull/18307
  
    @rxin Yes, it slows things down quite a bit. Informal testing on synthetic 
data with 10M rows and 2 columns puts this implementation at around 10s, versus 
0.5s in 2.2-rc4. I can speed it up somewhat by computing only a single 
`percentile_approx` aggregate per column (passing an array of percentiles and 
unpacking the result afterwards).
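
    As a rough illustration of the single-aggregate idea, the Python sketch 
below only builds Spark SQL expression strings, one `percentile_approx` call 
per column with an `array(...)` of percentiles. The helper name 
`percentile_exprs`, the `_pct` alias suffix, and the column names `a` and `b` 
are hypothetical and not part of this PR.

```python
def percentile_exprs(columns, percentiles=(0.25, 0.5, 0.75)):
    """Build one percentile_approx expression per column (hypothetical helper)."""
    if not percentiles:
        # No percentiles requested, so no aggregate is needed at all.
        return []
    # A single array argument asks for all percentiles in one aggregate.
    pct = ", ".join(str(p) for p in percentiles)
    return [
        f"percentile_approx(`{c}`, array({pct})) AS `{c}_pct`"
        for c in columns
    ]


exprs = percentile_exprs(["a", "b"])
for e in exprs:
    print(e)
```

    Each string could then be passed to `DataFrame.selectExpr`, with the 
resulting array column unpacked by index afterwards.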
    
    To give users an option, we could mirror [pandas 
describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)
 and make the percentiles an optional parameter with default [.25, .5, .75]. 
Anyone who wants faster results could then pass an empty array of percentiles.
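
    To make the proposed parameter concrete, here is a minimal pure-Python 
sketch of its semantics, not the Spark implementation. The function 
`describe_values`, its nearest-rank percentile rule, and the output key names 
are assumptions for illustration, and the input is assumed to be a non-empty 
list of numbers.

```python
import math


def describe_values(values, percentiles=(0.25, 0.5, 0.75)):
    """Summary statistics with optional percentiles (illustrative only)."""
    n = len(values)
    out = {
        "count": n,
        "mean": sum(values) / n,
        "min": min(values),
        "max": max(values),
    }
    if percentiles:
        # Sorting is the costly step; an empty percentile list skips it.
        ordered = sorted(values)
        for p in percentiles:
            # Nearest-rank rule: take the value at 1-based rank ceil(p * n).
            rank = max(1, math.ceil(p * n))
            out[f"{round(p * 100)}%"] = ordered[rank - 1]
    return out


data = [8, 1, 7, 2, 6, 3, 5, 4]
full = describe_values(data)                  # default percentiles
fast = describe_values(data, percentiles=[])  # cheap statistics only
```

    With an empty list, only count, mean, min and max are computed, which is 
the faster path suggested above.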

