Github user viirya commented on the pull request:
https://github.com/apache/spark/pull/9067#issuecomment-151793559
@rxin I ran a simple performance measurement, as follows.
Record count: 1333635318
Record count after group by: 259200
The SQL query looks like: `SELECT SUM(a) AS a, SUM(b) AS b, SUM(c) AS c,
SUM(d) AS d FROM table GROUP BY e`
4 workers (8 cores), executor memory: 512 MB.
With pre-aggregation enabled:
67720191 microseconds
66424539 microseconds
62959275 microseconds
With pre-aggregation disabled:
69934956 microseconds
70351959 microseconds
68437353 microseconds
So it looks like pre-aggregation gives roughly a 5% improvement on average.
That is not very significant, but the reduction factor is not high, so this
is to be expected.
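
For reference, the average improvement can be re-derived from the six timings listed above. This is only a quick arithmetic sketch, not part of the benchmark itself; the helper name `relative_improvement` is made up for illustration.

```python
from statistics import mean

# Wall-clock timings in microseconds, copied from the runs above.
enabled_us = [67720191, 66424539, 62959275]
disabled_us = [69934956, 70351959, 68437353]


def relative_improvement(baseline, candidate):
    """Fractional reduction in mean time of candidate versus baseline."""
    base = mean(baseline)
    return (base - mean(candidate)) / base


improvement = relative_improvement(disabled_us, enabled_us)
print(f"mean enabled:  {mean(enabled_us) / 1e6:.2f} s")
print(f"mean disabled: {mean(disabled_us) / 1e6:.2f} s")
print(f"improvement:   {improvement:.1%}")
```

With only three runs per configuration and a spread of several seconds between runs, the figure should be read as a rough estimate.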