Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/295#issuecomment-39397158

It's described a little in the JIRA, but off the top of my head these are the high points (rough sketches of each below):

- Do all calls to `transform` on the input expressions once (probably per partition? I'm not exactly sure how `@transient lazy val`s work with Spark and serialization) instead of once per group / output row.
- Use `MutableRow` wherever possible.
- Use the `Projection` and `MutableProjection` interfaces (which internally use while loops / raw arrays, and will make switching to code gen easier).
- Avoid `.map` and `.foreach` in the critical path (they are used all over for setup, though, which I think is okay) and instead write while loops.
- Build the hash table that holds aggregate buffers in a streaming fashion, instead of doing one big `groupBy` per partition and then working on the materialized groups.
- Stream the output through a custom `Iterator` instead of materializing it all and then calling `.toIterator`.
- Special-case operations where there are no grouping expressions (don't build a hash table at all). The critical thing here is that, before, we were calling `.count()` and then doing something if that came back empty (you always output a single row when there are no grouping expressions, even if there is no input). Now we don't double-compute the child RDD or break lineage in these cases.

Basically... we were not being very careful about performance at all before ;)
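A minimal sketch of the `@transient lazy val` behavior mentioned in the first bullet, using a hypothetical `ExampleOperator` (not Spark code): the field is excluded from serialization and recomputed lazily on first use in each deserialized copy, so the preparation step runs once per deserialized operator rather than once per group / output row.

```scala
// Hypothetical operator, for illustration only. `prepared` is not
// serialized; it is recomputed lazily on first access after the object
// is deserialized on an executor.
class ExampleOperator(input: Seq[String]) extends Serializable {
  // Stands in for running `transform` over the input expressions once.
  @transient lazy val prepared: Seq[String] = input.map(_.toUpperCase)

  def evaluate(value: String): Boolean = prepared.contains(value)
}
```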
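For the `MutableRow` / `MutableProjection` bullets, a rough self-contained stand-in (the real Catalyst classes have richer interfaces; `SimpleMutableRow` and `SimpleMutableProjection` here are hypothetical names): one preallocated slot array is reused for every output row, and the projection fills it with a raw while loop over an array instead of `.map` over a collection.

```scala
// A reusable row backed by one preallocated array of slots.
final class SimpleMutableRow(arity: Int) {
  private[this] val values = new Array[Any](arity)
  def update(ordinal: Int, value: Any): Unit = values(ordinal) = value
  def apply(ordinal: Int): Any = values(ordinal)
}

// `exprs` stands in for compiled expression evaluators.
final class SimpleMutableProjection(exprs: Array[Any => Any]) {
  private[this] val row = new SimpleMutableRow(exprs.length)

  def apply(input: Any): SimpleMutableRow = {
    var i = 0
    while (i < exprs.length) { // while loop, no per-row allocation
      row(i) = exprs(i)(input)
      i += 1
    }
    row // the same instance every call; callers must copy if they retain it
  }
}
```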
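For the streaming hash table and custom output `Iterator` bullets, a toy example assuming string grouping keys and a sum/count buffer (the real operator evaluates Catalyst expressions against rows): input is consumed one row at a time in a while loop, each group's buffer is mutated in place rather than materializing groups with `groupBy`, and results stream out through an `Iterator` built over the hash map instead of an output collection plus `.toIterator`.

```scala
import scala.collection.mutable

// Hypothetical per-group aggregate buffer, updated in place.
final class SumCountBuffer(var sum: Long = 0L, var count: Long = 0L)

def aggregatePartition(rows: Iterator[(String, Long)]): Iterator[(String, Double)] = {
  val buffers = mutable.HashMap.empty[String, SumCountBuffer]
  while (rows.hasNext) { // streaming: no groupBy, no materialized groups
    val (key, value) = rows.next()
    val buf = buffers.getOrElseUpdate(key, new SumCountBuffer)
    buf.sum += value
    buf.count += 1
  }
  // Stream results out through a custom Iterator rather than building an
  // output collection and calling .toIterator on it.
  new Iterator[(String, Double)] {
    private[this] val underlying = buffers.iterator
    override def hasNext: Boolean = underlying.hasNext
    override def next(): (String, Double) = {
      val (key, buf) = underlying.next()
      (key, buf.sum.toDouble / buf.count) // finalize, e.g. an average
    }
  }
}
```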
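And for the no-grouping-expressions special case, a hypothetical global sum: a single pass over the input iterator covers the empty case, so no hash table is built and no separate `.count()` pass over the child RDD is needed just to detect empty input.

```scala
// With no grouping expressions there is exactly one output row per
// partition, even when the input is empty; one pass suffices.
def globalSum(rows: Iterator[Long]): Iterator[Long] = {
  var sum = 0L
  while (rows.hasNext) {
    sum += rows.next()
  }
  Iterator.single(sum) // a single row; sum is 0 for empty input
}
```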