GitHub user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/295#issuecomment-39397158
  
    It's described a little in the JIRA, but off the top of my head these are 
the high points:
     - Do all calls to `transform` on the input expressions once (probably once 
per partition?  I'm not exactly sure how `@transient lazy val`s work with Spark 
and serialization; see the first sketch below) instead of once per group / 
output row.
     - Use `MutableRow` wherever possible (sketched below).
     - Use the `Projection` and `MutableProjection` interfaces (which internally 
use while loops / raw arrays, and will make switching to code gen easier); see 
the sketch below.
     - Avoid `.map` and `.foreach` in the critical path (they're used all over 
for setup, though, which I think is okay) and instead write while loops 
(sketched below).
     - Build the hash table that holds aggregate buffers in a streaming fashion 
(sketched below), instead of doing one big `groupBy` per partition and then 
working on the materialized groups.
     - Stream the output through a custom `Iterator` (sketched below) instead 
of materializing it all and then calling `.toIterator`.
     - Special-case operations where there are no aggregate expressions (don't 
build a hash table at all).  The critical thing here is that before, we were 
calling `.count()` and then doing something if that came back empty (you always 
output a single row when there are no grouping expressions, even if there is no 
input).  Now we don't double-compute the child RDD or break lineage in these 
cases (see the last sketch below).
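
    A few of these deserve a sketch.  On the `@transient lazy val` question, 
    here is a minimal, self-contained example (plain Scala, not Spark code) of 
    the behavior I'd expect: the field is skipped during serialization and 
    recomputed lazily once per deserialized copy, i.e. roughly once per task 
    rather than once per row:

    ```scala
    import java.io._

    class Holder(val input: Seq[Int]) extends Serializable {
      // Not serialized; recomputed on first access in each deserialized copy.
      @transient lazy val transformed: Seq[Int] = {
        println("transforming (runs once per deserialized instance)")
        input.map(_ * 2)
      }
    }

    object TransientLazyDemo extends App {
      // Round-trip through Java serialization, as Spark does when shipping
      // closures to executors.
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(new Holder(Seq(1, 2, 3)))
      oos.close()
      val copy = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
        .readObject().asInstanceOf[Holder]
      copy.transformed  // prints once
      copy.transformed  // cached; nothing printed
    }
    ```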
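
    The `MutableRow` point is about allocation: overwrite one buffer in place 
    instead of allocating a fresh row for every update.  Roughly (illustrative 
    names, not the actual catalyst API):

    ```scala
    // A stand-in for the idea behind MutableRow: fixed arity, updated in place.
    class SketchMutableRow(arity: Int) {
      private val values = new Array[Any](arity)
      def update(ordinal: Int, value: Any): Unit = values(ordinal) = value
      def apply(ordinal: Int): Any = values(ordinal)
    }
    ```

    Per-group aggregate buffers can then be mutated as input rows stream by 
    (e.g. `buf(0) = buf(0).asInstanceOf[Long] + 1`) instead of being rebuilt.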
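
    Rough shape of the `Projection` / `MutableProjection` idea (illustrative 
    only, not the exact catalyst interface): a function from input row to 
    output row that walks a raw array of expressions with a while loop, the 
    mutable variant writing into a single reused row:

    ```scala
    object ProjectionSketch {
      type Row = Array[Any]

      class SketchMutableProjection(exprs: Array[Row => Any]) extends (Row => Row) {
        private val out: Row = new Array[Any](exprs.length)
        def apply(input: Row): Row = {
          var i = 0
          while (i < exprs.length) {  // raw array + while loop, no per-element closures
            out(i) = exprs(i)(input)
            i += 1
          }
          out  // the same buffer is returned on every call
        }
      }
    }
    ```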
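
    For the critical-path point, the difference is just this (`.map` goes 
    through closure dispatch and builder machinery per element; the while loop 
    is tight and allocation-free):

    ```scala
    def viaMap(xs: Array[Int]): Array[Int] = xs.map(_ * 2)  // fine for setup code

    def viaWhile(xs: Array[Int]): Array[Int] = {            // preferred per row
      val out = new Array[Int](xs.length)
      var i = 0
      while (i < xs.length) {
        out(i) = xs(i) * 2
        i += 1
      }
      out
    }
    ```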
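
    Streaming the hash table build looks roughly like this (generic, made-up 
    helper signatures; the real code works on rows and aggregate buffers):

    ```scala
    import scala.collection.mutable

    // One pass per partition: look up (or create) the group's buffer and
    // mutate it, instead of materializing whole groups with groupBy first.
    def aggregatePartition[Row, K](
        input: Iterator[Row],
        groupKey: Row => K,
        newBuffer: () => Array[Any],
        update: (Array[Any], Row) => Unit): mutable.HashMap[K, Array[Any]] = {
      val buffers = mutable.HashMap.empty[K, Array[Any]]
      while (input.hasNext) {
        val row = input.next()
        update(buffers.getOrElseUpdate(groupKey(row), newBuffer()), row)
      }
      buffers
    }
    ```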
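
    The output side streams too: a custom `Iterator` builds one output row per 
    `next()` call instead of materializing everything and calling `.toIterator` 
    (again with made-up names):

    ```scala
    import scala.collection.mutable

    def resultIterator[K](
        buffers: mutable.HashMap[K, Array[Any]],
        makeOutputRow: (K, Array[Any]) => Array[Any]): Iterator[Array[Any]] =
      new Iterator[Array[Any]] {
        private val entries = buffers.iterator
        def hasNext: Boolean = entries.hasNext
        def next(): Array[Any] = {
          val (key, buffer) = entries.next()
          makeOutputRow(key, buffer)  // one output row materialized at a time
        }
      }
    ```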
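
    Finally, the empty-input special case: the old `.count()`-then-recompute 
    pattern evaluated the child RDD twice.  A minimal sketch of doing it in one 
    pass instead (made-up buffer functions, not the PR's actual code), where an 
    empty input still folds down to the initial buffer, from which the single 
    output row is built:

    ```scala
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def globalAggregate[T, B: ClassTag](
        child: RDD[T],
        newBuffer: () => B,
        update: (B, T) => B,
        merge: (B, B) => B): B =
      child
        // One buffer per partition, computed in a single pass over the child.
        .mapPartitions(iter => Iterator(iter.foldLeft(newBuffer())(update)))
        // fold (not reduce) so an empty RDD still yields the initial buffer.
        .fold(newBuffer())(merge)
    ```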
    
    Basically... we were not being very careful about performance at all before 
;)

