[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

rxin Mon, 09 Mar 2015 10:39:09 -0700

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-77903115
  
    Serializer seems ok to add.
    
    One thing I am not sure about is the mapSideCombine parameter. I have never 
been a fan of it, even though I added it myself, for the following reasons:
    
    1. mapSideCombine is a MapReduce term, also used in Hive, that doesn't mean 
much outside of MapReduce. A more fitting name is partialAggregation.
    2. The underlying implementation should be able to skip partial 
aggregation when it finds that partial aggregation is expensive (e.g. after 
processing 10000 records, check whether the hash table size is below a specific 
threshold). This is one of the things we can easily auto-tune.
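
    The check described in point 2 could look roughly like the sketch below. 
It is written in Python for brevity rather than in Spark's Scala, and it is only 
an illustration under stated assumptions: the function name 
adaptive_partial_aggregate, its parameters, and the default thresholds are 
hypothetical and are not taken from this pull request or from Spark's API.

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def adaptive_partial_aggregate(
    records: Iterable[Tuple[K, V]],
    merge: Callable[[V, V], V],
    sample_size: int = 10000,        # hypothetical default, mirrors "10000 records"
    max_distinct_keys: int = 5000,   # hypothetical threshold on hash table size
) -> Iterator[Tuple[K, V]]:
    """Partially aggregate records by key, giving up if it does not pay off.

    After `sample_size` records, the hash table size is compared with
    `max_distinct_keys`. If the table is already that large, combining is
    assumed not to reduce the data much, so the partial results are flushed
    and the remaining records are passed through unchanged.
    """
    table: Dict[K, V] = {}
    it = iter(records)
    seen = 0
    for key, value in it:
        table[key] = merge(table[key], value) if key in table else value
        seen += 1
        if seen == sample_size and len(table) >= max_distinct_keys:
            break  # too many distinct keys: stop aggregating on this side
    # Emit whatever was combined so far.
    yield from table.items()
    # Pass the rest through untouched (empty if the loop consumed everything).
    yield from it
```

    Either way the downstream (reduce-side) aggregation still has to merge 
values per key, so giving up early changes only how much data is shuffled, not 
the final result.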
