Github user guowei2 commented on the pull request:

    https://github.com/apache/spark/pull/2029#issuecomment-53433705
  
    @marmbrus 
    
    The benchmark result above is disappointing.
    Once one spill happens, a batch of spills usually follows, one after another.
    
    The size of the AppendOnlyMap grows with the number of distinct keys,
    since values with the same key are merged into a single entry.
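    
    To make that concrete, here is a minimal sketch (plain Scala, not
    Spark's actual class) of the combiner semantics ExternalAppendOnlyMap
    follows: the map holds one combined entry per distinct key, no matter
    how many records share that key.
    
    ```scala
    import scala.collection.mutable

    // One entry per distinct key: values are merged into a combiner on arrival.
    val counts = mutable.HashMap.empty[String, Long]
    val records = Seq("a" -> 1L, "b" -> 1L, "a" -> 1L, "a" -> 1L)
    for ((k, v) <- records) counts(k) = counts.getOrElse(k, 0L) + v
    assert(counts.size == 2) // 4 records, but only 2 distinct keys in memory
    ```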
    
    I don't think using ExternalAppendOnlyMap is a good approach here, because
    it is too expensive when records with the same key spill to disk over and
    over again, as sketched below.
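    
    Here is a toy sketch of that spill pattern (illustrative only, not
    Spark's implementation): each time the in-memory map fills, all of its
    entries are written out, so the partial aggregate for a hot key is
    serialized once per spill file and must be re-merged on read.
    
    ```scala
    import scala.collection.mutable

    def aggregateWithSpills[K](records: Iterator[(K, Long)],
                               maxInMemoryKeys: Int): Seq[Map[K, Long]] = {
      val spills = mutable.ArrayBuffer.empty[Map[K, Long]] // stands in for spill files
      var current = mutable.HashMap.empty[K, Long]
      for ((k, v) <- records) {
        current(k) = current.getOrElse(k, 0L) + v
        if (current.size >= maxInMemoryKeys) {     // "memory full": spill everything
          spills += current.toMap                  // a hot key's partial sum is
          current = mutable.HashMap.empty[K, Long] // written once per spill file
        }
      }
      spills += current.toMap
      spills.toSeq // reading back must merge the same hot keys across every spill
    }
    ```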
    
    Besides, users can easily avoid OOM by raising
    spark.sql.shuffle.partitions, which reduces the number of keys per partition.
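    
    For example (a sketch only; the value 800 is arbitrary, and I am
    assuming the SQLContext.setConf API is available):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("agg-tuning"))
    val sqlContext = new SQLContext(sc)
    // More shuffle partitions means fewer distinct keys per partition's map.
    sqlContext.setConf("spark.sql.shuffle.partitions", "800") // default: 200
    ```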
    
    I think the logic of ExternalAppendOnlyMap should be optimized.
    
    Joins seem to have a similar problem; meanwhile, putting both the left and
    right tables into an ExternalAppendOnlyMap is expensive too.

