[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

squito Fri, 03 Mar 2017 12:26:52 -0800

Github user squito commented on the issue:

    https://github.com/apache/spark/pull/16867
  
    more brainstorming:
    
    (1) You could lazily update your median collection (whether it's a TreeSet 
or median heap).  First you'd just dump finished tasks into an array, and then 
when you query for the median, you'd update your TreeSet or heap with the new 
data.  This has the properties we want, but it adds a lot of complexity, so I'm 
really not sure if it's worth it.
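
    To make (1) concrete, here is a rough sketch of the lazy idea, written in 
Python for brevity rather than as code for the PR.  The class name 
LazyMedianHeap and its methods are made up for illustration, and a two-heap 
median stands in for whichever collection the PR ends up using.  add() only 
appends to a buffer; the heaps are touched only when median() is called.

```python
import heapq


class LazyMedianHeap:
    """Hypothetical helper: buffers durations, updates the heaps only on query."""

    def __init__(self):
        self._pending = []  # durations recorded since the last median() call
        self._lower = []    # max-heap (values stored negated) of the smaller half
        self._upper = []    # min-heap of the larger half

    def add(self, duration):
        # O(1) on the hot path: no heap work when a task finishes.
        self._pending.append(duration)

    def _flush(self):
        # Fold the buffered durations into the two heaps.
        for d in self._pending:
            if self._lower and d > -self._lower[0]:
                heapq.heappush(self._upper, d)
            else:
                heapq.heappush(self._lower, -d)
            # Keep len(lower) equal to len(upper) or exactly one larger.
            if len(self._lower) > len(self._upper) + 1:
                heapq.heappush(self._upper, -heapq.heappop(self._lower))
            elif len(self._upper) > len(self._lower):
                heapq.heappush(self._lower, -heapq.heappop(self._upper))
        self._pending.clear()

    def median(self):
        self._flush()
        if not self._lower:
            raise ValueError("no durations recorded")
        if len(self._lower) > len(self._upper):
            return float(-self._lower[0])
        return (-self._lower[0] + self._upper[0]) / 2.0


m = LazyMedianHeap()
for d in [5, 1, 3]:
    m.add(d)          # buffered only
first = m.median()    # heaps are built here
m.add(7)
second = m.median()   # only the one new value is folded in
```

    The extra bookkeeping (a buffer plus the flush step) is the complexity 
mentioned above; whether it pays off depends on how often the median is 
queried relative to how often tasks finish.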
    
    (2) sampling.  Getting the exact median here is not important.  You could 
keep a reservoir sample of 100 tasks, and take the median of those.  This would 
be fast & simple.
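
    A rough sketch of (2), again in Python and not taken from the PR.  It uses 
the standard reservoir-sampling scheme (Algorithm R); the class name 
ReservoirMedian, the capacity default of 100, and the seed parameter are 
assumptions for illustration only.

```python
import random
import statistics


class ReservoirMedian:
    """Hypothetical helper: approximate median from a fixed-size uniform sample."""

    def __init__(self, capacity=100, seed=None):
        self._capacity = capacity
        self._sample = []   # at most `capacity` durations
        self._seen = 0      # total durations offered so far
        self._rng = random.Random(seed)

    def add(self, duration):
        self._seen += 1
        if len(self._sample) < self._capacity:
            # Reservoir not full yet: keep everything.
            self._sample.append(duration)
        else:
            # Algorithm R: the new item replaces a random slot with
            # probability capacity / seen, keeping the sample uniform.
            j = self._rng.randrange(self._seen)
            if j < self._capacity:
                self._sample[j] = duration

    def median(self):
        if not self._sample:
            raise ValueError("no durations recorded")
        # Sorting at most `capacity` values, so the cost is bounded.
        return statistics.median(self._sample)


r = ReservoirMedian(capacity=100, seed=0)
for d in range(50):
    r.add(d)
exact_while_small = r.median()  # exact while fewer than 100 values are seen
for d in range(50, 10000):
    r.add(d)
approx = r.median()             # an estimate once the reservoir is full
```

    Both add() and median() cost a bounded amount of work regardless of how 
many tasks the stage has, at the price of the median being an estimate once 
more than 100 tasks have finished.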


