Github user squito commented on the issue:
https://github.com/apache/spark/pull/16867
More brainstorming:
(1) You could lazily update your median collection (whether it's a treeset
or a median heap). First you'd just dump tasks into an array, and then when you
query for the median, you'd update your treeset or heap with the new data.
This has the properties we want, but it adds a lot of complexity, so I'm really
not sure if it's worth it.
(2) Sampling. Getting the exact median here is not important. You could
keep a reservoir sample of 100 tasks, and take the median of those. This would
be fast & simple.
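
To make the two options concrete, here is a rough sketch of both, written in Python for brevity rather than in Spark's Scala. The class names `LazyMedian` and `ReservoirMedian`, the `add`/`median` methods, and the `seed` parameter are all hypothetical and are not taken from the PR. A sorted list stands in for the treeset or heap in (1), and (2) uses the standard "Algorithm R" reservoir scheme with the sample size of 100 mentioned above.

```python
import bisect
import random
import statistics


class LazyMedian:
    """Option (1): buffer values, order them only when the median is queried."""

    def __init__(self):
        self._sorted = []   # stand-in for the treeset / median heap
        self._pending = []  # append-only buffer of values not yet ordered

    def add(self, duration):
        # No ordering work when a task finishes.
        self._pending.append(duration)

    def median(self):
        # Fold the buffered values into the sorted structure on demand.
        for d in self._pending:
            bisect.insort(self._sorted, d)
        self._pending.clear()
        if not self._sorted:
            raise ValueError("no values recorded")
        n = len(self._sorted)
        mid = n // 2
        if n % 2 == 1:
            return self._sorted[mid]
        return (self._sorted[mid - 1] + self._sorted[mid]) / 2


class ReservoirMedian:
    """Option (2): approximate median from a fixed-size uniform sample."""

    def __init__(self, capacity=100, seed=None):
        self._capacity = capacity
        self._sample = []
        self._seen = 0
        self._rng = random.Random(seed)  # seed is only for reproducibility

    def add(self, duration):
        self._seen += 1
        if len(self._sample) < self._capacity:
            # Reservoir not full yet: keep everything.
            self._sample.append(duration)
        else:
            # Algorithm R: keep the new value with probability capacity/seen.
            j = self._rng.randrange(self._seen)
            if j < self._capacity:
                self._sample[j] = duration

    def median(self):
        # Exact median of the sample, an estimate of the true median.
        return statistics.median(self._sample)
```

In this sketch the reservoir version does constant work per `add` and sorts at most 100 values per query, whereas the lazy version pays for every buffered value at query time and needs the second data structure, which is the extra complexity noted in (1).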