Yep, that's clear, and it's a reasonable case. There are already approximate median computations implemented in Spark that can be done cumulatively, as you say. I think it's reasonable to consider this for performance, as it can be faster with just a small error tolerance. But yeah, up to you if you have better ideas.
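
For example, here's a rough sketch of reusing that support: fold the per-task values into the Greenwald-Khanna summary behind approxQuantile (org.apache.spark.sql.catalyst.util.QuantileSummaries) instead of keeping every Long. The task values here are made up, and the signatures are from memory and drift a bit across versions, so treat this as an assumption rather than actual listener code:

// Sketch only, not actual SQLAppStatusListener code. QuantileSummaries is an
// internal catalyst utility; e.g. query() returns Option[Double] in recent
// releases, but that has changed over time.
import org.apache.spark.sql.catalyst.util.QuantileSummaries

// Hypothetical per-task values for one metric; in the listener these would
// arrive incrementally as tasks finish.
val taskValues: Seq[Long] = Seq(12L, 7L, 30L, 18L, 25L)

val relativeError = 0.01  // how "pretty-close" the median needs to be

// Fold the values into the sketch; it retains O(1 / relativeError) entries
// rather than one Long per task.
val summary = taskValues
  .foldLeft(new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError)) {
    (s, v) => s.insert(v.toDouble)
  }
  .compress()

val approxMedian: Option[Double] = summary.query(0.5)

The memory win is that the summary size depends only on the error tolerance, not on the number of tasks.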
On Wed, Nov 27, 2019 at 7:57 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
> Thanks all for providing input! Maybe I wasn't clear about my intention.
>
> The issue I'm focusing on is: there are plenty of metrics defined in a stage
> for SQL, and each metric has a value per task which is grouped later to
> calculate aggregated values. (e.g. the metric for "elapsed time" is shown in
> the UI as sum, min, med, max - whose source values come from each task)
>
> Because an exact "median" can't be computed by accumulation, we are now
> storing all values for all metrics until the end of the stage. Given the
> default value of sql shuffle partitions is 200, a stage would have 200 tasks
> whenever we deal with a shuffle (grouping, join, etc.). If we have 50 metrics
> in a stage, 10000 Long values are maintained on the driver side, which
> ideally might only need to be 50 * number of aggregations (at most 4) if all
> of the aggregations supported accumulation. So I'm wondering about something
> that supports accumulation and comes close to the median. (I guess it's
> intentional not to take the average here so...)
>
> What's more, in SQLAppStatusListener they're all calculated together at the
> end of the SQL execution, which may contain multiple jobs.
> (Oh wait... Hmm... Looks like I missed another point of optimization here
> which might mitigate the issue heavily... so please treat my idea as a rough
> idea just for possible optimization.)
>
> But again, that's a very rough idea, and it won't make sense if the expected
> output is not acceptable as a representation.
>
> -Jungtaek Lim (HeartSaVioR)
>
>
> On Wed, Nov 27, 2019 at 11:25 PM Sean Owen <sro...@gmail.com> wrote:
>>
>> How big is the overhead, at scale?
>> If it has a non-trivial effect for most jobs, I could imagine reusing
>> the existing approximate quantile support to more efficiently find a
>> pretty-close median.
>>
>> On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim
>> <kabhwan.opensou...@gmail.com> wrote:
>> >
>> > Hi Spark devs,
>> >
>> > The change might be specific to SQLAppStatusListener, but since it may
>> > change the metric values shown in the UI, I would like to hear some
>> > opinions on this.
>> >
>> > When we aggregate a SQL metric across tasks, we apply "sum", "min",
>> > "median", and "max", all of which are cumulative except "median". It
>> > differs from "average" in that it helps get rid of outliers, but if
>> > that's the only purpose, it may not strictly need to be the exact median.
>> >
>> > I'm not sure how much this would weaken its value as a representative
>> > figure, but if it doesn't hurt much, what about taking a median of
>> > medians? For example, take the median of the nearest 10 tasks, store it
>> > as one of the medians, and finally take the median of those medians. If
>> > I've calculated correctly, that would only require 11% of the slots when
>> > the number of tasks is 100, and it would replace sorting 100 elements
>> > with sorting 10 elements 11 times. The difference would be bigger as the
>> > number of tasks grows.
>> >
>> > Just a rough idea, so any feedback is appreciated.
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)
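
For concreteness, a rough sketch of the bucketed "median of medians" idea from the original mail, with an arbitrary bucket size of 10 and made-up task values - an illustration of the trade-off, not the proposed SQLAppStatusListener change:

// Illustration only: bucket the per-task values, keep one median per bucket
// of 10, then take the median of those bucket medians. The result is only an
// approximation of the true median, which is exactly the error tolerance
// being discussed in this thread.
def upperMedian(xs: Seq[Long]): Long = {
  val sorted = xs.sorted
  sorted(sorted.length / 2)  // upper median; exact tie-breaking doesn't matter here
}

val bucketSize = 10
val taskValues: Seq[Long] = Seq.tabulate(100)(i => (i * 37 % 101).toLong)  // fake metric values

val bucketMedians = taskValues.grouped(bucketSize).map(upperMedian).toSeq
val approxMedian = upperMedian(bucketMedians)

A streaming version would only need to buffer the current bucket plus the stored bucket medians, rather than holding every task's value until the end of the stage.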