How big is the overhead, at scale?
If it has a non-trivial effect for most jobs, I could imagine reusing
the existing approximate quantile support to more efficiently find a
pretty-close median.
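
For reference, a rough sketch of what that could look like, assuming the
Greenwald-Khanna summaries in
org.apache.spark.sql.catalyst.util.QuantileSummaries (the internal class
backing approxQuantile); the exact signatures (insert/compress/query
returning Option[Double]) are from recent Spark versions and should be
treated as approximate:

  import org.apache.spark.sql.catalyst.util.QuantileSummaries

  // Approximate median of the per-task metric values, instead of a full sort.
  // relativeError bounds how far the reported quantile may drift from the
  // exact one; 0.01 here is an illustrative choice, not a recommendation.
  def approxMedian(taskValues: Array[Long], relativeError: Double = 0.01): Double = {
    val init = new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError)
    val summaries = taskValues.foldLeft(init)((s, v) => s.insert(v.toDouble))
    // query(0.5) returns Option[Double] in recent versions; empty input falls back to 0.
    summaries.compress().query(0.5).getOrElse(0.0)
  }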

On Wed, Nov 27, 2019 at 3:55 AM Jungtaek Lim
<kabhwan.opensou...@gmail.com> wrote:
>
> Hi Spark devs,
>
> The change might be specific to the SQLAppStatusListener, but since it could
> change the metric values shown in the UI, I'd like to hear some opinions on
> this.
>
> When we aggregate a SQL metric across tasks, we apply "sum", "min",
> "median", and "max"; all of them can be computed cumulatively except
> "median". We use "median" instead of "average" because it is robust to
> outliers, but if that is its only purpose, we may not strictly need the
> exact value of the median.
>
> I'm not sure how much representativeness the value would lose, but if it
> doesn't hurt much, what about taking a median of medians? For example, take
> the median of each group of 10 consecutive tasks, store it as one of the
> group medians, and finally take the median of those medians (see the sketch
> below the quoted message). If I calculate correctly, with 100 tasks that
> would only require 11% of the slots, and it would replace sorting 100
> elements with sorting 10 elements 11 times. The difference would be bigger
> as the number of tasks grows.
>
> Just a rough idea so any feedbacks are appreciated.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
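
For illustration, a minimal pure-Scala sketch of the median-of-medians idea
described in the quoted message; groupSize and the exactMedian helper are
hypothetical names, and the result is an approximation of the true median:

  // Exact median of a small group by sorting (even-sized groups take the
  // lower-middle element here; averaging the two middle values is another option).
  def exactMedian(xs: Seq[Long]): Long = {
    val sorted = xs.sorted
    sorted((sorted.length - 1) / 2)
  }

  // Median of medians: sort many small groups instead of one large array.
  // For 100 tasks and groupSize = 10 this is 11 sorts of at most 10 elements.
  def medianOfMedians(taskValues: Seq[Long], groupSize: Int = 10): Long = {
    val groupMedians = taskValues.grouped(groupSize).map(exactMedian).toSeq
    exactMedian(groupMedians)
  }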
