Another option could be to use a sketch to get an approximate median (extendable to quantiles as well). For a small number of tasks the sketch would give an accurate value, since there is little data to summarize; for a larger number of tasks the benefit would be substantial.

Regards,
Mayur Rustagi
Ph: +1 (650) 937 9673
http://www.sigmoid.com <http://www.sigmoidanalytics.com/>
@mayur_rustagi <http://www.twitter.com/mayur_rustagi>
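P.S. For illustration, here is a minimal, self-contained Scala sketch of the idea. Reservoir sampling stands in for a production-quality sketch (e.g. t-digest, or the Greenwald-Khanna summary Spark already ships as QuantileSummaries); all names below are made up for this example, not an existing API:

    import scala.util.Random

    // Minimal reservoir-sample "sketch": keeps a bounded uniform sample of
    // the stream and answers approximate quantile queries from it. A real
    // implementation would use a mergeable sketch (t-digest, GK summary),
    // but the trade-off is the same: fixed space regardless of task count.
    final class ReservoirSketch(capacity: Int, seed: Long = 42L) {
      private val rng = new Random(seed)
      private val reservoir = new Array[Double](capacity)
      private var seen = 0L

      def insert(x: Double): Unit = {
        if (seen < capacity) {
          reservoir(seen.toInt) = x
        } else {
          // Algorithm R: keep the new item with probability capacity/seen,
          // evicting a uniformly chosen slot.
          val j = (rng.nextDouble() * (seen + 1)).toLong
          if (j < capacity) reservoir(j.toInt) = x
        }
        seen += 1
      }

      // Approximate quantile (q in [0, 1]) from the current sample.
      def query(q: Double): Double = {
        val n = math.min(seen, capacity.toLong).toInt
        require(n > 0, "empty sketch")
        val sorted = reservoir.take(n).sorted
        sorted(math.min((q * n).toInt, n - 1))
      }
    }

    object ReservoirSketchDemo {
      def main(args: Array[String]): Unit = {
        val sketch = new ReservoirSketch(capacity = 100)
        // Simulate per-task metric values with occasional outliers.
        (1 to 10000).foreach(i => sketch.insert(if (i % 1000 == 0) 1e6 else i.toDouble))
        println(s"approx median = ${sketch.query(0.5)}")  // close to 5000
        println(s"approx p95    = ${sketch.query(0.95)}")
      }
    }

The point is the fixed memory footprint: the sketch never holds more than "capacity" slots no matter how many tasks report the metric.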
On Wed, Nov 27, 2019 at 3:25 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Hi Spark devs,
>
> The change might be specific to SQLAppStatusListener, but since it may
> change the metric values shown in the UI, I would like to hear some
> voices on this.
>
> When we aggregate a SQL metric across tasks, we apply "sum", "min",
> "median", and "max", all of which can be computed cumulatively except
> "median". Median differs from "average" in that it helps get rid of
> outliers, but if that's its only purpose, we may not strictly need the
> exact value of the median.
>
> I'm not sure how much the approximation loses in representativeness,
> but if it doesn't hurt much, what about taking a median of medians? For
> example, take the median of each group of 10 nearby tasks, store it as
> one of the median values, and finally take the median of those medians.
> If I calculate correctly, that would only require 11% of the slots when
> the number of tasks is 100, and would replace sorting 100 elements with
> sorting 10 elements 11 times. The difference would be bigger as the
> number of tasks grows.
>
> Just a rough idea, so any feedback is appreciated.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
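For concreteness, a minimal, self-contained Scala sketch of the median-of-medians idea described above (all names are hypothetical; this is not the actual SQLAppStatusListener code):

    // Hypothetical helper illustrating the proposal: bucket per-task
    // metric values into groups of 10, keep only each group's median,
    // then report the median of those stored medians.
    object MedianOfMedians {
      // Exact median by sorting; for even sizes this takes the upper of
      // the two middle elements, close enough for a UI metric.
      private def exactMedian(xs: Seq[Long]): Long = {
        val sorted = xs.sorted
        sorted(sorted.length / 2)
      }

      // Per 100 tasks this keeps ~10 group slots plus ~10 stored medians
      // (the ~11% of slots mentioned in the proposal) and replaces one
      // sort of 100 elements with 11 sorts of 10 elements.
      def approxMedian(taskValues: Seq[Long], groupSize: Int = 10): Long = {
        require(taskValues.nonEmpty, "no task metric values")
        val groupMedians = taskValues.grouped(groupSize).map(exactMedian).toSeq
        exactMedian(groupMedians)
      }

      def main(args: Array[String]): Unit = {
        // 100 task durations, mostly ~100ms with a few big stragglers.
        val values = Seq.tabulate(100)(i => if (i % 25 == 0) 10000L else 100L + i)
        println(s"median of medians = ${approxMedian(values)}")
      }
    }

Each group's median is exact, so a single straggler cannot drag the final value the way it would drag an average, while only about n/10 intermediate values ever need to be kept.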