Re: [sc-dev] Re: Median: Oasis and Fast Sorting Algorithm

Leonard Mada Sun, 18 Feb 2007 12:14:38 -0800

Sorry for this late reply.

1. MEDIAN - NEW DEFINITION
2. MEDIAN - ALGORITHM

1. The issue that I raised was about the definition of the median in theOASIS document, which was ambiguous and inaccurate. This definition isimportant, because there might exist other medians beyond what OASISintended to define.

I point out the use by many signal processing filters and highermathematics of a weighted median: e.g. the median of 1,2,2,3,4,5 is NOT(2+3)/2, BUT rather (2+2+3)/3 when using this weighted median.Similarly, when using defined spaces, you can NOT have a median of(2+3)/2 = 2.5 inside the space {1,2,3}, because this space does NOT havesuch a value. In this latter instance, it is more appropriate to talkabout a lower and an upper median.

There was a heated discussion on the gnumeric list on these issues. ProfNash provided a very insightful comment and offered an alternative tothe median definition, where *"sorted"* was replaced by *"ranked"*. Ibelieve this is a far better wording and I can live with it.

I append as a quotation that comment and hope that this betterdefinition will replace the ugly one. Please note, that gnumeric has afunction "ssmedian", too (which is a weighted median). So it is possiblethat the OASIS document will include different medians in the future,too, making the old definition very ambiguous.

2. My algorithm had some flaws, specifically IF the insert inside thearray involved moving parts of the array, then the speed would havedecreased dramatically (it would have approached O(1/4 * n^2)). Morecomplex data elements could have improved the insert, BUT would havepenalized any other operation on the array (I did NOT test this).

Nevertheless, the existent algorithms to detect the median are 20-50times *faster than sort algorithms.* Such an algorithm is especiallyuseful for large data sets OR multiple instances of median calculation.Therefore, I believe that such an algorithm has a place inside Calc,too. (See http://ndevilla.free.fr/median/median/node20.html for the fastalgorithms).


Kind regards,

Leonard Mada

Prof J C Nash wrote:

The rather heated discussion on the median has a historic quality. It isnot quite as old as the arguments about how many angels can sit on thehead of a pin, but not too far off. Perhaps I can cool things down by
providing some perspective.
There are, to the eyes of someone who has been teaching and working withthis stuff for nearly 4 decades, three issues that are causing conflict:
1) The question of what we actually want "MEDIAN" to compute.
2) The question of how to express this clearly.
3) The question of how to actually do it.
Question (1) concerns the allowed values for the quantity returned by afunction of type "quantile", the general term for a median, percentile,or any other quantity that returns the "value in the scale of the datawith rank k from the minimum (or, alternatively, from the maximum".
We can insist on returning values from within the original data set, inwhich case we must sometimes choose between two values, or we canspecify a rule for interpolation.
The most common rule for the median is to interpolate when the number ofelements, n, is even, leading to the "definition" that the median is theaverage of the middle ranked values for this case, while it takes thevalue of the middle rank datum when n is odd. I believe all thespreadsheets use this rule. Percentiles are usually (but not always)computed using interpolation, and I have seen some variety in the valuesgiven for quartiles in different statistical packages. Rob Hyndmanpublished a quite good paper in The American Statistician about 8-10years ago on the subject.
John Tukey (stem and leaf graphs, box and whisker plots, five numbersummaries, Exploratory Data Analysis, etc.) liked to have measures thatbelonged to the data set. So he defined Depth as the rank of anobservation from the nearest "end" (highest or lowest) and Tukeyboxplots use Hinges rather than Quartiles (which are generallyinterpolated, but not always by consistent formulas).
See http://macnash.admin.uottawa.ca/files/stembox.txt for a bit of adiscussion on Depths etc.
Ultimately, this question boils down to where to cut to divide 4 candiesamong 5 children. No matter what you do, things get ugly. For large n,whether you use interpolation, and what formula for interpolating,becomes less important.
When we come to (2), I am surprised nobody is using RANK rather thanSORT. To compute quantiles, one does need to rank the data in some way,but we don't actually need to sort it. There were a number of articlesabout 30 years ago on ranking vs sorting, and if anyone is interested Ican try to dig some up. If I were writing the MEDIAN definition for theTC, I would change it in one word to:
MEDIAN logically _ranks_ the numbers (lowest to highest). If given anodd number of values, MEDIAN returns the middle value. If given an even
number of values, MEDIAN returns the arithmetic average of the two
middle values.
In other words, I'll live with the interpolation. Actually, I believeTukey (someone correct me if I'm wrong, my copy of EDA is in anotheroffice) defined median as the value with maximum depth from either endor the arithmetic average of the two values having maximal depth if thevalue is not unique.
Question (3) is one for the programmers and algorithm performancepeople. I think a fast ranking algorithm would be very useful to improveperformance in a lot of spreadsheet features, not just the median. Butit should be an algorithm that is quite "clean" so that application tomore than just a column or row range doesn't end up causing all kinds ofpotential for bugs. For the sake of development and testing, it may beuseful to use compile or even execution switches based on some storedcontrol parameter, so that informed users can try out different rankingalgorithms. Note that this may be helpful for increasing sortperformance on very large sheets.
Note that RANK() gives the rank of a number in a set. We actually wantthe number that is ranked at some level.
No doubt the discussion will continue.

JN

_______________________________________________
gnumeric-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/gnumeric-list


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: [sc-dev] Re: Median: Oasis and Fast Sorting Algorithm

Reply via email to