Rhino wrote:
Well, I think you've just succeeded in demolishing my wonderful example; it turns out that we don't need to use a stored procedure to find a median after all ;-)
That wasn't my intention. I thought I was adding support to your example. Just because it can be done in SQL doesn't necessarily mean it should be.
You're right that the algorithm I described in my original reply is a bit simplified and assumes an odd number of rows; it doesn't handle the case where the number of rows is even. I assume that was just for the convenience of the person who wrote the course materials I was teaching; they didn't want to get bogged down in the subtleties of the details of calculating a median.
I was specifically responding to Wolfram's suggested solution "select ... limit count/2, 1", rather than your description which didn't really include the algorithm, but this leads directly to Shawn's point about the SP version helping the end user by shielding him/her from the details. This is at least the third time the subject of medians has been discussed on this list in recent memory. In each case, most proposed solutions were incorrect, because they were based on the flawed assumption that there is always a middle value, an assumption which is false roughly half the time (n even). If you write a correct solution, you not only make the end-user's life easier, you protect him/her from getting the wrong answer.
I have to admit I've never seen an SQL query that would compute a median before. I'm not sure I completely understand your query, particularly the GROUP BY and HAVING clauses - I know what GROUP BY and HAVING do in general, I'm just not sure what they are accomplishing in *this* case - but you're a mathematician so I'll assume that the query is accurate and will work for both odd and even numbered sets of rows ;-)
I came up with this solution almost exactly a year ago in another median thread <http://lists.mysql.com/mysql/155528>. I got the idea from an *incorrect* solution in O'Reilly's "Transact-SQL Cookbook" <http://www.oreilly.com/catalog/transqlcook/chapter/ch08.html>.
The median is often described as the "middle value", but that is a slight simplification. The median, which I'll call m, has the following properties:
* At least 50% of the values are <= m * At least 50% of the values are >= m
The "at least" part makes sense when you consider the possibility of repeated values in the middle.
I'll repeat the simpler of the two solutions (median of all values in one column) and explain it:
CREATE TEMPORARY TABLE medians SELECT x.val medians FROM data x, data y GROUP BY x.val HAVING SUM(y.val <= x.val) >= COUNT(*)/2 AND SUM(y.val >= x.val) >= COUNT(*)/2;
SELECT AVG(medians) AS median FROM medians;
DROP TABLE medians;
We join the table to itself, with *no* join condition (Cartesian product). For each value on the left (GROUP BY x.val), we get a row for every value on the right (y.val). We count how many of the rows have values on the right which are less than or equal to x.val {SUM(y.val <= x.val)}, and how many have values which are greater than or equal to x.val {SUM(y.val >= x.val)}. We only accept rows (HAVING) where both counts are at least 50% of the total rows. In the odd rows case, this can only be satisfied by the value in the middle, in the even case, this can only be satisfied by the two values on either side of the middle. In both cases, the average of the result(s) is the median.
This works, but it is hardly efficient. We create a Cartesian product in order to get 1 or 2 rows. The programmatic solution is surely more efficient (pseudocode):
#get the number of rows, N "SELECT COUNT(*) FROM data;" # fast with MyISAM, slow with InnoDB
If N is odd { # median is the middle value # middle position is (N+1)/2, starting with row 1, # but LIMIT starts with row 0, so use (N-1)/2 offset = (N-1)/2 "SELECT val FROM data ORDER BY val LIMIT offset, 1;" return val as median } Else #N is even { # median is average of middle 2 values # middle 2 positions are N/2 and N/2 + 1, starting from row 1, # but LIMIT starts with row 0, so use N/2 - 1 and N/2 offset = (N/2) - 1 "SELECT val FROM data ORDER BY val LIMIT offset, 2;" return (val from row 1 + val from row 2)/2 as median }
but it cannot be done in SQL (no flow control, can't use user variables in LIMIT). That leaves client side or SP, I think.
It looks like I'll have to come up with a more bulletproof example of a stored procedure before I next teach the concepts.
You'll have to be the judge, but it still seems a good example to me. Median is a relatively simple concept, yet most people get the formula wrong (even in a published book!), so it is perhaps best not left to the client. The straightforward, efficient solution requires flow control, so a stored procedure seems appropriate.
Rhino
Michael
-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]