RE: UDF request?

SGreen Mon, 11 Apr 2005 06:29:48 -0700

Dan Bolser <[EMAIL PROTECTED]> wrote on 04/08/2005 06:45:42 PM:

> On Fri, 8 Apr 2005 [EMAIL PROTECTED] wrote:
> 
> >Dan Bolser <[EMAIL PROTECTED]> wrote on 04/08/2005 12:41:35 PM:
> >
> >> On Thu, 7 Apr 2005, Sean Nolan wrote:
> >> 
> >> >I think you'll find you can do what you want with a cross join. A 
cross 
> >join 
> >> >will join every row from the first table with every row from the 
second 
> >
> >> >table. It will not randomly do this, so you'd have to be creative in


> >> >"randomly" selecting rows in the WHERE clause. Here's how you could 
do 
> >this 
> >> >with your data, pardon the highly original and very scientific 
> >> >randomization, but it is just an example :-)
> >> 
> >> Perhaps I don't understand your "randomization" (because I really 
don't
> >> understand it ;), but I don't think a CROSS join does the trick, 
because 
> >I
> >> want to randomly pick the same rows from either side of the join that 
I
> >> would have gotten with an INNER JOIN using the 1:1 primary key 
between 
> >the
> >> two tables. This is analagous to sampling without replacement. If I 
do a
> >> cross join and then just restrict the number of rows returned (hey! I 

> >just
> >> worked out what you are doing below) I don't necessarily get the same
> >> 'marginals' (or to randomly pick the same rows from either side of 
the
> >> join) . This is analagous to sampling with replacement.
> >> 
> >> Do you see what I mean?
> >> 
> >> 
> >> 
> >> 
> >> >
> >> >SELECT
> >> >  Dept,
> >> >  Gender,
> >> >  AVG(Spending)
> >> >FROM
> >> >  table_one a
> >> >CROSS JOIN
> >> >  table_two b
> >> >WHERE (a.Person * b.Person) % 4 = 3
> >> >GROUP BY
> >> >  Dept, Gender;
> >> >
> >> >Sean
> >> >
> >> >----- Original Message ------
> >> >
> >> >Hi,
> >> >
> >> >I have a request for a UDF that I would find really useful. I am
> >> >duplicating this question on the community list as well as the MySQL 

> >list,
> >> >because I am not sure where best to make this kind of request (see 
the
> >> >previous post http://lists.mysql.com/community/97).
> >> >
> >> >I think the following syntax would be really cool...
> >> >
> >> >SELECT
> >> >  a.*,
> >> >  b.*
> >> >FROM
> >> >  table_one a
> >> >RANDOM JOIN       -- <-- :)
> >> >  table_two b
> >> >USING
> >> >  (ID)
> >> >;
> >> >
> >> >Lets say table_one and table_two have a primary key called ID in 
common
> >> >(although of course that isn't necessary).
> >> >
> >> >The idea is that the RANDOM JOIN would first calculate the INNER 
JOIN,
> >> >then scramble the 'links' (or rows) between the two tables. The 
result
> >> >would be the same number of rows overall, the same number of 
distinct
> >> >a.ID's and b.ID's, but those a.ID's and b.ID's would be randomly
> >> >associated with the marginals given by the correct inner join.
> >> >
> >> >Hopefully that makes sense.
> >> >
> >> >I think this function would be really useful for statistical 
analysis 
> >of
> >> >scientific data within MySQL (using randomized versions of the
> >> >associations within the tables).
> >> >
> >> >Not sure if the above syntax is optimal, because I would like to be 
> >able
> >> >to do things like this...
> >> >
> >> >table_one
> >> >Dept   Person   Gender
> >> >A   1   M
> >> >A   2   F
> >> >A   3   M
> >> >B   4   F
> >> >B   5   M
> >> >B   6   F
> >> >
> >> >table_two
> >> >Dept    Person  Spending
> >> >A       1       10
> >> >A       2       20
> >> >A       3       30
> >> >B       4       40
> >> >B       5       50
> >> >B       6       60
> >> >
> >> >SELECT
> >> >  Dept,
> >> >  Gender,
> >> >  AVG(Spending)
> >> >FROM
> >> >  table_one a
> >> >INNER JOIN
> >> >  table_two b
> >> >USING
> >> >  (Dept,Person)
> >> >GROUP BY
> >> >  Dept, Gender
> >> >;
> >> >
> >> >
> >> >With the above query (which I hope is sensible) I would like to keep 

> >the
> >> >departments fixed, and randomize the genders of the people in the
> >> >departments (keeping the number of each sex in each department the 
> >same).
> >> >
> >> >So we could INNER JOIN using Dept and then RANDOM JOIN (as described
> >> >above) using Person - all in one 'JOIN'. All else being the same 
this
> >> >should randomize the Gender, but keep the marginals.
> >> >
> >> >I guess this is overly complex given the problem, and it actually 
> >raises
> >> >more questions instantly (about statistical analysis), but the 
function 
> >is
> >> >basic (I hope), and like all SQL it is the simplicity that allows 
you 
> >to
> >> >build complex and robust statements.
> >> >
> >> >Does the random join make any sense? Is this UDF material?
> >> >
> >> >Any feedback is welcome,
> >> >
> >> >Dan.
> >> >
> >> >Thread
> >> >UDF request? - Dan Bolser, April 1 2005 12:10am
> >> >
> >> >_________________________________________________________________
> >> >Don?t just search. Find. Check out the new MSN Search! 
> >> >http://search.msn.click-url.com/go/onm00200636ave/direct/01/
> >> >
> >> >
> >> >
> >> 
> >I think I get it. It sounds to me like you want to take a random 
sampling 
> >(possibly to include the entire set) and analyze each sample. You can 
> >return the results of a query in random order if you use the RAND() 
> >function:
> >
> >http://dev.mysql.com/doc/mysql/en/mathematical-functions.html
> >
> ><quote from site>
> >As of MySQL 3.23, you can retrieve rows in random order like this:
> >
> >mysql> SELECT * FROM tbl_name ORDER BY RAND();
> >
> >ORDER BY RAND() combined with LIMIT is useful for selecting a random 
> >sample of a set of rows:
> >
> >mysql> SELECT * FROM table1, table2 WHERE a=b AND c<d
> >    -> ORDER BY RAND() LIMIT 1000;
> ><end quote>
> >
> >So if you want to randomly pick 1900 rows of data from a query you 
would 
> >say (you need to fill in the blanks with your query):
> >
> >SELECT ...
> >FROM ...
> >WHERE ...
> >ORDER BY RAND()
> >LIMIT 1900;
> >
> >As to randomly matching rows from one table with corresponding rows 
from a 
> >second table, that's just an illusion. For any two tables JOINed 
together, 
> >the same row combinations will exist regardless of order or sequence.
> >
> >TABLE A
> >1 red
> >2 blue
> >3 green
> >
> >Table B
> >1 ball
> >2 horse
> >3 wagon
> >
> >SELECT CONCAT(a.color,' ', b.item)
> >FROM tablea a
> >INNER JOIN tableb b
> >ORDER BY RAND();
> >
> >I just made a cross product (Cartesian product/cross join) of TableA 
with 
> >TableB. There will be 9 rows in some random order (as determined by the 

> >engine). However, if we sort the results (outside of MySQL) we would 
> >always get the same list in the same order because there would always 
be 3 
> >red items (red ball, red horse, red wagon) and 3 blue items (blue ball, 

> >blue horse, blue wagon) and 3 green items (green ball, green horse, 
green 
> >wagon).
> >
> >However if you do not use the entire result set (you apply a LIMIT 
clause) 
> >so that you leave out some of the combinations, I think you achieve the 

> >results you are looking for. Extending this to our sample data, if I 
LIMIT 
> >the results to just 6 rows I get many different possible result sets 
(how 
> >many unique lists exist of 9 things when taken 6 at a time? - I would 
do 
> >the math but I can't remember which formula to use....). One set may 
have 
> >no red things while another could have 2 reds, 2 greens, and 2 blues 
and 
> >another 3 reds, 2 greens, and a blue.
> 
> n choose k (for combinations), fact(9)/(fact(6)*fact(3))...
> 
> 
> >Is this the randomizing effect you were going for?
> 
> I am still trying to work out what I do want....
> 
> OK, I think I worked it out, I want to *scramble* a list of pairs.
> 
> Table1
> A   B
> 1   10
> 2   20
> 3   30
> 
> select scramble(A,B) from Table1;
> A_scram   B_scram
> 1   30
> 2   10
> 3   20
> 
> (as an example, naturally the results could differ each time the query 
is 
> run)
> 
> The problem with just limiting the cartesian product is that it dosn't
> guarante that every value of A and every value of B is represented the
> correct number of times.
> 
> Here is another example...
> 
> A   B
> 1       10
> 1   20
> 1   30
> 2       20
> 3       30
> 
> select scramble(A,B) from Table1;
> A_scram B_scram
> 1       30
> 1   20
> 1   30
> 2       10
> 3       20
> 
> (for example).
> 
> How can I do this with SQL? Could a SCRAMBLE(A,B) UDF be a reasonable
> request?
> 
> Cheers,
> Dan.
> 
> 
> >
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
> >
> 

I used a method similar to this to help someone line up payment 
transactions so that they could detect the unmatched payments and the 
unmatched billings (he was straightening out some old accounting records)

First you create a table (or a temporary table, your choice) that looks 
something like this

CREATE TEMPORARY TABLE tmpScramble (
        Acolumn int not null
        , Bcolumn int not null
        , UNIQUE KEY (Acolumn)
        , UNIQUE KEY (Bcolumn)
)

Then you populate the table with a randomly sequenced Cartesian join of 
your source data.

INSERT IGNORE tmpScramble(Acolumn, Bcolumn)
SELECT t1.A, t2.B
FROM table1 t1
INNER JOIN table2 t2
ORDER BY RAND();

This will produce a random combination of the rows from table1 and the 
rows from table2.  However because of the UNIQUE keys on tmpScramble the 
unique values from each table can only appear once in tmpScramble. The 
IGNORE keyword on the INSERT statement means that duplicate key value 
errors are ignored which allows the table to fully populate. Basically, 
you get the first row that contains unique values of both A and B as they 
appears in the random sort of their combination. All other rows that 
contain a value of A or B that you have already seen are silently 
rejected.

Now, if you have data that looks like 

table1
---
  1
  1
  1
  2
  2
  3
  4

This techinique won't work because the 2nd and later instances of each 
value will not appear in tblScramble. In this case, you don't select the 
data values from table1 but your primary key values and add them to 
tmpScramble. That way you are assured that each row from each table is 
represented at least once. To get back to your data values, JOIN 
tmpScramble back to table1 or table2 as appropriate to resolve the primary 
key values back into data values.

If doesn't make sense or if it does and you would still like some direct 
help, you can post your actual table structures and I or someone else can 
help to write the query to produce some randomly scrambled pairs.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

RE: UDF request?

Reply via email to