Re: ranking album plays by city, help me beat Oracle

S A Tue, 20 Feb 2001 19:19:10 -0800

> Huh? You got it 7 times faster by only extracting that query from the

> for loop and changing it to a group by? Or did you do anything else?


Sorry there were some other unnecessary queries that were mistakenly in the 
measurement.. 


Here is the missing EXPLAIN.

mysql> EXPLAIN SELECT same_album_id, SUM(td_num) FROM album_by_metro
    -> WHERE metro = 'Atlanta' AND rpt_dt >= '2001-01-01' AND
    -> rpt_dt <= '2001-02-14'
    -> GROUP BY same_album_id HAVING SUM(td_num) > 500
    -> ;
+----------------+-------+----------------------------+---------------------+---
------+------+------+-----------------------------+
| table          | type  | possible_keys              | key                 | ke
y_len | ref  | rows | Extra                       |
+----------------+-------+----------------------------+---------------------+---
------+------+------+-----------------------------+
| album_by_metro | range | rpt_dt,album_by_metro_idx2 | album_by_metro_idx2 |
  103 | NULL | 8925 | where used; Using temporary |
+----------------+-------+----------------------------+---------------------+---
------+------+------+-----------------------------+
1 row in set (0.00 sec)

Sorry I tried the below suggested query but it it is much slower at 1min 30 secs 
probably because it ranks all albums for all metros rather than just working on the 
rank of one album over the metros.

SELECT metro, same_album_id, SUM(td_num) AS played
FROM album_by_metro
WHERE rpt_dt >= '$BeginDate' AND rpt_dt <= '$EndDate'
GROUP BY metro, same_album_id
HAVING SUM(td_num) > $MIN_SUM
ORDER BY metro, same_album_id, played DESC

Here is the explain for the above query :

 

mysql> EXPLAIN SELECT metro, same_album_id, SUM(td_num) AS played
    -> FROM album_by_metro
    -> WHERE rpt_dt >= '2001-01-01' AND rpt_dt <= '2001-02-14'
    -> GROUP BY metro, same_album_id
    -> HAVING SUM(td_num) > 1000
    -> ORDER BY metro, same_album_id, played DESC;
+----------------+-------+---------------+--------+---------+------+--------+---
-------------------------------------------------------+
| table          | type  | possible_keys | key    | key_len | ref  | rows   | Ex
tra                                                    |
+----------------+-------+---------------+--------+---------+------+--------+---
-------------------------------------------------------+
| album_by_metro | index | rpt_dt        | rpt_dt |     161 | NULL | 716943 | wh
ere used; Using index; Using temporary; Using filesort |
+----------------+-------+---------------+--------+---------+------+--------+---
------------------------------------------------------+


Can this faster approach below which in a loop queries the # of albums with > sums be 
optimized to be a single query. ?

 foreach ($metros as $metro)
 {
  $V_SUM = $V_SUMS[$metro];

  $query = "SELECT same_album_id, SUM(td_num) FROM album_by_metro
    WHERE  metro = '$metro' AND rpt_dt >= '$BeginDate' AND rpt_dt <= '$EndDate'
      GROUP BY same_album_id HAVING SUM(td_num) > $V_SUM";
}

Thanks for all your help & interest.

- Sam.

  Benjamin Pflugmann <[EMAIL PROTECTED]> wrote: 
Hello.

On Fri, Feb 16, 2001 at 03:58:20PM -0800, [EMAIL PROTECTED] wrote:
> 
> This is the latest approach we're using now. It turns out that the
> query to get Total Plays for an Album per City can be taken out of
> the City for loop & done only once as a GROUP BY. The worst case
> time for MySQL is now 16 seconds compared to Oracle's 8 seconds.

Huh? You got it 7 times faster by only extracting that query from the
for loop and changing it to a group by? Or did you do anything else?

Well, because the column "range" is not used in any of both queries,
normally only the indexes "album_by_metro_idx1/2" are used. As I said
last time, maybe the indexes could be more specific for the query.
Your change now uses idx2 and has the advantage, that all metros it
matches are really used, therefore the index is now okay for the new
query. Before, it was reading all metros in, although it was only
using one.

If you want to try, it would be interesting, what the time for the old
query in the for loop is, when you have an index on
(same_album_id,metro,rpt_dt). Anyhow, your new solution surely is
faster (avoiding communition overhead for about 100 queries).

> $query = "SELECT metro, sum(td_num) FROM album_by_metro
> WHERE same_album_id = $album
> AND rpt_dt >= '$BeginDate' AND rpt_dt <= '$EndDate'
> GROUP BY metro"; 
> $start = time(); 
> $result = mysql_query($query) or die ("Can't get $query" . mysql_error()); 
> 
> while ($row = mysql_fetch_array($result)) 
> { 
> $metro = $row[0]; 
> $metros[] = $metro; 
> $V_SUMS[$metro] = $row[1];
> }
> // while 
> mysql_free_result($result); 
> // 
> foreach ($metros as $metro)
> {
> $V_SUM = $V_SUMS[$metro]; 
> $query = "SELECT same_album_id, SUM(td_num) FROM album_by_metro
> WHERE metro = '$metro' AND rpt_dt >= '$BeginDate' AND rpt_dt <= '$EndDate'
> GROUP BY same_album_id HAVING SUM(td_num) > $V_SUM";

This will use index album_by_metro_idx2, which already should be the
best index for this kind of query.

If I am not mistaken, you can rewrite this query to a single one
(getting rid of the query within the loop), too, but have to do a bit
more parsing on the client side, afterwards

SELECT metro, same_album_id, SUM(td_num) AS played
FROM album_by_metro
WHERE rpt_dt >= '$BeginDate' AND rpt_dt <= '$EndDate'
GROUP BY metro, same_album_id
HAVING SUM(td_num) > $MIN_SUM
ORDER BY metro, same_album_id, played DESC

MIN_SUM is the minimum of all V_SUMS. The point is, that the database
server has to read in _all_ records matching the where clause in any
case, because HAVING is applied _after_ the rest of the query has
finished. So MySQL got together all records anyhow, this query only
sends some more to the client, which has to simply ignore all with
SUM(td_num) <= $V_SUMS[$metro].

Give it a try, I don't really know, whether it will be faster. It will
probably use index "rpt_dt", because it's the only one with rpt_dt on
the left side. It could use a smaller index (only on rpt_dt or on
rpt_dt,metro,same_album_id), but that will make only a small speed
difference.

Of course, you can combine that all that is done on the database side
(assuming you are using version 3.23.x, but it can be rewritten for
3.22.x, too):

CREATE TEMPORARY TABLE foo (
metro VARCHAR(100) NOT NULL,
same_album_id INT NOT NULL,
played INT NOT NULL,
INDEX (same_album_id),
INDEX (metra,played)
) SELECT metro, same_album_id, SUM(td_num) AS played
FROM album_by_metro
WHERE rpt_dt >= '$BeginDate' AND rpt_dt <= '$EndDate'
GROUP BY metro, same_album_id
ORDER BY metro, same_album_id, played DESC

SELECT l.metro,
SUM( IF(v.same_album_id IS NULL, 0, 1) ) + 1 AS rank
FROM foo AS l LEFT JOIN foo as v
ON v.metro = l.metro AND v.played > l.played
WHERE l.same_album_id = $album
GROUP BY metro

The first query creates a temporary table which contains the sum of
plays for each metro and album. The second one then chooses the album
id in question and looks for each metro for all albums which played
better.

The LEFT JOIN assures that we don't "loose" a metro, when there is no
better album. In this case the album ranked #1 and we get NULL values
for the second table v. This is way we cannot use COUNT (it would
count the NULL row, too), but have to use this SUM(IF...)) stuff.

Then you get on your client side a list with each metro and the rank
of the album in that city. The indexes you have should be sufficient,
but I can only say this for sure, after I have seen the EXPLAINs for
all three statements above.

The self-join on a temporary table could be a problem, too (I remember
that I read something about this on the list once, and don't recall
anymore whether the problem was fixed meanwhile).

Whether or not this will give an speed improvement depends. I would
expect one (because we get rid of the overhead of 100 queries). If
it's significantly slower, there is something wrong and it would be
nice if you could send the EXPLAINs for the two queries.

[...]
> mysql> EXPLAIN SELECT same_album_id, SUM(td_num) FROM album_by_metro
> WHERE metro = 'Atlanta' AND rpt_dt >= '2001-01-01' AND
> rpt_dt <= '2001-02-14'
> GROUP BY same_album_id HAVING SUM(td_num) > 500

Aehem :-) The output of this command is missing.

> mysql> show index from album_by_metro;
> +----------------+------------+---------------------+--------------+------------
> ---+-----------+-------------+----------+--------+---------+
> | Table | Non_unique | Key_name | Seq_in_index | Column_name
> | Collation | Cardinality | Sub_part | Packed | Comment |
> +----------------+------------+---------------------+--------------+------------

I hope some of the suggestion will help.

Bye,

Benjamin.


[...]
> > On Thu, Feb 15, 2001 at 10:47:16AM -0800, [EMAIL PROTECTED] wrote:
> > > 
> > > I'm moving over a music play database from Oracle to MySQL. So far MySQL is 
>faster for all reports except this one where Oracle takes 8 seconds & MySQL will take 
>108 with my current method.
> > [...]
> > > I want to report the ranking of an album's plays (against all other albums ) for 
>every city over a time period.
> > > National album ranking for "Beatles #1" between 1/1/2001 and 2/14/2001 :
> > > New York #1
> > > Chicago #2
> > > Boston #1
> > > Miami #3
> > > 
> > > Right now the current approach involves looping over every city & making 2 
>queries. First query to get the total # of plays for this album for the time period 
>in that city. Next query to figure out how many other albums for that city & for that 
>time have total play sums that are greater, which is the Rank.
> > > 
> > > foreach $cities as $this_city
> > > {
> > > $this_album_total_plays = "select sum(total_plays) from album_by_city
> > > where album_id = $album_id
> > > and city = $this_city
> > > and date >= $StartDate and date <= $EndDate";
> > > 
> > > $rank = "select album_id, sum(total_plays) from album_by_city
> > > where city = $this_city
> > > and date >= $StartDate and date <= $EndDate
> > > group by album_id
> > > having sum(total_plays) > $this_album_total_plays";
> > > echo "$this_city $rank";
> > > }
> > > Is there a way to do this with fewer queries or perhaps temporary tables for 
>greater speed ?
> 


---------------------------------
Do You Yahoo!?
Yahoo! Auctions - Buy the things you want at great prices!
Re: ranking album plays by city, help me beat Oracle

Reply via email to