I have some queries, involving a largish number of JOIN, which
are apparently very slow or even take forever (a mysqladmin processlist
shows them remain in the "statistics" status for a long time, in most
cases I have to kill them after several minutes).
When I first had the problem I googled around and found some reference
(which I've lost) saying that the "statistics" status is actually what one
does with an EXPLAIN SELECT, and that this is done preliminarily to the
actual query. It also said it might occur with a large number of joins
because this analysis, for n joins MIGHT try up to n! combinations, unless
one somehow specified the priorities (but the author did not remember
how).
I thought to have overcome the problem using a feature of the CREATE
VIEW command (see below), but apparently I simply moved it to an higher n.
Now I tried to see how it scales with the number of joins, and the curious
things is that e.g. for n=9 it works fast, for n=15 it works slowly, for
n=18 works fast again and for n=20 takes an infinite time.
I'll first explain my background :
- I have a number of tables (let's call them t0, t1, t2 ...)
- all of them have an auto_increment column called seq which is also
an index
- one table (t0) is "more important" (actually it is list of
celestial X-ray sources while the other are celestial objects
in other wavebands but this is irrelevant to you).
- I have precomputed correlation tables among t0 and each of
the other. These tables are called eg. t0ti, have two columns
t0 and ti corresponding to the t0.seq and ti.seq of objects which
are "associated". They are indexed on (t0,ti). Note that an
object in t0 can be associated with 1 or more or zero (t0ti.ti null)
objects in ti.
- I originally (already under mysql 3) devised a way to identify
counterparts in MORE tables (all these associations are based on
spherical distance + other criteria). This involved creating a
working table G
This table has columns named t0 t1 ... tn (containing the pointers
t0.seq t1.seq ... for counterparts associated and validated according
to some criteria) plus other service columns
The simultaneous access was achieved in our interface by a mechanism
we called virtual tables, which essentially was
SELECT
some subset of columns in some of the t0...tn
or some expression thereof
FROM
G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
We refer to the t0...tn as "the member tables" of G.
We have different versions of G corresponding to different sets
of member tables and different association criteria.
The largest of our cases has 26 different members.
Our mechanism was such that we defined a subset of columns in
each of the ti (or expressions thereof like distances etc.) as
"interesting", with an associated alias. Our interface usually
showed only such "virtual columns", but had a possibility to add
(naming them manually as ti.colname) to the SELECT also all
other "member columns" normally hidden.
We also allow to correlate a virtual table with a single physical
table tk (be it member or not) using the t0tk correlation table
(t0 is the "First Member").
- the above worked and still works, but has some clumsiness. When
we upgraded to mysql 5 and discovered the CREATE VIEW command
we decided to replace our virtual tables with views.
- for each G we define a view as
create algoritm=temptable view V as
SELECT
some subset of columns in some of the t0...tn or in G
or some expression thereof
FROM
G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
- the "algorithm=temptable" was required because without it some
of our queries (see below) entered in the "statistics" status forever
already with 11 members
- for the rest the VIEWs work nicely when used standalone and are easier
for the user ...
- ... but on the other hand they HIDE the member columns which are
not explicitly named in CREATE VIEW (where one wants to keep a
manageable number of columns). Hide means here that their names
ti.colname cannot be used in SELECT !
- so we devised an option by which on ticking on "show members also"
one can also include these ti.colname in the query
de facto this doubles the joins, because the statement built is
SELECT
list of (V.colname and ti.colname with i chosen among 0 and n)
FROM
( G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
) left join V on G.seq=V.seq
This statement NOW works (it did not work with e.g. 11 member tables
before we switched to ALGORITHM=TEMPTABLE in the CREATE VIEW).
An "explain select" for a query on such views gives that a view with
n members "with members also enabled" involves 29+n queries
(our [working] maximum is n=26 with 55 queries)
I have also noticed that such "explain select" are relatively fast
(0.3 sec) and are somehow cached (if I repeat one, it takes zero time)
- the last step would be to make a query on a VIEW with "show members
also" ticked, correlated with another table Tk (be Tk either a
member, one of t1...tn, or a non-member)
This involves a query like this
SELECT
list of (V.colname and ti.colname and Tk.colname)
FROM
( VTk left join
(
(G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
...
left join tn on G.tn=tn.seq
) left join V on G.seq=V.seq
) on VTk.V = V.seq
) left join Tk on VTk.Tk=Tk/seq
where the correlation table VTk is also a view
CREATE ALGORITHM=TEMPTABLE VIEW VTk as
select G.seq as V, G.Tk as Tk
from G left join t0Tk on G.t0=t0Tk.t0
which exploits the precomputed correlation among the "important"
table t0 (the "First Member") and Tk. Note that t0Tk is indexed
on (t0,Tk) but the view apparently is not indexed.
- now such a query on our largest table (26 members) hangs in
statistics status forever. I experimented therefore trying to
use a variable names of left joins (this to tell when it stops
working ... in real life I *might* need all, or any of the member).
The query for "view with members also correlated with another table"
in an "explain select" requires 33+n queries for n members.
Now the funny thing is that up to e.g. 9 members "explain select"
is rather fast (0.3 sec).
For 15 members (48 queries) slows down significantly (6.3 sec) and
is not cached (even if repeated soon still takes 6.3 sec).
For 20 members it takes forever.
But for 18 members (51 queries) instead takes only 0.5 sec.
When I say 9,15,18,20 members I mean the first 9,15,18,20 of the full
member list. Such a list is in an arbitrary order (sort of historical the
various tables entered the database).
Also I note that the output of "explain select" presents the various
tables in a varying order. For instance the slow 15-member case but also
the fast 9-member case have "<derived2>" (which is the 13000-element
un-indexed VTk correlation view) interspersed with the other table, the
fast 18-member case has it first.
Questions :
- what does "explain select" actually do and why sometimes hangs ?
- can this be overcome rearranging the order of the joins (note that
the bulk of the members are all joined with G), or introducing
parentheses or with other syntax changes ?
- or has it to do with some configuration parameter, maybe related
to what is cached, cache size or other ?
Thanks in advance to whoever is able to give hints.
--
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]