excessive time spent in "statistics" status

Lucio Chiappetti Fri, 31 Aug 2007 08:01:40 -0700

I have some queries, involving a largish number of JOIN, which 
are apparently very slow or even take forever (a mysqladmin processlist 
shows them remain in the "statistics" status for a long time, in most 
cases I have to kill them after several minutes).


When I first had the problem I googled around and found some reference 
(which I've lost) saying that the "statistics" status is actually what one 
does with an EXPLAIN SELECT, and that this is done preliminarily to the 
actual query. It also said it might occur with a large number of joins 
because this analysis, for n joins MIGHT try up to n! combinations, unless 
one somehow specified the priorities (but the author did not remember 
how).

I thought to have overcome the problem using a feature of the CREATE 
VIEW command (see below), but apparently I simply moved it to an higher n.

Now I tried to see how it scales with the number of joins, and the curious 
things is that e.g. for n=9 it works fast, for n=15 it works slowly, for 
n=18 works fast again and for n=20 takes an infinite time.

I'll first explain my background :

 - I have a number of tables (let's call them t0, t1, t2 ...)
 - all of them have an auto_increment column called seq which is also
   an index

 - one table (t0) is "more important" (actually it is list of
   celestial X-ray sources while the other are celestial objects
   in other wavebands but this is irrelevant to you).

 - I have precomputed correlation tables among t0 and each of
   the other. These tables are called eg. t0ti, have two columns
   t0 and ti corresponding to the t0.seq and ti.seq of objects which
   are "associated". They are indexed on (t0,ti). Note that an
   object in t0 can be associated with 1 or more or zero (t0ti.ti null)
   objects in ti.

 - I originally (already under mysql 3) devised a way to identify 
   counterparts in MORE tables (all these associations are based on 
   spherical distance + other criteria). This involved creating a
   working table G

   This table has columns named t0 t1 ... tn (containing the pointers
   t0.seq t1.seq ... for counterparts associated and validated according 
   to some criteria) plus other service columns

   The simultaneous access was achieved in our interface by a mechanism
   we called virtual tables, which essentially was

     SELECT
      some subset of columns in some of the t0...tn
      or some expression thereof
     FROM
      G left join t0 on G.t0=t0.seq
        left join t1 on G.t1=t1.seq
        ...
        left join tn on G.tn=tn.seq    

   We refer to the t0...tn as "the member tables" of G.

   We have different versions of G corresponding to different sets
   of member tables and different association criteria.

   The largest of our cases has 26 different members.

   Our mechanism was such that we defined a subset of columns in
   each of the ti (or expressions thereof like distances etc.) as
   "interesting", with an associated alias. Our interface usually
   showed only such "virtual columns", but had a possibility to add
   (naming them manually as ti.colname) to the SELECT also all
   other "member columns" normally hidden.

   We also allow to correlate a virtual table with a single physical
   table tk (be it member or not) using the t0tk correlation table
   (t0 is the "First Member").

 - the above worked and still works, but has some clumsiness. When
   we upgraded to mysql 5 and discovered the CREATE VIEW command
   we decided to replace our virtual tables with views.

 - for each G we define a view as

     create algoritm=temptable view V as
     SELECT
      some subset of columns in some of the t0...tn or in G
      or some expression thereof
     FROM
      G left join t0 on G.t0=t0.seq
        left join t1 on G.t1=t1.seq
        ...
        left join tn on G.tn=tn.seq    

 - the "algorithm=temptable" was required because without it some
   of our queries (see below) entered in the "statistics" status forever 
   already with 11 members
   
 - for the rest the VIEWs work nicely when used standalone and are easier
   for the user ...

 - ... but on the other hand they HIDE the member columns which are
   not explicitly named in CREATE VIEW (where one wants to keep a
   manageable number of columns). Hide means here that their names
   ti.colname cannot be used in SELECT !   

 - so we devised an option by which on ticking on "show members also"
   one can also include these ti.colname in the query
   
   de facto this doubles the joins, because the statement built is

   SELECT 
    list of (V.colname and ti.colname with i chosen among 0 and n)
   FROM
    ( G left join t0 on G.t0=t0.seq
        left join t1 on G.t1=t1.seq
        ...
        left join tn on G.tn=tn.seq    
    )   left join V  on G.seq=V.seq 

   This statement NOW works (it did not work with e.g. 11 member tables
   before we switched to ALGORITHM=TEMPTABLE in the CREATE VIEW).

   An "explain select" for a query on such views gives that a view with
   n members "with members also enabled" involves 29+n queries

   (our [working] maximum is n=26 with 55 queries)

   I have also noticed that such "explain select" are relatively fast
   (0.3 sec) and are somehow cached (if I repeat one, it takes zero time)

 - the last step would be to make a query on a VIEW with "show members 
   also" ticked, correlated with another table Tk (be Tk either a
   member, one of t1...tn, or a non-member)

   This involves a query like this

   SELECT 
    list of (V.colname and ti.colname and Tk.colname)
   FROM
   ( VTk left join 
    ( 
     (G left join t0 on G.t0=t0.seq
        left join t1 on G.t1=t1.seq
        ...
        left join tn on G.tn=tn.seq    
     )  left join V  on G.seq=V.seq 
    ) on VTk.V = V.seq
   )    left join Tk on VTk.Tk=Tk/seq

   where the correlation table VTk is also a view

   CREATE ALGORITHM=TEMPTABLE VIEW VTk as
    select G.seq as V, G.Tk as Tk 
    from   G left join t0Tk on G.t0=t0Tk.t0

   which exploits the precomputed correlation among the "important"
   table t0 (the "First Member") and Tk. Note that t0Tk is indexed
   on (t0,Tk) but the view apparently is not indexed.

 - now such a query on our largest table (26 members) hangs in
   statistics status forever. I experimented therefore trying to
   use a variable names of left joins (this to tell when it stops
   working ... in real life I *might* need all, or any of the member).

   The query for "view with members also correlated with another table"
   in an "explain select" requires 33+n queries for n members.

   Now the funny thing is that up to e.g. 9 members "explain select"
   is rather fast (0.3 sec). 

   For 15 members (48 queries) slows down significantly (6.3 sec) and
   is not cached (even if repeated soon still takes 6.3 sec).

   For 20 members it takes forever.

   But for 18 members (51 queries) instead takes only 0.5 sec.
   
When I say 9,15,18,20 members I mean the first 9,15,18,20 of the full 
member list. Such a list is in an arbitrary order (sort of historical the 
various tables entered the database).

Also I note that the output of "explain select" presents the various 
tables in a varying order. For instance the slow 15-member case but also 
the fast 9-member case have "<derived2>" (which is the 13000-element 
un-indexed VTk correlation view) interspersed with the other table, the
fast 18-member case has it first.

Questions :

 - what does "explain select" actually do and why sometimes hangs ?

 - can this be overcome rearranging the order of the joins (note that
   the bulk of the members are all joined with G), or introducing
   parentheses or with other syntax changes ?

 - or has it to do with some configuration parameter, maybe related
   to what is cached, cache size or other ?
 
Thanks in advance to whoever is able to give hints.

-- 
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

excessive time spent in "statistics" status

Reply via email to