[sqlite] Query optimizer bug?

Ryan Johnson Thu, 14 Mar 2013 11:07:27 -0700

Hi all,

I'm running sqlite-3.7.13 on cygwin. Playing around with various TPC-Hqueries with my class recently, I hit a strangely slow query and don'tunderstand why it's so slow.

The schema and dataset generator are available at tpc.org, and end ofthis message has instructions to replicate my setup quickly.

Short version for the impatient: running a particular query withselective predicates on the first of many tables to be joined, thosepredicates aren't applied until after the last join... even though thequery uses an index on the predicate column to access the offendingtable. I would have expected the index probe to have the effect ofpushing down the predicate, but pushing down predicates manually makesthe query run ~20x faster.


Long version follows...

The offending query (slightly modified version of Q7:

select count(*)
from supplier, lineitem, orders, customer, nation n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n1.n_name = 'ALGERIA' and n2.n_name = 'EGYPT')
          or (n1.n_name = 'EGYPT' and n2.n_name = 'ALGERIA')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';

The query counts 561 rows and takes 4.9 seconds to execute with a warmpage cache (pragma cache_size = - 256000).

Note that this query is a pain to optimize: the intuitive dataflow takesa v-shape, with lineitem at the point and the two nations at the ends(nation -> customer -> orders -> lineitem <- supplier <- nation); thepredicates are highly selective but involve nation (both ends of thechain) and lineitem (in the middle). The optimizer chooses this plan:


0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~1 rows)
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~1 rows)
0|2|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|3|3|SEARCH TABLE customer USING INDEX cnki (C_NATIONKEY=?) (~600 rows)
0|4|2|SEARCH TABLE orders USING INDEX ocki (O_CUSTKEY=?) (~15 rows)
0|5|1|SEARCH TABLE lineitem USING INDEX lpki (L_ORDERKEY=?) (~2 rows)

Dropping index nni and disabling automatic indexing improves things abit (3.3 s, 50% speedup):


0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)
0|5|5|SEARCH TABLE nation AS n2 USING INDEX npki (N_NATIONKEY=?) (~1 rows)

Removing predicates on n1 yields ~15k rows in 3.4s, with the following plan:
sqlite> explain query plan select count(*)
from supplier, lineitem, orders, customer, nation n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n2.n_name = 'ALGERIA') or (n2.n_name = 'EGYPT')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';

0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)
0|5|5|SEARCH TABLE nation AS n2 USING INDEX npki (N_NATIONKEY=?) (~1 rows)

Presumably it's slow because predicates don't really hit until after allthe joins finish. Removing predicates on n2 uses almost the same plan,and yields nearly the same row count (~14k rows), but executes in 0.25 sbecause the predicate applies before the first join:


sqlite> explain query plan select count(*)
from supplier, lineitem, orders, customer, nation n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n1.n_name = 'ALGERIA' ) or (n1.n_name = 'EGYPT')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';

0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)

0|5|5|SEARCH TABLE nation AS n2 USING COVERING INDEX npki(N_NATIONKEY=?) (~1 rows)

Working from that observation, I changed the query to manually push downpredicates, which reduces the runtime to 0.25 s (~20x speedup):


sqlite> explain query plan select count(*)
from supplier, lineitem, orders, customer, (
        select * from nation where n_name = 'ALGERIA' or n_name = 'EGYPT'
     ) n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n1.n_name = 'ALGERIA' and n2.n_name = 'EGYPT')
          or (n1.n_name = 'EGYPT' and n2.n_name = 'ALGERIA')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';


0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)
0|5|5|SEARCH TABLE nation AS n2 USING INDEX npki (N_NATIONKEY=?) (~1 rows)

Unfortunately, restoring the index on nation.n_name, or enablingautomatic indexing, undoes everything and pushes runtime back up to 4.9s, even though all nation rows are fetched through the predicatecolumn's index:


0|0|4|SEARCH TABLE nation USING INDEX nni (N_NAME=?) (~20 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~10 rows)
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~10 rows)
0|2|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|3|3|SEARCH TABLE customer USING INDEX cnki (C_NATIONKEY=?) (~600 rows)
0|4|2|SEARCH TABLE orders USING INDEX ocki (O_CUSTKEY=?) (~15 rows)
0|5|1|SEARCH TABLE lineitem USING INDEX lpki (L_ORDERKEY=?) (~2 rows)

Now, I can imagine that part of "lite"means "not smart enough to splitthat disjunction across two tables in an intelligent way." That's fine(though obviously it would be nice to have the feature). What I don'tunderstand is this: when the nni index is available, all nation tuples(n1 and n2) are fetched using the index, but the query performs as ifthere were no predicate at all (meaning the predicates must have beenapplied after all the joins completed). This can be confirmed byremoving all predicates on n_name: the query processes 183k rows in 3.2seconds, the same time needed to run with only the n1 predicates missing.

Why isn't the index probe at the front of the pipeline more effective atreducing the number of rows that pass through (which, incidentally,there doesn't seem to be a stat for)?

It's not too painful to remove the offending index when nation has only25 rows, but otherwise that first scan (without using nni) could havebeen really painful... is there any better way to go about this? (I'vehit the problem before with other queries, though this is the first timeI've analyzed it in enough detail to send to the list).

To replicate my local setup, download this compressed database (28MB):http://www.cs.utoronto.ca/~ryanjohn/tpch-10.db.xz

Then ANALYZE it after adding the following indexes:

create index snki on supplier(s_nationkey);
create index lski on lineitem(L_SUPPKEY);
create index opki on orders(O_ORDERKEY);
create index cpki on customer(C_CUSTKEY);
create index npki on nation(N_NATIONKEY);
create index nni on nation(N_NAME);

Thoughts?
Ryan

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Query optimizer bug?

Reply via email to