Hi all,

I'm running sqlite-3.7.13 on cygwin. Playing around with various TPC-H queries with my class recently, I hit a strangely slow query and don't understand why it's so slow.

The schema and dataset generator are available at tpc.org, and end of this message has instructions to replicate my setup quickly.

Short version for the impatient: running a particular query with selective predicates on the first of many tables to be joined, those predicates aren't applied until after the last join... even though the query uses an index on the predicate column to access the offending table. I would have expected the index probe to have the effect of pushing down the predicate, but pushing down predicates manually makes the query run ~20x faster.

Long version follows...

The offending query (slightly modified version of Q7:

select count(*)
from supplier, lineitem, orders, customer, nation n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n1.n_name = 'ALGERIA' and n2.n_name = 'EGYPT')
          or (n1.n_name = 'EGYPT' and n2.n_name = 'ALGERIA')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';

The query counts 561 rows and takes 4.9 seconds to execute with a warm page cache (pragma cache_size = - 256000).

Note that this query is a pain to optimize: the intuitive dataflow takes a v-shape, with lineitem at the point and the two nations at the ends (nation -> customer -> orders -> lineitem <- supplier <- nation); the predicates are highly selective but involve nation (both ends of the chain) and lineitem (in the middle). The optimizer chooses this plan:

0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~1 rows)
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~1 rows)
0|2|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|3|3|SEARCH TABLE customer USING INDEX cnki (C_NATIONKEY=?) (~600 rows)
0|4|2|SEARCH TABLE orders USING INDEX ocki (O_CUSTKEY=?) (~15 rows)
0|5|1|SEARCH TABLE lineitem USING INDEX lpki (L_ORDERKEY=?) (~2 rows)

Dropping index nni and disabling automatic indexing improves things a bit (3.3 s, 50% speedup):

0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)
0|5|5|SEARCH TABLE nation AS n2 USING INDEX npki (N_NATIONKEY=?) (~1 rows)

Removing predicates on n1 yields ~15k rows in 3.4s, with the following plan:
sqlite> explain query plan select count(*)
from supplier, lineitem, orders, customer, nation n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n2.n_name = 'ALGERIA') or (n2.n_name = 'EGYPT')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';

0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)
0|5|5|SEARCH TABLE nation AS n2 USING INDEX npki (N_NATIONKEY=?) (~1 rows)

Presumably it's slow because predicates don't really hit until after all the joins finish. Removing predicates on n2 uses almost the same plan, and yields nearly the same row count (~14k rows), but executes in 0.25 s because the predicate applies before the first join:

sqlite> explain query plan select count(*)
from supplier, lineitem, orders, customer, nation n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n1.n_name = 'ALGERIA' ) or (n1.n_name = 'EGYPT')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';

0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)
0|5|5|SEARCH TABLE nation AS n2 USING COVERING INDEX npki (N_NATIONKEY=?) (~1 rows)

Working from that observation, I changed the query to manually push down predicates, which reduces the runtime to 0.25 s (~20x speedup):

sqlite> explain query plan select count(*)
from supplier, lineitem, orders, customer, (
        select * from nation where n_name = 'ALGERIA' or n_name = 'EGYPT'
     ) n1, nation n2
where s_suppkey = l_suppkey and o_orderkey = l_orderkey
      and c_custkey = o_custkey and s_nationkey = n1.n_nationkey
      and c_nationkey = n2.n_nationkey and (
          (n1.n_name = 'ALGERIA' and n2.n_name = 'EGYPT')
          or (n1.n_name = 'EGYPT' and n2.n_name = 'ALGERIA')
      ) and l_shipdate between  '1995-01-01' and  '1996-12-31';


0|0|4|SCAN TABLE nation AS n1 (~25 rows)
0|1|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|2|1|SEARCH TABLE lineitem USING INDEX lski (L_SUPPKEY=?) (~300 rows)
0|3|2|SEARCH TABLE orders USING INDEX opki (O_ORDERKEY=?) (~1 rows)
0|4|3|SEARCH TABLE customer USING INDEX cpki (C_CUSTKEY=?) (~1 rows)
0|5|5|SEARCH TABLE nation AS n2 USING INDEX npki (N_NATIONKEY=?) (~1 rows)

Unfortunately, restoring the index on nation.n_name, or enabling automatic indexing, undoes everything and pushes runtime back up to 4.9 s, even though all nation rows are fetched through the predicate column's index:

0|0|4|SEARCH TABLE nation USING INDEX nni (N_NAME=?) (~20 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~10 rows)
0|1|5|SEARCH TABLE nation AS n2 USING INDEX nni (N_NAME=?) (~10 rows)
0|2|0|SEARCH TABLE supplier USING INDEX snki (S_NATIONKEY=?) (~40 rows)
0|3|3|SEARCH TABLE customer USING INDEX cnki (C_NATIONKEY=?) (~600 rows)
0|4|2|SEARCH TABLE orders USING INDEX ocki (O_CUSTKEY=?) (~15 rows)
0|5|1|SEARCH TABLE lineitem USING INDEX lpki (L_ORDERKEY=?) (~2 rows)

Now, I can imagine that part of "lite"means "not smart enough to split that disjunction across two tables in an intelligent way." That's fine (though obviously it would be nice to have the feature). What I don't understand is this: when the nni index is available, all nation tuples (n1 and n2) are fetched using the index, but the query performs as if there were no predicate at all (meaning the predicates must have been applied after all the joins completed). This can be confirmed by removing all predicates on n_name: the query processes 183k rows in 3.2 seconds, the same time needed to run with only the n1 predicates missing.

Why isn't the index probe at the front of the pipeline more effective at reducing the number of rows that pass through (which, incidentally, there doesn't seem to be a stat for)?

It's not too painful to remove the offending index when nation has only 25 rows, but otherwise that first scan (without using nni) could have been really painful... is there any better way to go about this? (I've hit the problem before with other queries, though this is the first time I've analyzed it in enough detail to send to the list).

To replicate my local setup, download this compressed database (28MB): http://www.cs.utoronto.ca/~ryanjohn/tpch-10.db.xz
Then ANALYZE it after adding the following indexes:

create index snki on supplier(s_nationkey);
create index lski on lineitem(L_SUPPKEY);
create index opki on orders(O_ORDERKEY);
create index cpki on customer(C_CUSTKEY);
create index npki on nation(N_NATIONKEY);
create index nni on nation(N_NAME);

Thoughts?
Ryan

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to