[MariaDB developers] Review of MDEV-37732 In MDEV-36861, analyze Q21

Michael Widenius via developers Mon, 06 Oct 2025 12:46:56 -0700

Hi!

Discussion points: (Done per phone)


Not sure if the suggested way to calculate fan out is correct for
dependent sub queries.

For example:

select * from t11 where t1.key in (select seq from seq_1_to_10);

In this case the total records out for the join is:
records_per_key(t1.key) * 10

This is independent of the number of rows in t1.

Because of this, I don't know how fanout (10/10) = 1 would help here.

Things does not get better with:

select * from t1 where t1.primary_key in (select 1 from seq_1_to_10);

In this case fanout would be 1/10, but total(records_out) would be 1.

In case of
select * from l1 where exists (select 1 from lineitem l2 where ...)

Assuming that the where uses l1. This means that the query should have
a records_out based on

count(l1) * chance_of_any_match(WHERE).

I don't see how this depends on the number of distinct values generated
anywhere in the sub query.

select * from people  where person_name in (select owner from laptops);

MIN(records_in_table, rec_per_key(person_name) * count (distinct owner))

--------------

In case of sub queries using exists:

select * from li1 where l1.L_RECEIPTDATE > l1.L_COMMITDATE and
exists (select 1 from l2 where l2.L_ORDERKEY = l1.L_ORDERKEY and
l2.L_SUPPKEY <> l1.L_SUPPKEY)

Here the number rows matching the WHERE is not really important as
we only need to know if there is at least one match.
Number of distinct values does not matter here.

I still don't know what the right fanout is here. I would assume the
fanout should be 1 as we cannot know if the sub query will have
any matches for the current row in l1.
In normal joins if we have a WHERE clause like:
l2.L_ORDERKEY = l1.L_ORDERKEY
we do assume that there is a rec_per_key(l1.order_key) matching rows.
However as we are using exists() this gives us 1 row.

We could use selectivity for doing a better match.
In the case of
           l2.L_ORDERKEY = l1.L_ORDERKEY and
           l2.L_SUPPKEY <> l1.L_SUPPKEY

We could assume the chance of using this row from l1 is the
selectivity and fanout of

rec_per_key(l1.l_supp_key) * selectivity(l2.L_SUPPKEY <> l1.L_SUPPKEY)

And the total 'records_out' after distinct elimination is:
MIN(table->records_out,
   rec_per_key(l1.l_supp_key) * selectivity(l2.L_SUPPKEY <> l1.L_SUPPKEY);

select * from people where person_name in (select owner from laptops);

The estimate of rows we have to accept from people is:

MIN(records_out(people), rec_per_key(person_name) * count (distinct owner))

-----------------

Review part

Commit comment.


     select * from t1 where t1.col1 in (select col2 from t2 where corr_cond)

      and the join order of
      t2, full scan rows=1M
      t1, ref access on t1.col1=t2.col2, rows=10

    it would conclude that "outer fanout" is 10.

Please add what the outer fanout is in this case.
I assume the total records combinations would be something like

MY_MAX(records(t1),
       records(t1)/count(distinct t1.col1) * count(distinct t2.col2))

>From which the fanout is count(distinct t2.col2) / count(distinct t1.col1)

-----


mysql-test/main/subselect4.test
Lacks \n at end of file

There is also several files in the patch that has end space that shows up
in git show. Please fix.

---------------

estimate_distinct_cardinality()

You are doing a test of:

if (!has_one_bit_set(map))

Shouldn't you test also that all fields are from the same table and the
table used is from the sub query ?

I am asking because of the following comment in
estimate_table_group_cardinality:
"Compute number of groups for a GROUP BY list that refers to a single table"

Looking at the code, it looks like estimate_item_list_cardinality
estimates the cardinality per table and multiply them, so things are
probably ok. I suggest we update the comment for
estimate_table_group_cardinality to

Compute number of groups for a GROUP BY list that refers to a single table.
If the list contains fields from multiple tables, the fields should be
sorted by table.

@return return in *group_list the position for first element of next
        table or 'end' if no more tables.

----------

By the way, I noticed on thing that we could improve in
estimate_table_group_cardinality.

  /*
    join->map2table is not set yet, so find the table in leaf_tables list
  */
  List_iterator<TABLE_LIST> li(join->select_lex->leaf_tables);
  TABLE_LIST *tbl;
  while ((tbl= li++))
  {
    if (tbl->table->map == table_bit)
    {
      table= tbl->table;
      table_records_after_where= rows2double(tbl->table->reginfo.
                                             join_tab->found_records);
      break;
    }
  }

We do not need the above loop at all as we can instead do:

  for (p= *group_list;
       p != end && (*p)->used_tables() == table_bit;
       p++)
  {
    Item *real= (*p)->real_item();
    if (real->type() == Item::FIELD_ITEM)
    {
      Field *field= ((Item_field*)real)->field;
+     if (!table)
+     {
+       table= field->table;
+       table_records_after_where= rows2double(table->reginfo.
+                                              join_tab->found_records);
+     }
      possible_keys.merge(field->part_of_key);
      columns.append(field->field_index);
    }

This means you can remove the changes you did in
estimate_table_group_cardinality()

Except this row:

     table_records_after_where= rows2double(tbl->table->reginfo.
                                            join_tab->found_records);

We need it in case of 'goto whole_table'
However you can probably have the loop for finding the table there
as this is a non common case.

Note I would actuall prefer if we could use records_out for the current
table instead as records_out is our best guess of how many records we
will use after the where clause.

Looking the code in estimate_item_list_cardinality,
Would not 'records_after_where' always be same or bigger than
join_output_card we have in the call to estimate_item_list_cardinality?
If this is the case, we could just return DBL_MAX in case of 'whole'

Note that we should probably have a test for the value of
estimate_table_group_cardinality() that we do not get an overflow for:

new_card *= estimate_table_group_cardinality(join, &pos, group_cols.end());

Something like:

estimate= estimate_table_group_cardinality(join, &pos, group_cols.end());
if (estimate > join_output_card ||
   (new_card= new_card*estimate) > join_output_card)

---------

@ -1974,6 +1975,7 @@ static bool convert_subq_to_sj(JOIN *parent_join, Item_in_
subselect *subq_pred)
     Item_func_eq *item_eq=
       new (thd->mem_root) Item_func_eq(thd, left_exp_orig,
                                        subq_lex->ref_pointer_array[0]);
+    sj_nest->sj_inner_expr_list.push_back(&item_eq->arguments()[1]);
     if (!item_eq)
       goto restore_tl_and_exit;

Move your change after test if item_eq allocation failed.

subselect *subq_pred)
     Item_func_eq *item_eq=
       new (thd->mem_root) Item_func_eq(thd, left_exp_orig,
                                        subq_lex->ref_pointer_array[0]);
+    sj_nest->sj_inner_expr_list.push_back(&item_eq->arguments()[1]);

Why not use:

sj_nest->sj_inner_expr_list.push_back(subq_lex->ref_pointer_array[0]);

(Looks like the generated code will be close as the as arguments() is inline
But the compiler could cache the value, which makes my option better.

Same comment for:
       sj_nest->sj_on_expr= and_items(thd, sj_nest->sj_on_expr, item_eq);
+      sj_nest->sj_inner_expr_list.push_back(&item_eq->arguments()[1]);

and
+      sj_nest->sj_inner_expr_list.push_back(row->addr(i));

-----

-  Optimize semi-join nests that could be run with sj-materialization
+  Optimize semi-join nests that could be run with sj-materialization.
+  Also, get fanout estimate for Duplicate Weedout.

>From our previous discussion, shouldn't we try to estimate the cardinality
of the outer table instead of the fanout of the sub query ?

    /* semi-join nests with only constant tables are not valid */
    /// DBUG_ASSERT(sj_nest->sj_inner_tables & ~join->const_table_map);

+    sj_nest->sj_fanout_ratio= 1.0;
     sj_nest->sj_mat_info= NULL;
+    bool sj_nest_handled= false;

Please move the bool declaration it start of the while loop.
(I prefer to find all declared loop variables in one place)

+        // Save the subquery's "fanout" (see below)
+        sj_nest->sj_fanout_ratio= subjoin_out_rows / sjm->rows;

This is always >= 1.0 as we know that sjm->rows is always <= subjoin_out_rows.

----
Safety fix (noticed while reviewing)

table.cc

void TABLE::init(THD *thd, TABLE_LIST *tl)
..
  opt_range_condition_rows=0;

Should be
opt_range_condition_rows= ULONGLONG_MAX;

(As this us upper bound of rows in the table, we shouldn't set up to 0)

-----------------

Review notes (for reviewer)

We call optimize_semijoin_nests(join, all_table_map) only in
- make_join_statistic()
  -  Called by start of optimizer to setup join structures
- JOIN::reoptimize()  - Note that reoptimize will also call choose_plan()
  - Called by choose_subquery_plan
    - Called by make_join_statistic() after choose_plan()
    - Called by update and delete

---------

I checked the your test case and I think that you should also add
EXPLAIN EXTENDED for the two new queries.

Filtered changes for U from 1.00 to 11.11, which is a clear user visible
change.

-------------
+    if (optimizer_new_mode(thd, NEW_MODE_FIX_SEMIJOIN_DUPS_WEEDOUT_CHECK) &&
+        !sj_nest_handled && (sj_nest->sj_inner_tables & ~join->const_table_map)
...
      double distinct_rows=
        estimate_distinct_cardinality(join, // parent join
                                      sj_nest,
                                      sj_nest->sj_in_exprs,
                                      subjoin_out_rows);
      sj_nest->sj_fanout_ratio= subjoin_out_rows / distinct_rows;

Shouldn't we here ensure that sj_fanout_ratio is not < 1 and that distinct_rows
is not 0

----------

Here is a patch to take selectivity into account when calculate semijoin_nests
It also simplifies the calculations a bit...

diff --git a/sql/opt_subselect.cc b/sql/opt_subselect.cc
index 36044906f44..6eff15143e1 100644
--- a/sql/opt_subselect.cc
+++ b/sql/opt_subselect.cc
@@ -2617,24 +2617,20 @@ bool optimize_semijoin_nests(JOIN *join,
table_map all_table_map)
         */
         SELECT_LEX *subq_select= sj_nest->sj_subq_pred->unit->first_select();
         {
-          for (uint i=0 ; i < join->const_tables + sjm->tables ; i++)
-          {
-            JOIN_TAB *tab= join->best_positions[i].table;
-            join->map2table[tab->table->tablenr]= tab;
-          }
           table_map map= 0;
+          double rows= 1.0;
           for (uint i=0; i < subq_select->item_list.elements; i++)
             map|= subq_select->ref_pointer_array[i]->used_tables();
-          map= map & ~PSEUDO_TABLE_BITS;
-          Table_map_iterator tm_it(map);
-          int tableno;
-          double rows= 1.0;
-          while ((tableno = tm_it.next_bit()) !=
Table_map_iterator::BITMAP_END)
+          for (uint i=0 ; i < join->const_tables + sjm->tables ; i++)
           {
-            ha_rows tbl_rows=join->map2table[tableno]->
-                             table->opt_range_condition_rows;
-
-            rows= COST_MULT(rows, rows2double(tbl_rows));
+            TABLE *table= join->best_positions[i].table->table;
+            if (map & (1ULL << table->tablenr))
+            {
+              ha_rows opt_rows= MY_MIN(table->opt_range_condition_rows,
+                                       table->stat_records() *
+                                       table->cond_selectivity);
+              rows= COST_MULT(rows, opt_rows);
+            }
           }
           sjm->rows= MY_MIN(sjm->rows, rows);
         }

---------
+/*
+  @brief
+    Compute inner and outer fanout for a join order subset which will be
+    handled by Duplicate Elimination.

It would be good to have some example of queries this function will
optimize and how.


+    if (optimizer_new_mode(join->thd,
NEW_MODE_FIX_SEMIJOIN_DUPS_WEEDOUT_CHECK))
+      get_inner_outer_fanouts(join, idx, &sj_inner_fanout, &sj_outer_fanout);
+

Add a comment that this will recompute sj_inner_fanout and sj_outer_fanout
computed above.  Only dups_cost, dups_remove_fanout and
temptable_rec_size will be used from above.

   total_fanout = COST_MULT(total_fanout, p->records_out);

remove space before =

    if (p->table->emb_sj_nest)
    {
      if (!(dups_removed_fanout & p->table->emb_sj_nest->sj_inner_tables))
      {
        // We haven't seen a table from this semi-join.
        // Put it into semi_join_fanout.
        semi_join_fanout *= p->table->emb_sj_nest->sj_fanout_ratio;
      }
      dups_removed_fanout |= p->table->table->map;
    }

It's a bit confusing that dups_removed_fanout only adds on bit, while
we checks for multiple bits with 'sj_inner_tables'.
We do we only use the first found table in the semijoin?

This code would be more clear if we would do:

dups_removed_fanout |= p->table->emb_sj_nest->sj_inner_tables;

  // Fanout caused by duplicates generated by all semi-joins we cover:
  *sj_inner_fanout= semi_join_fanout;
  // This is what will be left after duplicates are removed:
  *sj_outer_fanout= total_fanout / semi_join_fanout;

We should add a check that semi_join_fanout is not 0

--------------

I noticed we use a lot of found_records in opt_subselect.cc

It would be better if we always use records_out or alternative if we want
records before affected by other tables use:

 MY_MIN(table->opt_range_condition_rows,
                                       table->stat_records() *
                                       table->cond_selectivity);

Would also be good to at some point fix so that we do not have that
many different 'records' in table and best_positions.

In most cases, I assume we can replace found_records with
opt_range_condition_rows or the above expression. I think this is the right
choice for void Duplicate_weedout_picker::get_inner_outer_fanout

Notice that the code in get_quick_record_count guarantees that
   DBUG_ASSERT(select->quick->records >= table->opt_range_condition_rows);
   Note that get_quick_record_count returns select->quick->records which
   sets found_records.

Another alternative is to replace both opt_range_condition_rows and
found_records_found a new variable that would take both opt_range_condtion_rows
and selectivity into account like above.

This would clarify things:
- table->records_found    Number of rows after taking range and selectivity
                          into account
- best_positions->records_out  Number of rows after taking table access and
                               records_found into account.
- best_positions->records_init Number of rows accessed by the current index
                               not taking selectivity into account but
                               taking filtering into account.

We should also removes usage of records_read in table_after_join_selectivity
and usage of best_positions->records.
Note that I have a patch for 'main' to remove records_after_filter and
prev_record_reads.

----------------
  */
  {
    double outer_fanout=1.0;
    table_map outer_clean_tables=0;

Remove the { and move the declarations to function start.

    - outer tables that do not depend on anything in in the inner table
in in -> in

     *sj_inner_fanout= total_fanout / *sj_outer_fanout;

Please add protection if *sj_outer_fanout is 0.

Would be good to have in optimizer_trace both versions of sj_inner_fanout and
sj_outer_fanout calculated by
Duplicate_weedout_picker::get_inner_outer_fanouts()
As both numbers should be 'reasonable' it would help find issues with the
current formulas.

---------------

@@ -11493,7 +11493,7 @@ get_costs_for_tables(JOIN *join, table_map remaining_tab
les, uint idx,
   {
     table_map real_table_bit= s->table->map;
     if ((*allowed_tables & real_table_bit) &&
-        !(remaining_tables & s->dependent))
+        !(remaining_tables & join->allowed_tables & s->dependent))
     {

Why was this change needed.
Can it have any effect of old queries not using semi joins?

--------------

+++ b/sql/table.h
@@ -2568,6 +2568,10 @@ struct TABLE_LIST
   Name_resolution_context *on_context;  /* For ON expressions */
   Table_function_json_table *table_function; /* If it's the table function. */

+  /*
+    CAUTION: This may not be a valid expression. Luckily, we only check
+    whether this is NULL.
+  */
   Item          *sj_on_expr;

This not strictly true:

in eliminate_tables_for_list()

      if (tbl->sj_on_expr)
        tables_used_on_left |= tbl->sj_on_expr->used_tables();
    }

Grep also finds:

     if (sj_nest->sj_on_expr->fix_fields_if_needed(thd, &sj_nest->sj_on_expr))
    goto restore_tl_and_exit;

 parent_join->conds= and_items(thd, parent_join->conds, sj_nest->sj_on_expr)

etc.

Please clarify the above message when sj_on_expr can cause problems.

-------

Please read trough my comments and act on those you find important.
(I did add a lot of comments as I am not that familiar with the embedded
semijoin code and needed to clarify things for myself).

Regards,
Monty
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]

[MariaDB developers] Review of MDEV-37732 In MDEV-36861, analyze Q21

Reply via email to