Review of:
commit c2c03308f5677b0794471dde32444b14324b3dd5
Author: Sergei Petrunia <[email protected]>
Date: Mon Jan 5 16:13:52 2026 +0200
MENT-2045: Optimizer not choosing optimal index due to unknown ICP
selectivity
Note that we probably cannot use ICP selectivity calculations if we
are using rowid filter pushdown.
I don't see your patch taking this into account.
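To illustrate the concern above, here is a small standalone sketch (not MariaDB code; all names are invented for illustration): if the rowid filter and the pushed index condition are built from the same predicate, multiplying both selectivities into the cost estimate double-counts the filtering effect, so a guard is needed.

```cpp
#include <cassert>
#include <cmath>

// Naive combination: treats the rowid-filter selectivity and the ICP
// selectivity as independent, which is wrong when they come from the
// same predicate.
double rows_after_naive(double rows, double filter_sel, double icp_sel)
{
  return rows * filter_sel * icp_sel;
}

// Sketch of the guard the review asks for: skip the ICP selectivity
// adjustment when a rowid filter is already applied.
double rows_after_guarded(double rows, double filter_sel, double icp_sel,
                          bool using_rowid_filter)
{
  return using_rowid_filter ? rows * filter_sel
                            : rows * filter_sel * icp_sel;
}
```

With 1000 rows and a shared predicate of selectivity 0.1, the naive formula predicts 10 rows while 100 is the right answer.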
diff --git a/mysql-test/main/selectivity.result
b/mysql-test/main/selectivity.result
index a9b56df2808..69c23f2ce7d 100644
--- a/mysql-test/main/selectivity.result
+++ b/mysql-test/main/selectivity.result
+drop table t0, t1;
+
Remove extra empty line above.
#
# Clean up
diff --git a/sql/opt_index_cond_pushdown.cc b/sql/opt_index_cond_pushdown.cc
index 6a24fa95b68..782f7e79b73 100644
--- a/sql/opt_index_cond_pushdown.cc
+++ b/sql/opt_index_cond_pushdown.cc
<cut>
@@ -440,3 +441,106 @@ void push_index_cond(JOIN_TAB *tab, uint keyno)
}
+/*
+ @brief
+ Given a field, set its bit in key_info[...]->selective_key_parts for
+ every index it is part of that could use Index Condition Pushdown.
+
+ @detail
+ This is preparation to account for selectivity of Pushed Index Condition
+*/
+
+void mark_field_keyparts_as_selective(THD *thd, Field *field)
+{
+ if (!optimizer_flag(thd, OPTIMIZER_SWITCH_INDEX_COND_PUSHDOWN))
+ return;
+
+ key_map::Iterator it(field->part_of_key_not_clustered);
+ uint keynr;
+ while ((keynr= it++) != it.BITMAP_END)
+ {
+ KEY *key= &field->table->key_info[keynr];
+
+ if (!(field->table->file->index_flags(keynr, 0, 1) &
+ HA_DO_INDEX_COND_PUSHDOWN))
+ continue;
+
+ // Find the key part
+ for (uint kp=0; kp < key->user_defined_key_parts; kp++)
+ {
+ if (key->key_part[kp].field == field)
+ {
+ key->selective_key_parts |= key_part_map(1) << kp;
+ break;
+ }
+ }
+ }
+}
It's a bit annoying to have a double loop here. Would be nice to have this
pre-calculated.
We discussed this on the phone and concluded that doing so would require
a long selective_key_parts[keys_in_table] array, which is a bit
excessive. Let's keep this as it is for now.
I wonder if it would also be useful to skip marking fields for covering
keys, i.e. check:
param->table->covering_keys.is_set(keynr)
+/*
+ Get selectivity of ICP condition if one does a ref access on a given
+ index using given number of key_parts.
+
+ @param key Index to be used
+ @param used_key_parts Key parts to be used for ref access,
+ 0 means full index scan will be used (and so,no ICP)
+*/
+
+double get_icp_selectivity(const TABLE *table, uint key,
+ key_part_map used_key_parts)
+{
+ if (!used_key_parts)
+ return 1.0; // Full index scan and no ICP.
+
Add a comment, something like:
/* Get the selective key parts that are not used by the ref access */
+ key_part_map sel_parts= (table->key_info[key].selective_key_parts &
+ ~used_key_parts);
+ if (!sel_parts)
+ return 1.0; // No selectivity on ICP fields
Suggested comment: // ICP is not supported on any of the not used key parts
+ Table_map_iterator it(sel_parts);
+ uint part;
+ double min_sel= 1.0;
+ KEY *key_info= &table->key_info[key];
+
+ /* Be conservative and pick the most selective estimate */
I assume you mean optimistic?
Conservative would be using the max of the cond_selectivity values.
We could of course also multiply the found selectivities; we do that in
other places. Why not here?
Or use the 'square root formula' that MS SQL Server is using?
+ while ((part= it.next_bit()) != it.BITMAP_END)
+ {
+ Field *field= table->field[key_info->key_part[part].field->field_index];
+ if (field->cond_selectivity < min_sel)
+ min_sel= field->cond_selectivity;
+ }
+ return min_sel;
+}
+
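For comparison, the four combination strategies mentioned above can be written out as standalone functions (illustration only; the backoff variant is a rough sketch of what MS SQL Server is said to do, not a verified reproduction of it):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Optimistic: what the patch does -- pick the most selective estimate.
double combine_min(const std::vector<double> &s)
{ return *std::min_element(s.begin(), s.end()); }

// Conservative: pick the least selective estimate.
double combine_max(const std::vector<double> &s)
{ return *std::max_element(s.begin(), s.end()); }

// Multiply, assuming the conditions are independent.
double combine_product(const std::vector<double> &s)
{
  double r= 1.0;
  for (double v : s)
    r*= v;
  return r;
}

// "Square root" / exponential-backoff style: sort ascending and dampen
// each successive factor by halving its exponent.
double combine_backoff(std::vector<double> s)
{
  std::sort(s.begin(), s.end());
  double r= 1.0, exponent= 1.0;
  for (double v : s)
  {
    r*= std::pow(v, exponent);
    exponent/= 2.0;
  }
  return r;
}
```

For selectivities {0.5, 0.1} this gives 0.1 (min), 0.5 (max), 0.05 (product) and about 0.071 (backoff), so the choice of formula can move the estimate by an order of magnitude.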
+/*
+ @brief
+ Get the quick select cost accounting for ICP condition selectivity
Add that this is only called when we consider a full range scan because
there was no usable ref access.
+*/
+
+double get_quick_cost_with_icp(THD *thd, QUICK_SELECT_I *quick)
+{
+ double quick_cost= quick->read_time;
+
+ /* QUICK_RANGE_SELECT can used ICP, other quick selects cannot */
+ if (quick->get_type() == QUICK_SELECT_I::QS_TYPE_RANGE)
+ {
+ /* This call will return 1.0 if ICP is not applicable for some reason */
+ double icp_sel= get_icp_selectivity(quick->head, quick->index,
+ quick->used_key_parts);
+ if (icp_sel < 1.0)
+ {
+ // Ok ICP is applicable and it is expected to be selective
+ double quick_index_cost=
+ quick->head->opt_range[quick->index].index_only_cost;
+ double quick_row_cost= quick_cost - quick_index_cost;
+ quick_cost= quick_index_cost + quick_row_cost * icp_sel;
+
+ Json_writer_object trace(thd, "icp_selectivity");
+ trace.add("sel", icp_sel);
+ }
+ }
+ return quick_cost;
+}
We are calling the above in best_access_path() for every row combination
when there are no proper indexes.
As all of the above is independent of the table order, it would be better
if we could cache the value in 'quick' for the next best_access_path()
call, or even right after we have calculated all available ranges.
We should also consider caching get_icp_selectivity(), as we are also
calling it many times in cost_for_index_read().
Note that in 11.0 you can implement this similarly to what we do in apply_filter:
adjusted_cost.row_cost.io*= selectivity;
adjusted_cost.row_cost.cpu*= selectivity;
adjusted_cost.copy_cost*= selectivity; // Cost of copying row or key
You may also want to add:
adjusted_cost.index_cost.cpu+= filter_lookup_cost;
where filter_lookup_cost should be something like:
KEY_COPY_COST + WHERE_COST/4 (this assumes that the cost of the pushdown
comparison is 1/4 of the normal WHERE cost).
We could have the WHERE_COST/4 in optimizer_costs.h as KEY_PUSHDOWN_WHERE_COST.
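The suggested 11.0-style adjustment can be sketched like this. The AdjustedCost struct is a hypothetical simplification invented for illustration; its field names mirror the snippet above, but the real 11.0 cost structures differ.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical simplification of the 11.0 cost structure discussed in
// the review.
struct AdjustedCost
{
  struct Cost { double io, cpu; };
  Cost row_cost, index_cost;
  double copy_cost;
};

// Apply ICP selectivity the way apply_filter does, plus the suggested
// per-lookup cost of evaluating the pushed-down condition.
void apply_icp_selectivity(AdjustedCost &c, double selectivity,
                           double key_copy_cost,
                           double pushdown_where_cost)
{
  c.row_cost.io*= selectivity;
  c.row_cost.cpu*= selectivity;
  c.copy_cost*= selectivity;            // cost of copying row or key
  c.index_cost.cpu+= key_copy_cost + pushdown_where_cost;
}
```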
diff --git a/sql/opt_range.cc b/sql/opt_range.cc
index 3b0b0de1fad..3524a5e1028 100644
--- a/sql/opt_range.cc
+++ b/sql/opt_range.cc
@@ -3569,6 +3569,7 @@ bool calculate_cond_selectivity_for_table(THD
*thd, TABLE *table, Item **cond)
{
uint fieldnr= key_info->key_part[0].fieldnr;
table->field[fieldnr-1]->cond_selectivity= quick_cond_selectivity;
+ mark_field_keyparts_as_selective(thd, table->field[fieldnr-1]);
if (i != used_key_parts)
table->field[fieldnr-1]->cond_selectivity*= selectivity_mult;
bitmap_clear_bit(used_fields, fieldnr-1);
@@ -3686,6 +3687,8 @@ bool calculate_cond_selectivity_for_table(THD
*thd, TABLE *table, Item **cond)
selectivity_for_column.add("selectivity_from_histogram",
key->field->cond_selectivity);
}
+
+ mark_field_keyparts_as_selective(thd, key->field);
}
}
}
I noticed that you did not call mark_field_keyparts_as_selective()
when we do sampling a bit later in the code. Please add a note about
this to the commit comment.
+++ b/sql/sql_select.cc
@@ -7926,14 +7942,42 @@ double cost_for_index_read(const THD *thd,
const TABLE *table, uint key,
Preferably we should fix things so that all costs are comparable.
(All compared costs should include TIME_FOR_COMPARE for all found
rows).
+
+ Also, account for selectivity of pushed index condition.
*/
@@ -8318,7 +8362,10 @@ best_access_path(JOIN *join,
.add("index", keyinfo->name);
if (!found_ref && table->opt_range_keys.is_set(key))
- tmp= adjust_quick_cost(table->opt_range[key].cost, 1);
+ {
+ DBUG_ASSERT(table->opt_range[key].rows == 1);
In theory the above should be true. However, the value could also be 0,
or the engine could lie in records_in_range() and return 2 or more.
Just to be safe, I would rather do
table->opt_range[key].rows= 1;
just in case. Alternatively, also pass 'records' as a parameter, as you
are now duplicating the records=... code anyway.
+ adjust_quick_cost(table, key, &tmp, &keyread_tmp);
+ }
else
tmp= table->file->avg_io_cost();
tmp*= prev_record_reads(join_positions, idx,
@@ -8353,11 +8400,7 @@ best_access_path(JOIN *join,
{
records= (double) table->opt_range[key].rows;
trace_access_idx.add("used_range_estimates", true);
- tmp= adjust_quick_cost(table->opt_range[key].cost,
- table->opt_range[key].rows);
- keyread_tmp= table->file->keyread_time(key, 1,
- table->opt_range[key].
- rows);
+ adjust_quick_cost(table, key, &tmp, &keyread_tmp);
goto got_cost;
The above code confused me, as it assumes that
keyread_tmp= table->file->keyread_time(key, 1,
                                       table->opt_range[key].rows);
is identical to
opt_range->index_only_cost
This may be true in many cases, but I am not sure we can guarantee
this with current or future code. Looking at get_key_scans_params(),
we only update index_only_cost if tree->ror_scans_map.is_set(idx) is
set, which depends on tree->index_scans, which depends on
!tree->keys_map.is_clear_all().
On the one hand, the documentation in table.h is clear that the index
only cost is stored in opt_range->index_only_cost.
On the other hand, index_only_cost is calculated in check_quick_select()
as:
param->table->opt_range[keynr].index_only_cost= cost->index_only_cost()
double index_only_cost()
{
  return IO_COEFF*idx_io_count*idx_avg_io_cost +
         CPU_COEFF*idx_cpu_cost;
}
which is not the same as
keyread_tmp= table->file->keyread_time(key, 1,
                                       table->opt_range[key].rows);
In 10.6 this could cause a notable difference in the value of
keyread_tmp.
In 11.0 I removed index_only_cost and replaced it with a function that
calculates this correctly when needed.
}
else
@@ -8497,8 +8540,7 @@ best_access_path(JOIN *join,
table->opt_range[key].ranges == 1 +
MY_TEST(ref_or_null_part)) //(C3)
{
records= (double) table->opt_range[key].rows;
- tmp= adjust_quick_cost(table->opt_range[key].cost,
- table->opt_range[key].rows);
+ adjust_quick_cost(table, key, &tmp, &keyread_tmp);
trace_access_idx.add("used_range_estimates", true);
goto got_cost2;
See comment above about keyread_tmp. We should take it into account here too
}
@@ -8620,7 +8662,7 @@ best_access_path(JOIN *join,
}
/* Limit the number of matched rows */
- tmp= cost_for_index_read(thd, table, key, (ha_rows) records,
+ tmp= cost_for_index_read(thd, table, key, found_part,
(ha_rows) records,
(ha_rows) s->worst_seeks);
records_for_key= (ha_rows) records;
set_if_smaller(records_for_key, thd->variables.max_seeks_for_key);
@@ -8954,9 +8996,13 @@ best_access_path(JOIN *join,
access (see first else-branch below), but we don't take it into
account here for range/index_merge access. Find out why this is so.
*/
+
double cmp_time= (s->found_records - rnd_records) / TIME_FOR_COMPARE;
+
+ double quick_read_time= get_quick_cost_with_icp(thd, s->quick);
+
tmp= COST_MULT(record_count,
- COST_ADD(s->quick->read_time, cmp_time));
+ COST_ADD(quick_read_time, cmp_time));
if ( s->quick->get_type() == QUICK_SELECT_I::QS_TYPE_RANGE)
{
Needed by non-merged semi-joins: SJ-Materialized table must have a valid
@@ -30725,6 +30774,7 @@ static bool get_range_limit_read_cost(const
JOIN_TAB *tab,
if (ref_rows > 0)
{
double tmp= cost_for_index_read(tab->join->thd, table, keynr,
+ PREV_BITS(key_part_map, kp),
ref_rows,
(ha_rows) tab->worst_seeks);
if (tmp < best_cost)
@@ -31129,7 +31179,7 @@ test_if_cheaper_ordering(bool in_join_optimizer,
thd->variables.optimizer_adjust_secondary_key_costs|=
OPTIMIZER_ADJ_SEC_KEY_COST | OPTIMIZER_ADJ_DISABLE_MAX_SEEKS;
- index_scan_time= cost_for_index_read(thd, table, nr,
+ index_scan_time= cost_for_index_read(thd, table, nr, 0,
table_records, HA_ROWS_MAX);
It is not obvious why key_parts is 0 here but a key_part_map in
get_range_limit_read_cost().
I assume that in the first case we are using a range and the key parts
up to kp have already been handled, while in the second case we are
doing a full index scan.
If the above is correct, please add this as a comment in both places.
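For readers unfamiliar with the macro: PREV_BITS(type, n) produces a mask with the n lowest bits set, so PREV_BITS(key_part_map, kp) covers exactly the key parts before kp. A portable sketch of the same idea (the branch for n >= 64 avoids undefined behaviour from over-wide shifts; the real macro may not need it):

```cpp
#include <cassert>
#include <cstdint>

using key_part_map = uint64_t;

// Mask with the n lowest bits set, i.e. key parts 0 .. n-1.
inline key_part_map prev_bits(unsigned n)
{
  return n >= 64 ? ~key_part_map(0) : (key_part_map(1) << n) - 1;
}
```

So passing 0 means "no key parts used for lookup", consistent with the full-index-scan reading above.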
Regards,
Monty
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]