Hi,

Thanks for working the feature. As a user, I find it useful, and I'd like to use
it in v18! Although I've just started start looking into it, I have a few 
questions.


(1)

Is it better to make the order of output consistent? For example, even
though there are three clauses shown in the below case, the order does not
match.
* "Filter" shows that "id1" is first.
* "Ext Stats" shows that "id2" is first.

-- An example
DROP TABLE IF EXISTS test;
CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM 
generate_series(1,1000000) s(i));
create statistics test_s1 on id1, id2 from test; analyze;

=# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
                                      QUERY PLAN                                
       
---------------------------------------------------------------------------------------
 Gather  (cost=1000.00..23092.77 rows=84311 width=20)
   Workers Planned: 2
   ->  Parallel Seq Scan on test  (cost=0.00..13661.67 rows=35130 width=20)
         Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))                      
          -- here
         Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND 
(id1 = 1))    -- here
(5 rows)



(2)

Do we really need the schema names without VERBOSE option? As in the above case,
"Ext Stats" shows schema name "public", even though the table name "test" isn't
shown with its schema name.

Additionally, if the VERBOSE option is specified, should the column names also 
be
printed with namespace?

=# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR 
id2 > 10);
                                      QUERY PLAN                                
       
---------------------------------------------------------------------------------------
 Gather  (cost=1000.00..22947.37 rows=82857 width=20)
   Output: id1, id2, id3, value
   Workers Planned: 2
   ->  Parallel Seq Scan on public.test  (cost=0.00..13661.67 rows=34524 
width=20)
         Output: id1, id2, id3, value
         Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10)))
         Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND 
(id1 = 1))   -- here
(7 rows)



(3)

I might be misunderstanding something, but do we need the clauses? Is there any
case where users would want to know the clauses? For example, wouldn't the
following be sufficient?

> Ext Stats: id1, id2 using test_s1



(4)

The extended statistics with "dependencies" or "ndistinct" option don't seem to
be shown in EXPLAIN output. Am I missing something? (Is this expected?)

I tested the examples in the documentation. Although it might work with
"mcv" option, I can't confirm that it works because "unrecognized node type"
error occurred in my environment.
https://www.postgresql.org/docs/current/sql-createstatistics.html

(It might be wrong since I'm beginner with extended stats codes.)
IIUC, the reason is that the patch only handles 
statext_mcv_clauselist_selectivity(),
and doesn't handle dependencies_clauselist_selectivity() and 
estimate_multivariate_ndistinct().


-- doesn't work with "dependencies" option?
=# \dX
                        List of extended statistics
 Schema |  Name   |     Definition     | Ndistinct | Dependencies |   MCV   
--------+---------+--------------------+-----------+--------------+---------
 public | s1      | a, b FROM t1       | (null)    | defined      | (null)
(2 rows)

=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
                                                    QUERY PLAN                  
                                   
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 
rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..10675.00 rows=42 width=8) (actual 
time=30.300..46.610 rows=33 loops=3)
         Filter: ((a = 1) AND (b = 0))
         Rows Removed by Filter: 333300
 Planning Time: 0.246 ms
 Execution Time: 50.361 ms
(8 rows)

-- doesn't work with "ndistinct"?
=# \dX
                                             List of extended statistics
 Schema | Name |                            Definition                          
  | Ndistinct | Dependencies |  MCV   
--------+------+------------------------------------------------------------------+-----------+--------------+--------
 public | s3   | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM 
t3 | defined   | (null)       | (null)
(1 row)

postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
                                                QUERY PLAN                      
                          
----------------------------------------------------------------------------------------------------------
 Seq Scan on t3  (cost=0.00..10210.01 rows=45710 width=8) (actual 
time=0.027..143.199 rows=44640 loops=1)
   Filter: (date_trunc('month'::text, a) = '2020-01-01 00:00:00'::timestamp 
without time zone)
   Rows Removed by Filter: 480961
 Planning Time: 0.088 ms
 Execution Time: 144.590 ms
(5 rows)

-- doesn't work with "mvc". It might work, but the error happens in my 
environments
=# \dX
                    List of extended statistics
 Schema | Name |  Definition  | Ndistinct | Dependencies |   MCV   
--------+------+--------------+-----------+--------------+---------
 public | s2   | a, b FROM t2 | (null)    | (null)       | defined
(1 row)

-- I encountered the error with the query.
=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t2 WHERE (a = 1) AND (b = 1);
ERROR:  unrecognized node type: 268



Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION



Reply via email to