Hi, I hope these clarifications regarding the context and experiment
details provide a clearer understanding of my proposal.
About idx_entities_lookup:
It is proposed to add an index on a limited set of fields used in the
listEntities query for JdbcBasePersistenceImpl
("id", "catalog_id", "parent_id", "type_code", "name", "sub_type_code")
Changes using these fields instead of all fields used in the loadEntities
method were delivered in my PR [1]
The index is aimed at improving the performance of the query used in the
listEntities method of the
org.apache.polaris.persistence.relational.jdbc.JdbcBasePersistenceImpl
class, which implements the BasePersistence, IntegrationPersistence
interfaces.
In turn, this method is used in the listEntities method of the
org.apache.polaris.core.persistence.AtomicOperationMetaStoreManager class,
which implements the BaseMetaStoreManager interface.
The listEntities method, in both AtomicOperationMetaStoreManager and
JdbcBasePersistenceImpl classes, is recommended for use when limited
information about an Entity is required, while the loadEntities method is
recommended for obtaining all information about an Entity, including
properties and internal_properties.
Due to its lightweight nature, the listEntities method is used much more
widely, and improving its performance by adding an additional index will
potentially bring much greater impact.
About the previous experiment:
The experiment cited in comment [2] refers to the time before the changes
were adopted and should be read as “Main build version = [3]” and “New
build version = any version after the changes were adopted in main.”
New experiment:
> psql -U root -d POLARIS -c "select count(1) from POLARIS_SCHEMA.ENTITIES;"
count
---------
1020040
> psql -U root -d POLARIS -c "SELECT count(1) FROM POLARIS_SCHEMA.ENTITIES
WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id =
'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6"
count
-------
20005
Version c783de9315a8e55bc4dbda9a073dd0a64f9553c0
[4]: > cat test2.sql
pgbench -c 10 -j 2 -t 100 -f test1.sql -U root -d POLARIS
pgbench (17.6 (Debian 17.6-1.pgdg13+1))
starting vacuum...end.
transaction type: test1.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
number of transactions per client: 100
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 27.275 ms
initial connection time = 6.839 ms
tps = 366.633291 (without initial connection time)
> psql -U root -d POLARIS -c "EXPLAIN ANALYZE $(cat test1.sql)"
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_entities on entities (cost=0.42..6.20 rows=1
width=320) (actual time=0.028..5.895 rows=20005 loops=1)
Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id =
'6911759911282783572'::bigint))
Filter: ((parent_id = '6911759911282783572'::bigint) AND (sub_type_code
= 0) AND (type_code = 6))
Rows Removed by Filter: 14
Planning Time: 0.429 ms
Execution Time: 6.632 ms
Version f6bcbd29e6ef33745d06d3e3bb70f4a7297a1224
[5]: > cat test2.sql
> pgbench -c 10 -j 2 -t 100 -f test2.sql -U root -d POLARIS
pgbench (17.6 (Debian 17.6-1.pgdg13+1))
starting vacuum...end.
transaction type: test2.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
number of transactions per client: 100
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 8.526 ms
initial connection time = 6.840 ms
tps = 1172.823796 (without initial connection time)
> psql -U root -d POLARIS -c "EXPLAIN ANALYZE $(cat test2.sql)"
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_entities on entities (cost=0.42..6.20 rows=1
width=50) (actual time=0.019..4.903 rows=20005 loops=1)
Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id =
'6911759911282783572'::bigint))
Filter: ((parent_id = '6911759911282783572'::bigint) AND (sub_type_code
= 0) AND (type_code = 6))
Rows Removed by Filter: 14
Planning Time: 0.377 ms
Execution Time: 5.477 ms
> CREATE INDEX IF NOT EXISTS idx_entities_lookup ON entities USING btree
(realm_id, catalog_id, parent_id, type_code, sub_type_code) INCLUDE (id,
name);
> pgbench -c 10 -j 2 -t 100 -f test2.sql -U root -d POLARIS
pgbench (17.6 (Debian 17.6-1.pgdg13+1))
starting vacuum...end.
transaction type: test2.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 2
maximum number of tries: 1
number of transactions per client: 100
number of transactions actually processed: 1000/1000
number of failed transactions: 0 (0.000%)
latency average = 8.530 ms
initial connection time = 8.146 ms
tps = 1172.298584 (without initial connection time)
> psql -U root -d POLARIS -c "EXPLAIN ANALYZE $(cat test2.sql)"
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using idx_entities_lookup on entities (cost=0.42..6.20
rows=1 width=50) (actual time=0.026..5.215 rows=20005 loops=1)
Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id =
'6911759911282783572'::bigint) AND (parent_id =
'6911759911282783572'::bigint) AND (type_code = 6) AND (sub_type_code = 0))
Heap Fetches: 20001
Planning Time: 0.331 ms
Execution Time: 5.785 ms
Conclusion:
Current results show that reducing columns for the selection helps decrease
query execution time, BUT adding a new index does not affect search speed.
By default, the existing idx_entities index is used along with an
additional FILTER, which provides better performance.
Alternative for alternative:
We can add the new idx_entities_lookup by default and also add a comment to
the documentation, recommending Polaris service maintainers to delete or
optimize it manually if they experience performance issues due to the extra
index.
[1]: https://github.com/apache/polaris/pull/2465
[2]: https://github.com/apache/polaris/pull/2465#issuecomment-3244547998
[3]:
https://github.com/apache/polaris/releases/tag/apache-polaris-1.1.0-incubating-rc0
[4]: > cat test1.sql
SELECT id, catalog_id, parent_id, type_code, name, entity_version,
sub_type_code, create_timestamp, drop_timestamp, purge_timestamp,
to_purge_timestamp, last_update_timestamp, properties, internal_properties,
grant_records_version, location_without_scheme
FROM POLARIS_SCHEMA.ENTITIES
WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id
= 'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6
[5]: > cat test2.sql
SELECT id, catalog_id, parent_id, type_code, name, sub_type_code
FROM POLARIS_SCHEMA.ENTITIES
WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id
= 'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6
Thanks.