Hi Artur, thanks for sharing the experiment results. Do we have any data on the write side? The entity table holds almost everything in Polaris. We cannot ignore the write side impact. Also if the read perf doesn't improve a lot, we may leave the index as is. We can document how the extra index helps in certain scenarios. So that users can make a decision per their use cases. Either a blog or a new doc page works for me.
Yufei On Fri, Sep 5, 2025 at 9:54 AM artur rakhmatulin < [email protected]> wrote: > Hi, I hope these clarifications regarding the context and experiment > details provide a clearer understanding of my proposal. > > About idx_entities_lookup: > > It is proposed to add an index on a limited set of fields used in the > listEntities query for JdbcBasePersistenceImpl > ("id", "catalog_id", "parent_id", "type_code", "name", "sub_type_code") > Changes using these fields instead of all fields used in the loadEntities > method were delivered in my PR [1] > > The index is aimed at improving the performance of the query used in the > listEntities method of the > org.apache.polaris.persistence.relational.jdbc.JdbcBasePersistenceImpl > class, which implements the BasePersistence, IntegrationPersistence > interfaces. > In turn, this method is used in the listEntities method of the > org.apache.polaris.core.persistence.AtomicOperationMetaStoreManager class, > which implements the BaseMetaStoreManager interface. > The listEntities method, in both AtomicOperationMetaStoreManager and > JdbcBasePersistenceImpl classes, is recommended for use when limited > information about an Entity is required, while the loadEntities method is > recommended for obtaining all information about an Entity, including > properties and internal_properties. > Due to its lightweight nature, the listEntities method is used much more > widely, and improving its performance by adding an additional index will > potentially bring much greater impact. > > About the previous experiment: > > The experiment cited in comment [2] refers to the time before the changes > were adopted and should be read as “Main build version = [3]” and “New > build version = any version after the changes were adopted in main.” > > New experiment: > > > psql -U root -d POLARIS -c "select count(1) from > POLARIS_SCHEMA.ENTITIES;" > count > --------- > 1020040 > > > psql -U root -d POLARIS -c "SELECT count(1) FROM POLARIS_SCHEMA.ENTITIES > WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id = > 'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6" > count > ------- > 20005 > > Version c783de9315a8e55bc4dbda9a073dd0a64f9553c0 > > [4]: > cat test2.sql > > pgbench -c 10 -j 2 -t 100 -f test1.sql -U root -d POLARIS > pgbench (17.6 (Debian 17.6-1.pgdg13+1)) > starting vacuum...end. > transaction type: test1.sql > scaling factor: 1 > query mode: simple > number of clients: 10 > number of threads: 2 > maximum number of tries: 1 > number of transactions per client: 100 > number of transactions actually processed: 1000/1000 > number of failed transactions: 0 (0.000%) > latency average = 27.275 ms > initial connection time = 6.839 ms > tps = 366.633291 (without initial connection time) > > > psql -U root -d POLARIS -c "EXPLAIN ANALYZE $(cat test1.sql)" > QUERY PLAN > > ----------------------------------------------------------------------------------------------------------------------------- > Index Scan using idx_entities on entities (cost=0.42..6.20 rows=1 > width=320) (actual time=0.028..5.895 rows=20005 loops=1) > Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id = > '6911759911282783572'::bigint)) > Filter: ((parent_id = '6911759911282783572'::bigint) AND (sub_type_code > = 0) AND (type_code = 6)) > Rows Removed by Filter: 14 > Planning Time: 0.429 ms > Execution Time: 6.632 ms > > > Version f6bcbd29e6ef33745d06d3e3bb70f4a7297a1224 > > [5]: > cat test2.sql > > > pgbench -c 10 -j 2 -t 100 -f test2.sql -U root -d POLARIS > pgbench (17.6 (Debian 17.6-1.pgdg13+1)) > starting vacuum...end. > transaction type: test2.sql > scaling factor: 1 > query mode: simple > number of clients: 10 > number of threads: 2 > maximum number of tries: 1 > number of transactions per client: 100 > number of transactions actually processed: 1000/1000 > number of failed transactions: 0 (0.000%) > latency average = 8.526 ms > initial connection time = 6.840 ms > tps = 1172.823796 (without initial connection time) > > > psql -U root -d POLARIS -c "EXPLAIN ANALYZE $(cat test2.sql)" > QUERY PLAN > > ---------------------------------------------------------------------------------------------------------------------------- > Index Scan using idx_entities on entities (cost=0.42..6.20 rows=1 > width=50) (actual time=0.019..4.903 rows=20005 loops=1) > Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id = > '6911759911282783572'::bigint)) > Filter: ((parent_id = '6911759911282783572'::bigint) AND (sub_type_code > = 0) AND (type_code = 6)) > Rows Removed by Filter: 14 > Planning Time: 0.377 ms > Execution Time: 5.477 ms > > > > CREATE INDEX IF NOT EXISTS idx_entities_lookup ON entities USING btree > (realm_id, catalog_id, parent_id, type_code, sub_type_code) INCLUDE (id, > name); > > > pgbench -c 10 -j 2 -t 100 -f test2.sql -U root -d POLARIS > pgbench (17.6 (Debian 17.6-1.pgdg13+1)) > starting vacuum...end. > transaction type: test2.sql > scaling factor: 1 > query mode: simple > number of clients: 10 > number of threads: 2 > maximum number of tries: 1 > number of transactions per client: 100 > number of transactions actually processed: 1000/1000 > number of failed transactions: 0 (0.000%) > latency average = 8.530 ms > initial connection time = 8.146 ms > tps = 1172.298584 (without initial connection time) > > > psql -U root -d POLARIS -c "EXPLAIN ANALYZE $(cat test2.sql)" > QUERY PLAN > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Index Only Scan using idx_entities_lookup on entities (cost=0.42..6.20 > rows=1 width=50) (actual time=0.026..5.215 rows=20005 loops=1) > Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id = > '6911759911282783572'::bigint) AND (parent_id = > '6911759911282783572'::bigint) AND (type_code = 6) AND (sub_type_code = 0)) > Heap Fetches: 20001 > Planning Time: 0.331 ms > Execution Time: 5.785 ms > > Conclusion: > Current results show that reducing columns for the selection helps decrease > query execution time, BUT adding a new index does not affect search speed. > By default, the existing idx_entities index is used along with an > additional FILTER, which provides better performance. > > Alternative for alternative: > > We can add the new idx_entities_lookup by default and also add a comment to > the documentation, recommending Polaris service maintainers to delete or > optimize it manually if they experience performance issues due to the extra > index. > > [1]: https://github.com/apache/polaris/pull/2465 > [2]: https://github.com/apache/polaris/pull/2465#issuecomment-3244547998 > [3]: > > https://github.com/apache/polaris/releases/tag/apache-polaris-1.1.0-incubating-rc0 > > > [4]: > cat test1.sql > SELECT id, catalog_id, parent_id, type_code, name, entity_version, > sub_type_code, create_timestamp, drop_timestamp, purge_timestamp, > to_purge_timestamp, last_update_timestamp, properties, internal_properties, > grant_records_version, location_without_scheme > FROM POLARIS_SCHEMA.ENTITIES > WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id > = 'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6 > > [5]: > cat test2.sql > SELECT id, catalog_id, parent_id, type_code, name, sub_type_code > FROM POLARIS_SCHEMA.ENTITIES > WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id > = 'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6 > > Thanks. >
