My apology.
Proposed changes:
Add an additional index idx_entities_lookup to the Entities metastore table.
For Postgresql metastore:
CREATE INDEX IF NOT EXISTS idx_entities_lookup
ON entities USING btree (realm_id, catalog_id, parent_id, type_code,
sub_type_code)
INCLUDE (id, name);
For H2 metastore:
CREATE INDEX IF NOT EXISTS idx_entities_lookup
ON entities (realm_id, catalog_id, parent_id, type_code, sub_type_code,
id, name);
Background:
Original request [1]
Previous PR with changes [2]
Existing indexes
* Idx_entities (realm_id, catalog_id, id)
* Idx_locations (realm_id, parent_id, location_without_scheme)
Where Entities is used:
Entities is the main table for storing all primary objects.
Descendants of org.apache.polaris.core.entity.PolarisEntityCore:
* CatalogEntity (org.apache.polaris.core.entity)
* CatalogRoleEntity (org.apache.polaris.core.entity)
* GenericTableEntity (org.apache.polaris.core.entity.table)
* IcebergTableLikeEntity (org.apache.polaris.core.entity.table)
* NamespaceEntity (org.apache.polaris.core.entity)
* PolarisBaseEntity (org.apache.polaris.core.entity)
* PolarisEntity (org.apache.polaris.core.entity)
* PolicyEntity (org.apache.polaris.core.policy)
* PrincipalEntity (org.apache.polaris.core.entity)
* PrincipalRoleEntity (org.apache.polaris.core.entity)
* TableLikeEntity (org.apache.polaris.core.entity.table)
* TaskEntity (org.apache.polaris.core.entity)
And potentially, this table is under a heavy load from all CRUD operations.
Potential improvements:
As my experiment showed, an additional index significantly helps reduce the
query SCORE for listEntities[a] by using an Index Only Scan instead of a
Bitmap Heap Scan, which could potentially have a positive impact on
reducing the load on the DBMS during read operations.
The experiment:
I conducted a simple experiment and measured performance using EXPLAIN for
the postgresql metastore;
The original comment: [3]
Potential risks:
I conducted my experiments on only 1 thread and only on one read operation,
and possible additional loads on re-indexing operations during intensive
write operations were not taken into account. Such cases are possible
because the Entities table is used for all objects and roles.
Alternatives:
The alternative proposed by Dmitri Bourlatchkov is to delegate the decision
to add the index to the administrators of the Polaris deployment. This is
because each product is unique, and the resource maintainers should
determine the optimal configuration for their needs.
Recommendations can be placed in the metastore configuration section: [4]
[1]: https://github.com/apache/polaris/issues/2352
[2]: https://github.com/apache/polaris/pull/2465
[3]: https://github.com/apache/polaris/pull/2465#issuecomment-3244547998
[4]:
https://polaris.apache.org/in-dev/unreleased/configuring-polaris-for-production/#metastore-configuration
Regards, Artur
The experiment details:
----
Test ENV
OS: `24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025;
root:xnu-11417.121.6~2/RELEASE_ARM64_T6041 arm64`
Storage: Docker image `postgres:17`
Rows in total = 899040.
Expected rows result = 200005.
Results
Main build version
EXPLAIN ANALYZE
(SELECT id, catalog_id, parent_id, type_code, name, entity_version,
sub_type_code, create_timestamp, drop_timestamp, purge_timestamp,
to_purge_timestamp, last_update_timestamp, properties, internal_properties,
grant_records_version, location_without_scheme
FROM POLARIS_SCHEMA.ENTITIES
WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id
= 'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6);
Bitmap Heap Scan on entities (cost=3960.68..22260.05 rows=44861 width=140)
(actual time=16.287..43.851 rows=200005 loops=1)
Recheck Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id =
'6911759911282783572'::bigint) AND (parent_id =
'6911759911282783572'::bigint) AND (type_code = 6))
Filter: (sub_type_code = 0)
Heap Blocks: exact=3848
-> Bitmap Index Scan on constraint_name (cost=0.00..3949.47 rows=44861
width=0) (actual time=15.792..15.792 rows=200005 loops=1)
Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id =
'6911759911282783572'::bigint) AND (parent_id =
'6911759911282783572'::bigint) AND (type_code = 6))
Planning Time: 0.239 ms
Execution Time: 49.253 ms
New build version
CREATE INDEX IF NOT EXISTS idx_entities_lookup
ON entities USING btree (realm_id, catalog_id, parent_id, type_code,
sub_type_code)
INCLUDE (id, name);
EXPLAIN ANALYZE
(SELECT id, catalog_id, parent_id, type_code, name, sub_type_code
FROM POLARIS_SCHEMA.ENTITIES
WHERE catalog_id = 6911759911282783572 AND sub_type_code = 0 AND realm_id
= 'POLARIS' AND parent_id = 6911759911282783572 AND type_code = 6);
Index Only Scan using idx_entities_lookup on entities (cost=0.42..2975.89
rows=42744 width=50) (actual time=0.030..29.235 rows=200005 loops=1)
Index Cond: ((realm_id = 'POLARIS'::text) AND (catalog_id =
'6911759911282783572'::bigint) AND (parent_id =
'6911759911282783572'::bigint) AND (type_code = 6) AND (sub_type_code = 0))
Heap Fetches: 0
Planning Time: 0.089 ms
Execution Time: 35.317 ms
Conclusion
Using `Index Only Scan` instead of `Bitmap Heap Scan` helps:
- reduce query execution time up to 28%;
- reducing query cost up to 87%;
----