pvary commented on a change in pull request #2129:
URL: https://github.com/apache/iceberg/pull/2129#discussion_r571990789
##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogs.java
##########
@@ -36,4 +38,11 @@ public static HiveCatalog loadCatalog(Configuration conf) {
     String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
     return CATALOG_CACHE.get(metastoreUri, uri -> new HiveCatalog(conf));
   }
+
+  public static HiveCatalog loadCatalog(String catalogName, Map<String, String> properties, Configuration conf) {
+    // metastore URI can be null in local mode
+    String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
+    return CATALOG_CACHE.get(metastoreUri, uri -> (HiveCatalog) CatalogUtil.loadCatalog(
+        HiveCatalog.class.getName(), catalogName, properties, conf));
+  }
Review comment:
> 1. Are you saying that there will be 2 on HS2 and 2 on the AM because the pool size is 2? If there aren't concurrent uses, then the pool won't actually grow to the maximum size. Maybe the AM only has 1?
> 2. Is more than one connection abnormal? If so, Hive could set the pool size to 1.
> 3. Does table serialization help here to avoid using the pool on the AM?
Theoretically, there can be concurrent uses of the Catalogs on the HS2 side, but you are right that this will be a single connection in most cases. Also, `ClientPool` calls are serialized by the `run` method when the pool is exhausted, so it could be OK in the sense that the queries would still work, but the number of HMS connections per session is still doubled.
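To illustrate what I mean by serialization, here is a minimal sketch (not Iceberg's actual `ClientPool`; the names are made up) of how a fixed-size pool makes extra callers wait for a free client instead of opening new connections:
```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Simplified pool: a blocking queue hands out clients, so once all clients
// are checked out, further run() calls block until one is returned.
class SimpleClientPool<C> {
  private final BlockingQueue<C> clients;

  SimpleClientPool(List<C> initialClients) {
    this.clients = new ArrayBlockingQueue<>(initialClients.size(), true, initialClients);
  }

  <R> R run(Function<C, R> action) throws InterruptedException {
    C client = clients.take();  // blocks when the pool is exhausted
    try {
      return action.apply(client);
    } finally {
      clients.put(client);      // hands the client to the next waiting caller
    }
  }
}
```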
For writes, the Tez AM (for Tez) or the YARN Application Master (for MR) calls `commitJob`, which still needs to create the Catalog. @marton-bod will look into how we can push the commit to the HS2 side.
> 4. Can we share client pools across catalogs?
Currently, under the hood, we do something like this (we share the Catalog and, with that, the ClientPool too 😄).
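For reference, a sketch of what that URI-keyed sharing looks like (assuming `CATALOG_CACHE` is a Caffeine cache as in `HiveCatalogs`; `CatalogStub` stands in for `HiveCatalog`):
```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Sketch: two lookups with the same metastore URI return the same catalog
// instance, so the ClientPool inside it is shared by every caller.
class SharedCatalogSketch {
  // stand-in for HiveCatalog; the real cache stores HiveCatalog instances
  static class CatalogStub {}

  private static final Cache<String, CatalogStub> CATALOG_CACHE = Caffeine.newBuilder().build();

  static CatalogStub loadCatalog(String metastoreUri) {
    return CATALOG_CACHE.get(metastoreUri, uri -> new CatalogStub());
  }

  public static void main(String[] args) {
    // same URI -> same instance -> shared ClientPool
    System.out.println(loadCatalog("thrift://hms:9083") == loadCatalog("thrift://hms:9083"));  // true
  }
}
```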
OTOH, when connecting to the HMS, clients can set their own settings:
- One partially relevant example is #2070
- More relevant is that the connection timeout, directSql settings, and other options can be changed as well
- In theory, users connecting to the HMS could use their own credentials (some deployments use a service account)

So I think sharing HMS Clients between the Catalogs is not really a good practice.
I think the best solution would be a different HiveCatalog implementation that reuses the HMS Client from the Hive session. This is guaranteed to have the correct settings. It should be noted that the HMS client instance should be refreshed before every call, since Hive can close the client if the settings are changed. Also, this strengthens the dependency on Hive, which might not be desirable.
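A rough, hypothetical sketch of that approach (all names here are made up; in real Hive the fresh client would come from the session, e.g. something like `Hive.get(conf).getMSC()`):
```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical catalog that never caches the HMS client itself: it asks the
// session for a fresh client before every call, because the session may close
// and recreate the client whenever its settings change.
class SessionClientCatalog<C> {
  private final Supplier<C> freshClient;

  SessionClientCatalog(Supplier<C> freshClient) {
    this.freshClient = freshClient;
  }

  <R> R withClient(Function<C, R> action) {
    // re-fetch on every call instead of holding a reference across calls
    return action.apply(freshClient.get());
  }
}
```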
All that said, I think the `name`-based caching is flawed if the configuration can be changed (as can happen in the case of Hive at the moment). Imagine this situation:
- An expert user creates a new catalog by setting the configs.
- They query a table and find out that one of the configs is not set correctly (wrong URI or warehouse root).
- They fix the wrong config.
- They query the table again. Since the name did not change, the old Catalog is returned, so everything fails again.
Actually @lcspinter found the same issue in some of the unit tests 😄
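To make the failure mode concrete, here is a minimal sketch of a name-keyed cache (hypothetical names; `Map.of` requires Java 9+):
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// With the catalog name as the only cache key, the properties from the first
// load win forever; a corrected configuration is silently ignored.
class NameKeyedCatalogCache {
  private final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();

  Map<String, String> load(String name, Map<String, String> properties) {
    return cache.computeIfAbsent(name, n -> properties);
  }

  public static void main(String[] args) {
    NameKeyedCatalogCache catalogs = new NameKeyedCatalogCache();
    catalogs.load("hive", Map.of("uri", "thrift://wrong-host:9083"));           // misconfigured
    Map<String, String> fixed = catalogs.load("hive", Map.of("uri", "thrift://hms:9083"));  // "fixed"
    System.out.println(fixed.get("uri"));  // still prints thrift://wrong-host:9083
  }
}
```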