pvary commented on a change in pull request #2129:
URL: https://github.com/apache/iceberg/pull/2129#discussion_r572816829
##########
File path: hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogs.java
##########
@@ -36,4 +38,11 @@ public static HiveCatalog loadCatalog(Configuration conf) {
     String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
     return CATALOG_CACHE.get(metastoreUri, uri -> new HiveCatalog(conf));
   }
+
+  public static HiveCatalog loadCatalog(String catalogName, Map<String, String> properties, Configuration conf) {
+    // metastore URI can be null in local mode
+    String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
+    return CATALOG_CACHE.get(metastoreUri, uri -> (HiveCatalog) CatalogUtil.loadCatalog(HiveCatalog.class.getName(),
+        catalogName, properties, conf));
+  }
Review comment:
> Garbage collecting connections could help if we think that the extra connection from the HS2 side is a problem. The AM connection probably wouldn't last long though?
Garbage collection could help with inactive sessions, but tying the catalog to the session and removing it on session close would be a better solution (if that is possible somehow).
About the AM connection: I am not sure how it works with YARN, but with the Tez AM the Catalog/client should live as long as the session exists. My hope is that we can get rid of this altogether in the near future.
My main concern is that even with our best efforts we basically double the HMS connections if all of the users start to use Iceberg tables.
> > All that said, I think the name based caching is flawed if the configuration can be changed
>
> I don't think that configuration can be changed, for the reason you cite. Once a catalog is created, it is configured and that can't be changed. We _could_ look up settings each time they are used but that seems like a lot of work for a problem we fundamentally don't need to solve.
The thing I love about Iceberg is that any data scientist can create an S3 bucket, put data there with the corresponding metadata, and then publish it. Another data scientist can just get the location of the table, easily create a table reading that data, and then aggregate data from multiple sources.
These data scientists are quite capable people who can configure their own catalogs, and they do not want to wait for an administrator to configure the catalog for them. So they need the ability to configure the catalog themselves. On the other hand, if they are anything like me, they are quite capable of making mistakes too 😄 and they will create misconfigured catalogs and would like to fix those without closing the Hive session.
After some more thought, I think the best solution would be to add a general parameter to the catalogs, like `cacheable`, which defaults to true; if it is false, then the catalog should skip/remove the old version with the same name from the cache and create a new one.
This could fix the problem with name-based caching, and the "only" remaining issue is the doubling of the HMS connections.
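The proposed `cacheable` property could look roughly like the sketch below. All names here (`CatalogCache`, the `CACHEABLE` key) are illustrative, not part of the Iceberg API; one possible interpretation of "skip/remove the old version" is shown, where a non-cacheable load evicts any stale entry and returns a fresh, uncached instance.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical name-keyed catalog cache honoring a "cacheable" property.
public class CatalogCache {
  public static final String CACHEABLE = "cacheable"; // illustrative property key

  private final Map<String, Object> cache = new ConcurrentHashMap<>();

  public Object load(String name, Map<String, String> props, Function<String, Object> factory) {
    boolean cacheable = Boolean.parseBoolean(props.getOrDefault(CACHEABLE, "true"));
    if (!cacheable) {
      // Drop any stale entry so a misconfigured catalog can be replaced,
      // then hand back a fresh instance without caching it.
      cache.remove(name);
      return factory.apply(name);
    }
    // Default path: reuse the catalog cached under this name.
    return cache.computeIfAbsent(name, factory);
  }
}
```

With this shape, a user who misconfigured a catalog can set `cacheable=false` on the next load to force a rebuild without closing the Hive session.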
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]