pvary commented on a change in pull request #2129:
URL: https://github.com/apache/iceberg/pull/2129#discussion_r572816829



##########
File path: 
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalogs.java
##########
@@ -36,4 +38,11 @@ public static HiveCatalog loadCatalog(Configuration conf) {
     String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, 
"");
     return CATALOG_CACHE.get(metastoreUri, uri -> new HiveCatalog(conf));
   }
+
+  public static HiveCatalog loadCatalog(String catalogName, Map<String, 
String> properties, Configuration conf) {
+    // metastore URI can be null in local mode
+    String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, 
"");
+    return CATALOG_CACHE.get(metastoreUri, uri -> (HiveCatalog) 
CatalogUtil.loadCatalog(HiveCatalog.class.getName(),
+            catalogName, properties, conf));
+  }

Review comment:
       
   > Garbage collecting connections could help if we think that the extra 
connection from the HS2 side is a problem. The AM connection probably wouldn't 
last long though?
   
   Garbage collection could help with inactive sessions, but connecting the 
catalog to the session and removing it on session close could be a better 
solution (if possible somehow).
   
   About the AM connection - I am not sure how it works with YARN, but with 
TezAM the Catalog/client should live until the session is there. My hope that 
we can get rid of this altogether in the near future.
   
   My main concern is that even with the best efforts we basically double the 
HMS connections if all of the users start to use Iceberg tables.
   
   > > ll that said, I think the name based caching is flawed if the 
configuration can be changed
   > 
   > I don't think that configuration can be changed, for the reason you cite. 
Once a catalog is created, it is configured and that can't be changed. We 
_could_ look up settings each time they are used but that seems like a lot of 
work for a problem we fundamentally don't need to solve.
   
   The thing I love about Iceberg is that any data scientist can create an S3 
bucket and put data there with the corresponding metadata and then can publish 
it. Another data scientist can just get the location of the table, and easily 
can create a table reading that data and then can aggregate these from multiple 
sources.
   These data scientists are quite capable guys and can configure their own 
catalogs, and they do not want to wait for an administrator to configure the 
catalog for them. So they would need the ability to configure the catalog 
themselves. On the other hand if they are anything like me then they are quite 
capable of making mistakes too 😄 and they will create misconfigured catalogs 
and would like to fix those without closing the Hive session.
   
   After some more thoughts I think the best solution would be to add a general 
parameter to the catalogs, like `cacheable` which defaults to true, but if it 
is false then the the catalog should skip/remove the old version with the same 
name from the cache and create a new one.
   
   This could fix the problem with the name based caching, and the "only" 
remaining one is the doubling of the HMS connection.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to