[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-20 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636196743



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+
+public class IcebergTableUtil {
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the 
configured catalog.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+// look for the table object stored in the query state. If it's null, it 
means the table was not loaded yet
+// within the same query therefore we claim it through the Catalogs API 
and then store it in query state.
+QueryState queryState = SessionState.get()
+
.getQueryState(configuration.get(HiveConf.ConfVars.HIVEQUERYID.varname));
+Table table = null;
+if (queryState != null) {

Review comment:
   Do we expect the query state to be null at any point during compilation? 
If so, that would result in constant reloading of the table. So maybe we should 
either remove the null checks to simplify the code, or at least log it if it's 
not there? What do you think?

##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java
##
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import java.util.Properties;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.ql.QueryState;
+import org.apache.hadoop.hive.ql.session.SessionState;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.mr.Catalogs;
+
+public class IcebergTableUtil {
+
+  private IcebergTableUtil() {
+
+  }
+
+  /**
+   * Load the iceberg table either from the {@link QueryState} or through the 
configured catalog.
+   * @param configuration a Hadoop configuration
+   * @param properties controlling properties
+   * @return
+   */
+  static Table getTable(Configuration configuration, Properties properties) {
+// look for the table object stored in the query state. If it's null, it 
means the table was not loaded yet

Review comment:
   nit: I think this comment belongs to the javadoc more naturally, but 
it's up to you, I don't feel strongly on that




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-20 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636173569



##
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##
@@ -349,6 +350,20 @@
 
   private final AtomicLong sparkSessionId = new AtomicLong();
 
+  private final Map queryStateMap = new HashMap<>();
+
+  public Object getQueryState(String queryId) {

Review comment:
   Can we update the return type too so we don't need any casting?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-20 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636154183



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -97,13 +98,15 @@ public static Table loadTable(Configuration conf, 
Properties props) {
 props.getProperty(InputFormatConfig.CATALOG_NAME));
   }
 
-  private static Table loadTable(Configuration conf, String tableIdentifier, 
String tableLocation,
- String catalogName) {
+  private static Table loadTable(Configuration conf, String tableIdentifier, 
String tableLocation, String catalogName) {
 Optional catalog = loadCatalog(conf, catalogName);
 
+Table cachedTable = null;

Review comment:
   Are these changes still necessary?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-20 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636151988



##
File path: ql/src/java/org/apache/hadoop/hive/ql/QueryState.java
##
@@ -59,6 +60,11 @@
 
   static public final String USERID_TAG = "userid";
 
+  /**
+   * map of tables involved in the query.
+   */
+  private final Map tableMap = new HashMap<>();

Review comment:
   Can we make this a generic map so we can store things other than tables 
here? We will need this container also for other info that - in the absence of 
this new query state feature - we previously had to force into the conf, such 
as write commit info (jobID, vertexID, taskNum), CTAS info (is the query ctas? 
what's the ctas target table name), etc.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-20 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r636149696



##
File path: ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
##
@@ -349,6 +349,20 @@
 
   private final AtomicLong sparkSessionId = new AtomicLong();
 
+  private final Map queryStateMap = new HashMap<>();
+
+  public Object getQueryState(String queryId) {
+return queryStateMap.get(queryId);
+  }
+
+  public void addQueryState(String queryId, Object queryState) {

Review comment:
   Shouldn't we only accept only QueryState objects here?
   i.e. `public void addQueryState(String queryId, QueryState queryState) {`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-18 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r634500584



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -284,4 +334,42 @@ private static String getCatalogType(Configuration conf, 
String catalogName) {
   }
 }
   }
+
+  public static class TableCache {
+private static final Cache tableCache = 
Caffeine.newBuilder()
+.expireAfterAccess(12, TimeUnit.HOURS).build();
+private static final Logger LOG = 
LoggerFactory.getLogger(TableCache.class);
+
+public static void removeTables(Configuration conf) {
+  String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+  if (queryId != null && !queryId.isEmpty()) {
+Set queryKeys = tableCache.asMap().keySet().stream()
+.filter(k -> k.startsWith(queryId)).collect(Collectors.toSet());
+tableCache.invalidateAll(queryKeys);
+  } else {
+LOG.warn("Query id is not present in config, therefore no Iceberg 
table object is removed " +

Review comment:
   I think what we do expect is that the `hive.query.id` is always 
populated in the config. If it's absent, that's a warning sign and something 
which could prevent us from evicting from the cache. This could be a useful log 
line in case we see that the cache gets overly large and we're looking for the 
root cause why the iceberg tables are not evicted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-12 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630909697



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, 
String catalogName) {
   }
 }
   }
+
+  public static class TableCache {
+private static final Cache tableCache = 
Caffeine.newBuilder()
+.expireAfterAccess(12, TimeUnit.HOURS).build();
+private static final Set keys = new TreeSet<>();

Review comment:
   Instead of maintaining the key set manually, you might be able to get 
the keys in the hook method by using `tableCache.asMap().keySet()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-12 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630926018



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, 
String catalogName) {
   }
 }
   }
+
+  public static class TableCache {
+private static final Cache tableCache = 
Caffeine.newBuilder()
+.expireAfterAccess(12, TimeUnit.HOURS).build();

Review comment:
   Shall we decrease this a bit? Perhaps 30 minutes would also suffice? Not 
sure what a "long" compilation time means for Hive usually. Decreasing it could 
give us some extra safety against cache pollution in case the hook-based 
eviction doesn't always go smoothly. Worst case in a very long compilation 
scenario, we might reload the table again eventually, but I'm assuming 
compilations that take 30+ minutes should be pretty rare?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-12 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630918129



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergQueryLifeTimeHook.java
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHook;
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHookContext;
+import org.apache.iceberg.mr.Catalogs;
+
+public class HiveIcebergQueryLifeTimeHook implements QueryLifeTimeHook {
+  @Override
+  public void beforeCompile(QueryLifeTimeHookContext ctx) {
+
+  }
+
+  @Override
+  public void afterCompile(QueryLifeTimeHookContext ctx, boolean hasError) {
+  }
+
+  @Override
+  public void beforeExecution(QueryLifeTimeHookContext ctx) {
+Catalogs.TableCache.removeTable(ctx.getHiveConf());

Review comment:
   Shouldn't we call this in `afterCompile`? If there's a compilation error 
at some point, we won't even get to this method, no?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-12 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630915112



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, 
String catalogName) {
   }
 }
   }
+
+  public static class TableCache {
+private static final Cache tableCache = 
Caffeine.newBuilder()
+.expireAfterAccess(12, TimeUnit.HOURS).build();
+private static final Set keys = new TreeSet<>();
+
+public static void removeTable(Configuration conf) {
+  String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+  if (queryId != null && !queryId.isEmpty()) {
+Set queryKeys = keys.stream().filter(k -> 
k.startsWith(queryId)).collect(Collectors.toSet());
+tableCache.invalidateAll(queryKeys);
+keys.removeAll(queryKeys);
+  }

Review comment:
   shall we add an else branch here with some warn logging? It might be a 
problem here worth noting if the query id is not present in the config, because 
that will cause cache pollution over time. In the other methods, if the query 
id is not present we just reload the table, so that's not as big of a deal




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-12 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630910187



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, 
String catalogName) {
   }
 }
   }
+
+  public static class TableCache {
+private static final Cache tableCache = 
Caffeine.newBuilder()
+.expireAfterAccess(12, TimeUnit.HOURS).build();
+private static final Set keys = new TreeSet<>();
+
+public static void removeTable(Configuration conf) {
+  String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+  if (queryId != null && !queryId.isEmpty()) {
+Set queryKeys = keys.stream().filter(k -> 
k.startsWith(queryId)).collect(Collectors.toSet());
+tableCache.invalidateAll(queryKeys);
+keys.removeAll(queryKeys);
+  }
+}
+
+public static Table getTable(Configuration conf, String catalogName, 
String tableIdentifier) {
+  String queryId = conf.get(HiveConf.ConfVars.HIVEQUERYID.varname);
+  if (queryId != null  && !queryId.isEmpty()) {

Review comment:
   nit: double spaces after the queryId != null part, here and below

##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergQueryLifeTimeHook.java
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr.hive;
+
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHook;
+import org.apache.hadoop.hive.ql.hooks.QueryLifeTimeHookContext;
+import org.apache.iceberg.mr.Catalogs;
+
+public class HiveIcebergQueryLifeTimeHook implements QueryLifeTimeHook {
+  @Override
+  public void beforeCompile(QueryLifeTimeHookContext ctx) {
+
+  }
+
+  @Override
+  public void afterCompile(QueryLifeTimeHookContext ctx, boolean hasError) {
+  }

Review comment:
   nit: missing new line, like for beforeCompile and afterExecution




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-12 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630909854



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, 
String catalogName) {
   }
 }
   }
+
+  public static class TableCache {
+private static final Cache tableCache = 
Caffeine.newBuilder()
+.expireAfterAccess(12, TimeUnit.HOURS).build();
+private static final Set keys = new TreeSet<>();
+
+public static void removeTable(Configuration conf) {

Review comment:
   Can we rename this to make it clear that it removes _all_ tables for a 
given query? `removeTables`, `removeTablesForQuery`, or something similar that 
you'd prefer. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org



[GitHub] [hive] marton-bod commented on a change in pull request #2261: HIVE-25102: Cache Iceberg table objects within same query

2021-05-12 Thread GitBox


marton-bod commented on a change in pull request #2261:
URL: https://github.com/apache/hive/pull/2261#discussion_r630909697



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/Catalogs.java
##
@@ -284,4 +333,40 @@ private static String getCatalogType(Configuration conf, 
String catalogName) {
   }
 }
   }
+
+  public static class TableCache {
+private static final Cache tableCache = 
Caffeine.newBuilder()
+.expireAfterAccess(12, TimeUnit.HOURS).build();
+private static final Set keys = new TreeSet<>();

Review comment:
   Instead of maintaining the key set manually, you might be able to get 
the keys in the hook method by using 
`tableCache.getNativeCache().asMap().keySet()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org