Jecarm opened a new issue #1994:
URL: https://github.com/apache/iceberg/issues/1994


   ### Background
   We started two Hive Metastore (HMS) servers, _Server1_ and _Server2_, and configured Iceberg 0.9.1 as follows:
   * spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
   * spark.sql.catalog.hive_prod.type = hive
   * spark.sql.catalog.hive_prod.uri = thrift://_Server1_, thrift://_Server2_
   * spark.sql.catalog.hive_prod.clients = 2
   
   Assume that the client pool has already established two connections, one to _Server1_ and one to _Server2_.
   
   ### Operation and Problem
   1. We send a request to HMS through the client pool, either a table listing or a namespace listing. In Spark SQL this is `show tables in hive_prod.db` or `show namespaces in hive_prod`, and it returns the correct result.
   2. Stop HMS _Server1_ and send the same request. The expected result is the same as in step 1, because _Server2_ is still active, but the actual result is 'null'.
   3. In general, when one HMS stops and the other is still running, the client should automatically connect to the active one. In Iceberg, however, the client pool does not release failed connections.
   
   ### Analysis
   1. The `run` method of `ClientPool` is implemented as follows:
   ```
   public <R> R run(Action<R, C, E> action) throws E, InterruptedException {
       C client = get();
       try {
         return action.run(client);
   
       } catch (Exception exc) {
         if (reconnectExc.isInstance(exc)) {
           try {
             client = reconnect(client);
           } catch (Exception ignored) {
             // if reconnection throws any exception, rethrow the original failure
             throw reconnectExc.cast(exc);
           }
   
           return action.run(client);
         }
   
         throw exc;
   
       } finally {
         release(client);
       }
     }
   
   ```
   
   In `HiveClientPool`, `reconnectExc` is `TTransportException`, which extends `TException`. Therefore, when the client throws a `TTransportException`, the pool reconnects to the HMS:
   
   ```
   protected HiveMetaStoreClient reconnect(HiveMetaStoreClient client) {
       try {
         client.close();
         client.reconnect();
       } catch (MetaException e) {
         throw new RuntimeMetaException(e, "Failed to reconnect to Hive Metastore");
       }
       return client;
     }
   ```
   For any exception other than `TTransportException`, the previously established connection is not released.
   
   However, in the `HiveMetaStoreClient` class, some methods such as `getAllTables` and `getAllDatabases` throw a `MetaException`, which also extends `TException`, when one HMS is down.
   ```
   public List<String> getAllTables(String dbname) throws MetaException {
       try {
         return filterHook.filterTableNames(dbname, 
client.get_all_tables(dbname));
       } catch (Exception e) {
         MetaStoreUtils.logAndThrowMetaException(e);
       }
       return null;
     }
   
    /**
      * Catches exceptions that can't be handled and bundles them to MetaException
      *
      * @param e
      * @throws MetaException
      */
     static void logAndThrowMetaException(Exception e) throws MetaException {
       String exInfo = "Got exception: " + e.getClass().getName() + " "
           + e.getMessage();
       LOG.error(exInfo, e);
       LOG.error("Converting exception to MetaException");
       throw new MetaException(exInfo);
     }
   ```
   The `MetaException` carries the original `TTransportException` only in its message text, so the `reconnectExc` check does not match, the stale connection is not released, and an error message is returned to the user:
   > ERROR - Got exception: org.apache.thrift.transport.TTransportException null
   > org.apache.thrift.transport.TTransportException: null
   > ...
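
To illustrate why the wrapped exception slips past the reconnect check, here is a minimal, self-contained sketch. The exception classes are stand-ins declared locally so the example compiles without Thrift or Hive on the classpath, and `WrappingDemo`, `wrap`, and `triggersReconnect` are hypothetical names introduced only for this illustration. It assumes the wrapping behaves like `logAndThrowMetaException` above, which keeps only the class name and message.

```java
class WrappingDemo {
  // Mirrors the shape of logAndThrowMetaException: only the class name and
  // the message of the original exception survive, as plain text.
  static MetaException wrap(Exception e) {
    return new MetaException("Got exception: " + e.getClass().getName() + " " + e.getMessage());
  }

  // The same kind of check that ClientPool.run performs against reconnectExc.
  static boolean triggersReconnect(Class<? extends Exception> reconnectExc, Exception exc) {
    return reconnectExc.isInstance(exc);
  }

  public static void main(String[] args) {
    Exception original = new TTransportException(null);
    Exception wrapped = wrap(original);

    // The raw transport failure matches, so the pool would reconnect.
    System.out.println(triggersReconnect(TTransportException.class, original)); // true
    // The wrapped failure is a MetaException, so the check does not match.
    System.out.println(triggersReconnect(TTransportException.class, wrapped));  // false
    // No cause is attached either; the original type is visible only in the message.
    System.out.println(wrapped.getCause() == null);                             // true
  }
}

// Stand-in exception types (assumption: same inheritance as described above).
class TException extends Exception {
  TException(String message) { super(message); }
}

class TTransportException extends TException {
  TTransportException(String message) { super(message); }
}

class MetaException extends TException {
  MetaException(String message) { super(message); }
}
```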
   
   ### Solution
   The solution we can think of at present is to add special detection for a `MetaException` whose error message contains `org.apache.thrift.transport.TTransportException`. Once such a message is detected, the current client connection is closed and reconnected.
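
The proposed detection could look roughly like the sketch below. This is not Iceberg code: `ReconnectCheck`, `isConnectionFailure`, `runWithReconnect`, and the `Action` interface are hypothetical names, the exception classes are local stand-ins so the example compiles on its own, and the retry logic is a simplified version of the `run` method quoted above (no pooling, no release).

```java
// Hypothetical helper illustrating the proposed check; not part of Iceberg.
class ReconnectCheck {
  // Class name that logAndThrowMetaException embeds in the MetaException message.
  static final String TRANSPORT_MARKER = "org.apache.thrift.transport.TTransportException";

  interface Action<R> {
    R run() throws Exception;
  }

  // True for a raw transport failure, or for a MetaException whose message
  // shows that it was created from a transport failure.
  static boolean isConnectionFailure(Exception exc) {
    if (exc instanceof TTransportException) {
      return true;
    }
    String message = exc.getMessage();
    return exc instanceof MetaException && message != null && message.contains(TRANSPORT_MARKER);
  }

  // Simplified retry: reconnect once on a connection failure, then run the action again.
  static <R> R runWithReconnect(Action<R> action, Runnable reconnect) throws Exception {
    try {
      return action.run();
    } catch (Exception exc) {
      if (isConnectionFailure(exc)) {
        reconnect.run();
        return action.run();
      }
      throw exc;
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    String result = runWithReconnect(
        () -> {
          // First call fails the way getAllTables does when the HMS is down.
          if (calls[0]++ == 0) {
            throw new MetaException("Got exception: " + TRANSPORT_MARKER + " null");
          }
          return "tables";
        },
        () -> System.out.println("reconnecting"));
    System.out.println(result);
  }
}

// Stand-ins for the Thrift/Hive exception types.
class TTransportException extends Exception {
  TTransportException(String message) { super(message); }
}

class MetaException extends Exception {
  MetaException(String message) { super(message); }
}
```

Matching on the message text is fragile, since it depends on the exact wording produced by `logAndThrowMetaException`, but it needs no change on the Hive side.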
   
   
   
   

