pvary opened a new pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620


   Last night I was able to reproduce these failures for a while:
   ```
   org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithHiveCatalog > 
testJoinTablesParquet FAILED
      java.lang.IllegalArgumentException: Failed to executeQuery Hive query 
SHOW TABLES: Error while compiling statement: FAILED: SemanticException 
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Broken pipe (Write failed)
          Caused by:
          org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException 
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Broken pipe (Write failed)
              Caused by:
              org.apache.hadoop.hive.ql.parse.SemanticException: 
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Broken pipe (Write failed)
                  Caused by:
                  org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Broken pipe (Write failed)
                      Caused by:
                      org.apache.thrift.transport.TTransportException: 
java.net.SocketException: Broken pipe (Write failed)
                          Caused by:
                          java.net.SocketException: Broken pipe (Write failed)
   ```
   
   The issue is that the pool of worker threads in the TestHiveMetastore gets 
exhausted.
   We are creating HMSClients in several different places:
   - TestHiveMetastore - we set `iceberg.hive.client-pool-size` to 2, so if 
HiveCatalog is used we create a pool with 2 connections
   - TestHiveMetastore - its own `clientPool` has a size of 1, so we create 1 
more connection
   - When initializing HiveServer2, we create one connection for every worker 
and background thread
   - In Hive3, HiveServer2 creates a NotificationEventPoll thread which also 
uses a connection
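   The arithmetic behind the exhaustion can be sketched as a quick budget check. 
The HiveServer2 worker/background thread counts below are illustrative 
assumptions, not measured values; the pool sizes match the ones mentioned above:
   ```java
   // Back-of-the-envelope connection budget for TestHiveMetastore.
   // Worker/background thread counts are assumptions for illustration.
   public class ConnectionBudget {
       public static void main(String[] args) {
           int hiveCatalogPool = 2;      // iceberg.hive.client-pool-size = 2
           int metastoreClientPool = 1;  // TestHiveMetastore's own clientPool
           int hs2Connections = 2 + 1;   // assumed: 2 workers + 1 background thread
           int notificationPoll = 1;     // Hive3 NotificationEventPoll thread

           int needed = hiveCatalogPool + metastoreClientPool
               + hs2Connections + notificationPoll;
           int poolMax = 5;              // TestHiveMetastore worker thread maximum

           System.out.println("connections needed: " + needed);
           System.out.println("pool maximum: " + poolMax);
           System.out.println("exhausted: " + (needed > poolMax));
       }
   }
   ```
   Even with these conservative assumed thread counts, the budget already 
exceeds the pool maximum, which matches the Broken pipe failures above.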
   
   We initialize the TestHiveMetastore with a maximum of 5 threads, and the 
connections are not cleaned up consistently. The main culprit is that 
HiveMetastore and HiveServer2 are not designed to be started and stopped 
multiple times in the same thread/JVM, so cleaning up the Metastore 
connections during a HiveServer2 restart was never a priority. ThreadLocal 
connections are kept and reused for the worker threads, and they are only 
closed when the Finalizer thread destroys the object.
   
   I have tested 2 fixes, both of which worked consistently (either one alone 
was enough to fix the issue, but I wanted to verify my theory about the root 
cause):
   - Increased the thread pool size for the `TestHiveMetastore`
   - Added a `System.gc()` call to `HiveIcebergStorageHandlerBaseTest.after()` - 
since `System.gc()` is not guaranteed to trigger a GC, this is only something 
that might work
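   As a sketch, the second fix looks roughly like this. The method body is 
illustrative; the real teardown in `HiveIcebergStorageHandlerBaseTest` does 
more than shown here:
   ```java
   // Hedged sketch of the GC-based mitigation added to the test teardown.
   // System.gc() is only a request: the JVM may or may not run a collection,
   // so finalization of leaked ThreadLocal metastore connections is
   // best-effort, not guaranteed.
   public class TeardownSketch {
       // Stand-in for HiveIcebergStorageHandlerBaseTest.after() (illustrative)
       static boolean after() {
           // ... existing per-test cleanup would go here ...

           // Give the Finalizer thread a chance to close metastore connections
           // still referenced from dead HiveServer2 worker threads.
           System.gc();
           return true;
       }

       public static void main(String[] args) {
           after();
           System.out.println("after() completed");
       }
   }
   ```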
   
   I have added one more fix which turns off the NotificationEventPoller; this 
saves 1 connection for sure.
   By the time I wanted to test this fix, I was unable to reproduce the issue 
again - even without any changes.
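   For reference, disabling the poller in a test HiveConf is roughly a 
one-liner. The property `hive.notification.event.poll.interval` comes from 
Hive 3's HiveConf; that a non-positive value skips starting the poll thread is 
my reading of the HiveServer2 startup path, so treat it as an assumption:
   ```java
   // Assumed behavior: with a non-positive poll interval, HiveServer2 does not
   // start the NotificationEventPoll thread, saving one metastore connection.
   hiveConf.set("hive.notification.event.poll.interval", "0");
   ```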
   
   Since I have no way to test the fixes again, and all of them are test-only 
changes, I have created a pull request which contains all of them. I really 
hope this will fix all of the flakiness issues once and for all.
   
   Could you please check @massdosage, @marton-bod, @lcspinter, @rdblue 
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


