pvary opened a new pull request #1620:
URL: https://github.com/apache/iceberg/pull/1620
Last night I was able to reproduce these failures for a while:
```
org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithHiveCatalog >
testJoinTablesParquet FAILED
java.lang.IllegalArgumentException: Failed to executeQuery Hive query
SHOW TABLES: Error while compiling statement: FAILED: SemanticException
org.apache.thrift.transport.TTransportException: java.net.SocketException:
Broken pipe (Write failed)
Caused by:
org.apache.hive.service.cli.HiveSQLException: Error while compiling
statement: FAILED: SemanticException
org.apache.thrift.transport.TTransportException: java.net.SocketException:
Broken pipe (Write failed)
Caused by:
org.apache.hadoop.hive.ql.parse.SemanticException:
org.apache.thrift.transport.TTransportException: java.net.SocketException:
Broken pipe (Write failed)
Caused by:
org.apache.hadoop.hive.ql.metadata.HiveException:
org.apache.thrift.transport.TTransportException: java.net.SocketException:
Broken pipe (Write failed)
Caused by:
org.apache.thrift.transport.TTransportException:
java.net.SocketException: Broken pipe (Write failed)
Caused by:
java.net.SocketException: Broken pipe (Write failed)
```
The issue is that the pool of worker threads in the
TestHiveMetastore is exhausted.
We create HMS clients in several different places:
- TestHiveMetastore - we set `iceberg.hive.client-pool-size` to 2, so if
HiveCatalog is used we create a pool with 2 connections
- TestHiveMetastore - its own `clientPool` has size 1, so we create 1 more
connection
- When initializing HiveServer2, every worker and background thread creates
one connection each
- In Hive 3, HiveServer2 creates a NotificationEventPoll thread which also
uses a connection
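A rough tally shows why this adds up to exhaustion (the HiveServer2 thread count below is an illustrative assumption, not an exact value from the Hive code):

```java
public class ConnectionTally {
    // Rough tally of the HMS connections listed above; hs2Threads is
    // an assumed count of HiveServer2 worker + background threads.
    static int connectionsNeeded() {
        int catalogPool = 2;   // iceberg.hive.client-pool-size
        int metastorePool = 1; // TestHiveMetastore's own clientPool
        int hs2Threads = 2;    // assumption: one connection per HS2 thread
        int eventPoll = 1;     // Hive 3 NotificationEventPoll thread
        return catalogPool + metastorePool + hs2Threads + eventPoll;
    }

    public static void main(String[] args) {
        int metastoreThreadMax = 5; // TestHiveMetastore worker thread maximum
        System.out.println("connections needed: " + connectionsNeeded()
            + ", metastore worker threads: " + metastoreThreadMax);
        // Any connection leaked by a previous HiveServer2 start pushes
        // the demand over the 5-thread limit.
    }
}
```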
We initialize the TestHiveMetastore with a maximum of 5 threads, and the
connections are not cleaned up consistently. The main culprit is that the Hive
Metastore and HiveServer2 are not designed to be started and stopped in the
same thread/JVM multiple times, so cleaning up Metastore connections during a
HiveServer2 restart was never a priority. ThreadLocal connections are kept and
reused by the worker threads, and the connections are only closed when the
Finalizer thread destroys the object.
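The ThreadLocal mechanism behind the leak can be shown in isolation: a value stored by a pooled worker thread survives across tasks and is never cleaned up unless the thread itself dies. A minimal sketch (not Hive code, just the pattern):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadLocalLeakDemo {
    // Stands in for the per-thread metastore connection HiveServer2 keeps.
    private static final ThreadLocal<String> CONNECTION = new ThreadLocal<>();

    // Runs two tasks on the same pooled thread; returns what the second sees.
    static String leakedValue() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // First "request" opens a connection and caches it per-thread.
            pool.submit(() -> CONNECTION.set("connection-1")).get();
            // A later task on the same worker thread still sees the old value:
            // nothing closed it, it just sits on the thread until the thread dies.
            Future<String> later = pool.submit(() -> CONNECTION.get());
            return later.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("leaked value: " + leakedValue());
    }
}
```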
I have tested 2 fixes, each of which worked consistently on its own (either
one was enough to fix the issue, but I wanted to verify my root-cause theory):
- Increased the thread pool size for the `TestHiveMetastore`
- Added a `System.gc()` call to `HiveIcebergStorageHandlerBaseTest.after()` -
since `System.gc()` is not guaranteed to actually run a GC, this is only
something that might work
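The second fix relies on GC to run the finalizers that close the leaked connections. A minimal, Hive-free sketch of the idea, where the retry loop exists precisely because a single `System.gc()` call is only a hint to the JVM:

```java
import java.lang.ref.WeakReference;

public class GcHintSketch {
    // Requests GC repeatedly until the referent is collected or we give up,
    // mirroring why a lone System.gc() in after() "might" work but isn't
    // guaranteed to.
    static boolean collectWithHints(WeakReference<?> ref, int maxAttempts)
            throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            if (ref.get() == null) {
                return true; // collected; its finalizer can now run
            }
            System.gc();      // a hint, not a command
            Thread.sleep(50); // give the collector and Finalizer thread a chance
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws InterruptedException {
        Object stale = new Object();
        WeakReference<Object> ref = new WeakReference<>(stale);
        stale = null; // drop the strong reference, like dropping old HS2 state
        System.out.println("collected: " + collectWithHints(ref, 20));
    }
}
```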
I have added one more fix that turns off the NotificationEventPoll, which
definitely saves 1 connection.
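If the poller is disabled through configuration, it would look something like the fragment below. The property name is my assumption from the Hive 3 codebase (a non-positive poll interval should keep the NotificationEventPoll thread from starting); verify it against the Hive version in use:

```
<!-- Assumed property; verify against your Hive 3 version. -->
<property>
  <name>hive.notification.event.poll.interval</name>
  <value>-1</value>
</property>
```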
When I went to test this fix, I was unable to reproduce the issue anymore -
even without any changes.
Since I have no way to test the fixes again and all of them are test-only
changes, I have created a pull request which contains all of them. I really
hope this will fix all of the flakiness issues once and for all.
Could you please review, @massdosage, @marton-bod, @lcspinter, @rdblue?