It could be that there are two separate flaky-test issues from connections not being released, one in Flink and one in Spark. I don't think that the HiveCatalog code has changed much recently, which would point toward the problem being elsewhere.
I think one good reason to use HiveCatalog is to catch problems like these, where the catalog should be closed but isn't.

On Fri, Jan 8, 2021 at 9:23 AM Steven Wu <stevenz...@gmail.com> wrote:

> I use the try-with-resources pattern in the FLIP-27 dev branch. I saw this
> problem in Flink tests with the master branch too (although less often).
> With the FLIP-27 dev branch and an additional DeleteReadTests, it happened
> almost 100% of the time.
>
> Also, the Spark module (in the master branch) has this flaky-test problem
> as well. As Ryan mentioned earlier, maybe some common code (or pattern)
> causes the issue. It probably became more flaky recently because more
> tests were added.
>
> Regardless, I still have the question of whether most unit tests should
> use HiveCatalog. Why not the cheaper HadoopCatalog, except for tests that
> specifically target HiveCatalog? Could that help speed up the tests?
>
> On Fri, Jan 8, 2021 at 12:06 AM OpenInx <open...@gmail.com> wrote:
>
>> OK, there's a try-with-resources block that closes the TableLoader in
>> FlinkInputFormat [1], so we don't need the extra try-with-resources in
>> PR 2051 (I will close that).
>>
>> On my host I could not reproduce your connection leak when running
>> TestFlinkInputFormatReaderDeletes. Do you have any extra usage of the
>> table loader in your FLIP-27 dev branch that forgets to close it?
>>
>> [1]. https://github.com/apache/iceberg/blob/7645ceba65044184be192a7194a38729133b2e50/flink/src/main/java/org/apache/iceberg/flink/source/FlinkInputFormat.java#L77
>>
>> On Fri, Jan 8, 2021 at 3:36 PM OpenInx <open...@gmail.com> wrote:
>>
>>> > I was able to almost 100% reproduce the HiveMetaStoreClient aborted
>>> > connection problem locally with Flink tests after adding another
>>> > DeleteReadTests for the new FLIP-27 source impl in my dev branch
>>>
>>> I think I found why it fails so easily. TestFlinkInputFormatReaderDeletes
>>> creates a new CatalogLoader [1] for loading the table inside the
>>> FlinkInputFormat:
>>>
>>> TestHelpers.readRowData(inputFormat, rowType).forEach(rowData -> {
>>>   RowDataWrapper wrapper = new RowDataWrapper(rowType, projected.asStruct());
>>>   set.add(wrapper.wrap(rowData));
>>> });
>>>
>>> When TestHelpers#readRowData runs, it opens a new catalog (which means
>>> opening a new Hive connection). But after the read processing finishes,
>>> we never close the TableLoader, which leaks the catalog connection. I
>>> opened a PR [2] to fix this issue; will it work in your branch?
>>>
>>> I think it's worth keeping those Hive catalog unit tests so that we can
>>> detect connection leak issues like this in time.
>>>
>>> [1]. https://github.com/apache/iceberg/blob/4436c92928f4b3b90839a26bf6a656902733261f/flink/src/test/java/org/apache/iceberg/flink/source/TestFlinkInputFormatReaderDeletes.java#L114
>>> [2]. https://github.com/apache/iceberg/pull/2051/files
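A minimal sketch of the pattern described above, assuming the standard iceberg-flink TableLoader API: the loader is closed with try-with-resources so the catalog connection opened by open() is released once the read is finished. The method name, catalog loader, and table identifier below are illustrative only, not the actual change in PR 2051.

    import java.io.IOException;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.flink.CatalogLoader;
    import org.apache.iceberg.flink.TableLoader;

    // Illustrative helper: open a table through a CatalogLoader and guarantee
    // that the loader (and the Hive metastore connection it opens for a
    // HiveCatalog) is released when the read completes or fails.
    static void readTableAndClose(CatalogLoader catalogLoader) throws IOException {
      try (TableLoader tableLoader =
              TableLoader.fromCatalog(catalogLoader, TableIdentifier.of("default", "test"))) {
        tableLoader.open();                    // instantiates the catalog and its connection
        Table table = tableLoader.loadTable(); // refreshes table metadata over that connection
        System.out.println("loaded table at " + table.location());
        // ... read rows from the table here ...
      }                                        // close() releases the catalog connection
    }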
>>>
>>> On Fri, Jan 8, 2021 at 5:48 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> Ryan/OpenInx, thanks a lot for the pointers.
>>>>
>>>> I was able to reproduce the HiveMetaStoreClient aborted-connection
>>>> problem almost 100% of the time locally with the Flink tests after
>>>> adding another DeleteReadTests for the new FLIP-27 source impl in my
>>>> dev branch. I don't see the problem anymore after switching the Flink
>>>> DeleteReadTests from HiveCatalog (which requires the expensive
>>>> TestHiveMetastore) to HadoopCatalog.
>>>>
>>>> There is still a base test class, FlinkTestBase, that uses the
>>>> HiveCatalog. I am wondering whether there is value in using the more
>>>> expensive HiveCatalog rather than the HadoopCatalog?
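As a rough illustration of the HadoopCatalog alternative raised in that question, here is a sketch of a file-system-backed catalog setup that needs no embedded Hive metastore; the warehouse path, schema, and table name are placeholders, not the actual FlinkTestBase code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.types.Types;

    // Illustrative setup: a HadoopCatalog keeps table metadata on the file
    // system, so there is no Thrift metastore connection to leak. In a real
    // test the warehouse path would come from a temporary folder.
    static Table createTestTable() {
      Configuration conf = new Configuration();
      String warehouse = "file:///tmp/iceberg-warehouse";  // placeholder path
      HadoopCatalog catalog = new HadoopCatalog(conf, warehouse);
      Schema schema = new Schema(
          Types.NestedField.required(1, "id", Types.LongType.get()),
          Types.NestedField.optional(2, "data", Types.StringType.get()));
      return catalog.createTable(TableIdentifier.of("default", "test"), schema);
    }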
>>>>
>>>> On Wed, Jan 6, 2021 at 6:22 PM OpenInx <open...@gmail.com> wrote:
>>>>
>>>>> I encountered a similar issue when adding hive-site.xml support for the
>>>>> Flink Hive catalog. Here is the earlier discussion and solution:
>>>>> https://github.com/apache/iceberg/pull/1586#discussion_r509453461
>>>>>
>>>>> It's a connection leak issue.
>>>>>
>>>>> On Thu, Jan 7, 2021 at 10:06 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> I've noticed this too. I haven't had a chance to track down what's
>>>>>> causing it yet. I've seen it in Spark tests, so it looks like there
>>>>>> may be a problem that affects both. Probably a connection leak in the
>>>>>> common code.
>>>>>>
>>>>>> On Wed, Jan 6, 2021 at 3:44 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> I have noticed some flakiness with the Flink and Spark tests, both
>>>>>>> locally and in CI checks. @zhangjun0x01 also reported the same
>>>>>>> problem with iceberg-spark3-extensions. Below is a full stack trace
>>>>>>> from a local run of the Flink tests.
>>>>>>>
>>>>>>> The flakiness might be a recent regression, as the tests were stable
>>>>>>> for me until recently. Any recent Hive dependency change? Does
>>>>>>> anyone have any ideas?
>>>>>>>
>>>>>>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes >
>>>>>>> testMixedPositionAndEqualityDeletes[fileFormat=ORC] FAILED
>>>>>>>
>>>>>>>     java.lang.RuntimeException: Failed to get table info from metastore default.test
>>>>>>>         at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:142)
>>>>>>>         at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:86)
>>>>>>>         at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:69)
>>>>>>>         at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:92)
>>>>>>>         at org.apache.iceberg.flink.TableLoader$CatalogTableLoader.loadTable(TableLoader.java:113)
>>>>>>>         at org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes.rowSet(TestIcebergSourceReaderDeletes.java:90)
>>>>>>>
>>>>>>>     Caused by:
>>>>>>>     org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed)
>>>>>>>         at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
>>>>>>>         at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)
>>>>>>>         at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62)
>>>>>>>         at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_table_req(ThriftHiveMetastore.java:1561)
>>>>>>>         at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1553)
>>>>>>>         at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
>>>>>>>         at org.apache.iceberg.hive.HiveTableOperations.lambda$doRefresh$0(HiveTableOperations.java:130)
>>>>>>>         at org.apache.iceberg.hive.ClientPool.run(ClientPool.java:65)
>>>>>>>         at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:130)
>>>>>>>         ... 5 more
>>>>>>>
>>>>>>>     Caused by:
>>>>>>>     java.net.SocketException: Broken pipe (Write failed)
>>>>>>>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>>>>>>>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>>>>>>>         at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>>>>>>>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>>>>>>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>>>>>>>         at org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
>>>>>>>         ... 13 more
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix

--
Ryan Blue
Software Engineer
Netflix