[
https://issues.apache.org/jira/browse/IMPALA-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004341#comment-18004341
]
ASF subversion and git services commented on IMPALA-8073:
---------------------------------------------------------
Commit 0b1a32fad8a6cc5173b0ac1585af69f08d583ed9 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=0b1a32fad ]
IMPALA-13850 (part 4): Implement in-place reset for CatalogD
This patch improves the availability of CatalogD under a huge
INVALIDATE METADATA operation. Previously, CatalogServiceCatalog.reset()
held versionLock_.writeLock() for the whole reset duration. When the
number of databases, tables, or functions is large, this write lock can
be held for a long time, preventing any other catalog operation from
proceeding.
This patch improves the situation by:
1. Making CatalogServiceCatalog.reset() rebuild dbCache_ in place and
periodically release the write lock between rebuild stages.
2. Fetching database, table, and function metadata from the MetaStore
in the background using an ExecutorService (sketched below). A new
catalog_reset_max_threads flag controls the number of threads used for
the parallel fetch.
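A minimal sketch of how the parallel fetch could be organized. Only
catalog_reset_max_threads and ExecutorService correspond to names in
this patch; BackgroundFetchSketch, fetchPool_, submitFetches(),
fetchDbFromMetastore(), and FetchedDbMetadata are illustrative
assumptions, not the actual Impala code:
{noformat}
// Hypothetical sketch only: class, field, and helper names are
// illustrative, not the actual Impala code.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BackgroundFetchSketch {
  // Sized from the catalog_reset_max_threads flag.
  private final ExecutorService fetchPool_;

  BackgroundFetchSketch(int catalogResetMaxThreads) {
    fetchPool_ = Executors.newFixedThreadPool(catalogResetMaxThreads);
  }

  // Submit one fetch task per database, in lexicographic order, so that
  // reset() can consume the results stage by stage.
  List<Future<FetchedDbMetadata>> submitFetches(List<String> sortedDbNames) {
    List<Future<FetchedDbMetadata>> futures = new java.util.ArrayList<>();
    for (String dbName : sortedDbNames) {
      futures.add(fetchPool_.submit(() -> fetchDbFromMetastore(dbName)));
    }
    return futures;
  }

  // Placeholder: the real patch pulls the database, its tables, and its
  // functions from the Hive MetaStore here.
  private FetchedDbMetadata fetchDbFromMetastore(String dbName) {
    return new FetchedDbMetadata(dbName);
  }

  static class FetchedDbMetadata {
    final String dbName;
    FetchedDbMetadata(String dbName) { this.dbName = dbName; }
  }
}
{noformat}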
To do so, reset() must process databases in lexicographic order and
ensure that all Db invalidations within a single stage are complete
before releasing the write lock. Stages should take approximately the
same amount of time. A catalog operation on a database must ensure that
either no reset operation is currently running or the database name is
lexicographically less than the current database-under-invalidation, as
illustrated in the sketch below.
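For illustration only, the ordering rule such an operation has to check
could look like this; the field and method names here are hypothetical,
and the real check lives in CatalogResetManager helpers:
{noformat}
// Illustrative only: field and method names are hypothetical.
class ResetOrderingSketch {
  private volatile boolean resetInProgress_ = false;
  // Name of the database currently being invalidated by reset(), or null.
  private volatile String dbUnderInvalidation_ = null;

  // A catalog operation on 'dbName' may proceed only if no reset is
  // running, or 'dbName' sorts lexicographically before the current
  // database-under-invalidation (its stage has already completed).
  boolean canOperateOn(String dbName) {
    if (!resetInProgress_) return true;
    String current = dbUnderInvalidation_;
    return current != null && dbName.compareTo(current) < 0;
  }
}
{noformat}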
This patch adds CatalogResetManager to do the background metadata
fetching and to provide helper methods for waiting on reset progress.
CatalogServiceCatalog must hold versionLock_.writeLock() before calling
most CatalogResetManager methods.
The following methods in the CatalogServiceCatalog class must wait on
CatalogResetManager.waitOngoingMetadataFetch() (see the sketch after
this list):
addDb()
addFunction()
addIncompleteTable()
addTable()
invalidateTableIfExists()
removeDb()
removeFunction()
removeTable()
renameTable()
replaceTableIfUnchanged()
tryLock()
updateDb()
InvalidateAwareDbSnapshotIterator.hasNext()
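For illustration, these methods roughly follow the pattern below. This
is a simplified shape, not the actual implementation: versionLock_,
dbCache_, addDb(), and waitOngoingMetadataFetch() mirror names from the
patch, while everything else is a placeholder:
{noformat}
// Simplified shape only; non-patch names are placeholders.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class WaitPatternSketch {
  private final ReentrantReadWriteLock versionLock_ =
      new ReentrantReadWriteLock();
  private final Map<String, Object> dbCache_ = new HashMap<>();

  void addDb(String dbName, Object db) {
    versionLock_.writeLock().lock();
    try {
      // Block until the ongoing reset stage (if any) finishes, so this
      // mutation is not clobbered by the in-place rebuild of dbCache_.
      waitOngoingMetadataFetch();
      dbCache_.put(dbName, db);
    } finally {
      versionLock_.writeLock().unlock();
    }
  }

  // Stand-in for CatalogResetManager.waitOngoingMetadataFetch().
  private void waitOngoingMetadataFetch() { /* no-op in this sketch */ }
}
{noformat}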
A concurrent global INVALIDATE METADATA must wait until the currently
running global INVALIDATE METADATA completes. The waiting happens by
calling waitFullMetadataFetch().
CatalogServiceCatalog.getAllDbs() takes a snapshot of the dbCache_
values at a point in time. With this patch, it is now possible that
some Db in this snapshot has been removed from dbCache_ by a concurrent
reset(). Callers that care about snapshot integrity, such as
CatalogServiceCatalog.getCatalogDelta(), must be careful when iterating
the snapshot. They must iterate in lexicographic order, similar to
reset(), and make sure they do not go beyond the current
database-under-invalidation. They must also skip the Db currently being
inspected if Db.isRemoved() is true. Added helper class
InvalidateAwareDbSnapshot for this kind of iteration (sketched below).
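A rough sketch of that iteration discipline, using a stand-in Db
interface with only the methods the sketch needs; only Db.isRemoved()
and the ordering rule come from the patch, and the real
InvalidateAwareDbSnapshot encapsulates this logic:
{noformat}
// Hypothetical sketch of invalidate-aware iteration over a dbCache_
// snapshot; class and method names are illustrative.
import java.util.List;
import java.util.function.Consumer;

class SnapshotIterationSketch {
  interface Db {
    String getName();
    boolean isRemoved();
  }

  // Iterate the snapshot in lexicographic order, stop at the database
  // currently being invalidated by reset(), and skip entries a
  // concurrent reset() has already removed.
  void forEachValidDb(List<Db> sortedSnapshot, String dbUnderInvalidation,
      Consumer<Db> action) {
    for (Db db : sortedSnapshot) {
      if (dbUnderInvalidation != null
          && db.getName().compareTo(dbUnderInvalidation) >= 0) {
        break;  // Do not go beyond the database-under-invalidation.
      }
      if (db.isRemoved()) continue;  // Removed by a concurrent reset().
      action.accept(db);
    }
  }
}
{noformat}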
Override CatalogServiceCatalog.getDb() and CatalogServiceCatalog.getDbs()
to wait until the first reset's metadata fetch completes or the
looked-up Db is found in the cache (roughly as sketched below).
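Again only as an illustrative shape, with firstResetComplete() and the
cache type as placeholders for the real CatalogServiceCatalog state,
the getDb() wait could look like this:
{noformat}
// Illustrative shape only; names other than getDb() and dbCache_ are
// placeholders.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class GetDbSketch {
  private final Map<String, Object> dbCache_ = new ConcurrentHashMap<>();

  synchronized Object getDb(String dbName) throws InterruptedException {
    // Wait until either the first reset has finished fetching metadata
    // or the requested Db has already been rebuilt into the cache.
    while (!firstResetComplete() && !dbCache_.containsKey(dbName)) {
      wait(100);
    }
    return dbCache_.get(dbName);
  }

  private boolean firstResetComplete() { return true; /* placeholder */ }
}
{noformat}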
Expand test_restart_catalogd_twice into test_restart_legacy_catalogd_twice
and test_restart_local_catalogd_twice. Update
CustomClusterTestSuite.wait_for_wm_init_complete() to correctly pass
timeout values to the helper methods that it calls. Reduce cluster_size
from 10 to 3 in a few tests of test_workload_mgmt_init.py to avoid
flakiness. Fixed an HMS connection leak between tests in
AuthorizationStmtTest (see IMPALA-8073).
Testing:
- Pass exhaustive tests.
Change-Id: Ib4ae2154612746b34484391c5950e74b61f85c9d
Reviewed-on: http://gerrit.cloudera.org:8080/22640
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Quanlong Huang <[email protected]>
> SentryProxy.testAddCatalog() failed in private build because of socket error
> ----------------------------------------------------------------------------
>
> Key: IMPALA-8073
> URL: https://issues.apache.org/jira/browse/IMPALA-8073
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Tim Armstrong
> Assignee: Fredy Wijaya
> Priority: Critical
> Labels: flaky
> Fix For: Impala 3.2.0
>
> Attachments:
> FeSupport.impala-ec2-centos74-m5-4xlarge-ondemand-0eae.vpc.cloudera.com.jenkins.log.INFO.20190110-184543.10852
>
>
> {noformat}
> org.apache.impala.util.SentryProxyTest.testAddCatalog
> Failing for the past 1 build (Since Failed#4229 )
> Took 3 min 40 sec.
> add description
> Error Message
> Error initializing Catalog. Catalog may be empty.
> Stacktrace
> java.lang.IllegalStateException: Error initializing Catalog. Catalog may be
> empty.
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.logAndThrowMetaException(MetaStoreUtils.java:1444)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getAllDatabases(HiveMetaStoreClient.java:1351)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:150)
> at com.sun.proxy.$Proxy16.getAllDatabases(Unknown Source)
> at
> org.apache.impala.catalog.CatalogServiceCatalog.reset(CatalogServiceCatalog.java:1181)
> at
> org.apache.impala.testutil.CatalogServiceTestCatalog.createWithAuth(CatalogServiceTestCatalog.java:59)
> at
> org.apache.impala.util.SentryProxyTest.withAllPrincipalTypes(SentryProxyTest.java:603)
> at
> org.apache.impala.util.SentryProxyTest.testAddCatalog(SentryProxyTest.java:140)
> {noformat}
> The error in the log (attached) appears to be a connection to the HMS error.
> {noformat}
> 0110 18:48:08.139935 10853 HiveMetaStoreClient.java:461] Trying to connect to
> metastore with URI thrift://localhost:9083
> I0110 18:48:08.140183 10853 HiveMetaStoreClient.java:535] Opened a connection
> to metastore, current connections: 797
> W0110 18:48:28.143384 10853 HiveMetaStoreClient.java:560] set_ugi() not
> successful, Likely cause: new client talking to old server. Continuing
> without it.
> Java exception follows:
> org.apache.thrift.transport.TTransportException: java.net.SocketException:
> Connection reset
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> at
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:4129)
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:4115)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:552)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:299)
> at sun.reflect.GeneratedConstructorAccessor3.newInstance(Unknown
> Source)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1750)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:80)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:130)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:101)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:94)
> at
> org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:93)
> at
> org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:72)
> at
> org.apache.impala.catalog.MetaStoreClientPool.initClients(MetaStoreClientPool.java:168)
> at org.apache.impala.catalog.Catalog.<init>(Catalog.java:100)
> at
> org.apache.impala.catalog.CatalogServiceCatalog.<init>(CatalogServiceCatalog.java:260)
> at
> org.apache.impala.testutil.CatalogServiceTestCatalog.<init>(CatalogServiceTestCatalog.java:36)
> at
> org.apache.impala.testutil.CatalogServiceTestCatalog.createWithAuth(CatalogServiceTestCatalog.java:58)
> at
> org.apache.impala.util.SentryProxyTest.withAllPrincipalTypes(SentryProxyTest.java:603)
> at
> org.apache.impala.util.SentryProxyTest.testAddCatalog(SentryProxyTest.java:140)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
> at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
> at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
> at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
> at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
> Caused by: java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> ... 52 more
> I0110 18:48:28.143533 10853 HiveMetaStoreClient.java:588] Connected to
> metastore.
> I0110 18:48:28.144161 10853 HiveMetaStoreClient.java:461] Trying to connect
> to metastore with URI thrift://localhost:9083
> I0110 18:48:28.144359 10853 HiveMetaStoreClient.java:535] Opened a connection
> to metastore, current connections: 798
> W0110 18:48:48.147451 10853 HiveMetaStoreClient.java:560] set_ugi() not
> successful, Likely cause: new client talking to old server. Continuing
> without it.
> Java exception follows:
> org.apache.thrift.transport.TTransportException: java.net.SocketException:
> Connection reset
> at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> {noformat}
> Some initial googling suggested that it might be the server closing the
> connection because of hitting a connection limit.
> [~fredyw] could you take a look and see if you have any ideas. I wonder if
> we're leaking HMS connections in this test somehow?