Henry Robinson has posted comments on this change. Change subject: IMPALA-4278: Don't abort Catalog startup quickly if HMS is not present ......................................................................
Patch Set 4: (9 comments) Looks pretty close to me. http://gerrit.cloudera.org:8080/#/c/4842/4/fe/src/main/java/org/apache/impala/catalog/Catalog.java File fe/src/main/java/org/apache/impala/catalog/Catalog.java: PS4, Line 105: if (numClients > 0) { : metaStoreClientPool_.addClients(1, initialCnxnTimeoutSec); : metaStoreClientPool_.addClients(numClients - 1, 0); : } I think you can make this logic a method of MetaStoreClientPool - it's already called from MSCP's constructor, so a chance for sharing one method between both palces. http://gerrit.cloudera.org:8080/#/c/4842/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: Line 110: private static final int META_STORE_CLIENT_POOL_SIZE = 10; is this the max size, or the initial size, or both? http://gerrit.cloudera.org:8080/#/c/4842/4/fe/src/main/java/org/apache/impala/catalog/MetaStoreClientPool.java File fe/src/main/java/org/apache/impala/catalog/MetaStoreClientPool.java: PS4, Line 76: private Add a comment to this method describing the behaviour. PS4, Line 83: IMetaStoreClient hiveClient = null; What's the point of this variable - why not assign to hiveClient_ directly? PS4, Line 99: catch (InterruptedException ignore) {} on this path you might end up not sleeping for long enough if you get interrupted. Instead, suggest you have something roughly like: long delayUntil = System.currentTimeMillis() + retryDelayMillis; if (delayUntil > endTimeMillis) throw... while (delayUntil > System.currentTimeMillis()) { try { Thread.sleep(delayUntil - System.currentTimeMillis()); } catch (...) { } } PS4, Line 161: initialCnxnTimeoutSec this is not the 'initial' timeout in the same sense that it is elsewhere, I think. Elsewhere we use it to mean "first client", but here it's used to mean "first connection". I think it's better just to call it cnxnTimeoutSec here. http://gerrit.cloudera.org:8080/#/c/4842/4/tests/experiments/test_catalog_hms_failures.py File tests/experiments/test_catalog_hms_failures.py: PS4, Line 81: 10 just from experience, suggest you make this 30s. Timeouts are surprisingly flaky in EC2-based build infrastructure. Line 114: Perhaps wait 5s or so here to be sure that the catalog is in the 'trying to connect' phase of its startup. Line 122 How about a test that confirms that if the HMS does not start within initial_hms_cnxn_timeout_s, then the catalogd fails? -- To view, visit http://gerrit.cloudera.org:8080/4842 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I83c70f939429e1d0d20284a1307f3ee1278ae047 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-HasComments: Yes
