[
https://issues.apache.org/jira/browse/DRILL-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746598#comment-15746598
]
ASF GitHub Bot commented on DRILL-5098:
---------------------------------------
Github user sudheeshkatkam commented on a diff in the pull request:
https://github.com/apache/drill/pull/679#discussion_r92291348
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -357,10 +357,53 @@ protected void afterExecute(final Runnable r, final
Throwable t) {
super.afterExecute(r, t);
}
};
- client = new UserClient(clientName, config, supportComplexTypes,
allocator, eventLoopGroup, executor);
- logger.debug("Connecting to server {}:{}", endpoint.getAddress(),
endpoint.getUserPort());
- connect(endpoint);
- connected = true;
+
+ // "tries" is max number of unique drillbit to try connecting until
successfully connected to one of them
+ final String connectTriesConf = (props != null) ?
props.getProperty("tries", "5") : "5";
+
+ int connectTriesVal;
+ try {
+ connectTriesVal = Math.min(endpoints.size(),
Integer.parseInt(connectTriesConf));
+ } catch (NumberFormatException e) {
+ throw new InvalidConnectionInfoException("Invalid tries value: " +
connectTriesConf + " specified in " +
+ "connection string");
+ }
+
+ // If the value provided in the connection string is <=0 then override
with 1 since we want to try connecting
+ // at least once
+ connectTriesVal = Math.max(1, connectTriesVal);
+
+ int triedEndpointIndex = 0;
+ DrillbitEndpoint endpoint;
+
+ while (triedEndpointIndex < connectTriesVal) {
+ client = new UserClient(clientName, config, supportComplexTypes,
allocator, eventLoopGroup, executor);
+ endpoint = endpoints.get(triedEndpointIndex);
+ logger.debug("Connecting to server {}:{}", endpoint.getAddress(),
endpoint.getUserPort());
+
+ try {
+ connect(endpoint);
+ connected = true;
+ logger.info("Successfully connected to server {}:{}",
endpoint.getAddress(), endpoint.getUserPort());
+ break;
+ } catch (InvalidConnectionInfoException ex) {
+ logger.error("Connection to {}:{} failed with error {}. Not
retrying anymore", endpoint.getAddress(),
+ endpoint.getUserPort(), ex.getMessage());
+ throw ex;
+ } catch (RpcException ex) {
+ ++triedEndpointIndex;
+ logger.error("Attempt {}: Failed to connect to server {}:{}",
triedEndpointIndex, endpoint.getAddress(),
+ endpoint.getUserPort());
+
+ // Close the connection
+ client.close();
--- End diff --
Is `UserClient#close` idempotent?
Looks like `if (triedEndpointIndex == connectTriesVal)`, then
`UserClient#close` is called twice, here and in `DrillClient#close`. The is
unlike the previous catch clause i.e. `catch (InvalidConnectionInfoException
ex)`.
> Improving fault tolerance for connection between client and foreman node.
> -------------------------------------------------------------------------
>
> Key: DRILL-5098
> URL: https://issues.apache.org/jira/browse/DRILL-5098
> Project: Apache Drill
> Issue Type: Improvement
> Components: Client - JDBC
> Reporter: Sorabh Hamirwasia
> Assignee: Sorabh Hamirwasia
> Labels: doc-impacting, ready-to-commit
> Fix For: 1.10
>
>
> With DRILL-5015 we allowed support for specifying multiple Drillbits in
> connection string and randomly choosing one out of it. Over time some of the
> Drillbits specified in the connection string may die and the client can fail
> to connect to Foreman node if random selection happens to be of dead Drillbit.
> Even if ZooKeeper is used for selecting a random Drillbit from the registered
> one there is a small window when client selects one Drillbit and then that
> Drillbit went down. The client will fail to connect to this Drillbit and
> error out.
> Instead if we try multiple Drillbits (configurable tries count through
> connection string) then the probability of hitting this error window will
> reduce in both the cases improving fault tolerance. During further
> investigation it was also found that if there is Authentication failure then
> we throw that error as generic RpcException. We need to improve that as well
> to capture this case explicitly since in case of Auth failure we don't want
> to try multiple Drillbits.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)