[
https://issues.apache.org/jira/browse/DRILL-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736861#comment-15736861
]
ASF GitHub Bot commented on DRILL-5098:
---------------------------------------
Github user sohami commented on a diff in the pull request:
https://github.com/apache/drill/pull/679#discussion_r91820952
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
@@ -357,10 +357,54 @@ protected void afterExecute(final Runnable r, final
Throwable t) {
super.afterExecute(r, t);
}
};
- client = new UserClient(clientName, config, supportComplexTypes,
allocator, eventLoopGroup, executor);
- logger.debug("Connecting to server {}:{}", endpoint.getAddress(),
endpoint.getUserPort());
- connect(endpoint);
- connected = true;
+
+ // "tries" is max number of unique drillbit to try connecting until
successfully connected to one of them
+ final String connectTriesConf = (props != null) ?
props.getProperty("tries", "5") : "5";
+
+ int connectTriesVal;
+ try {
+ connectTriesVal = Math.min(endpoints.size(),
Integer.parseInt(connectTriesConf));
+ } catch (NumberFormatException e) {
+ throw new InvalidConnectionInfoException("Invalid tries value: " +
connectTriesConf + " specified in " +
+ "connection string");
+ }
+
+ // If the value provided in the connection string is <=0 then override
with 1 since we want to try connecting
+ // at least once
+ connectTriesVal = Math.max(1, connectTriesVal);
+
+ int triedEndpointIndex = 0;
+ DrillbitEndpoint endpoint;
+
+ while (triedEndpointIndex < connectTriesVal) {
+ client = new UserClient(clientName, config, supportComplexTypes,
allocator, eventLoopGroup, executor);
+ endpoint = endpoints.get(triedEndpointIndex);
+ logger.debug("Connecting to server {}:{}", endpoint.getAddress(),
endpoint.getUserPort());
+
+ try {
+ connect(endpoint);
+ connected = true;
+ logger.info("Successfully connected to server {}:{}",
endpoint.getAddress(), endpoint.getUserPort());
+ break;
+ } catch (InvalidConnectionInfoException ex) {
+ logger.error("Connection to {}:{} failed with error {}. Not
retrying anymore", endpoint.getAddress(),
+ endpoint.getUserPort(), ex.getMessage());
+ throw ex;
+ } catch (RpcException ex) {
+ ++triedEndpointIndex;
+ logger.error("Attempt {}: Failed to connect to server {}:{}",
triedEndpointIndex, endpoint.getAddress(),
+ endpoint.getUserPort());
+
+ // Close the connection
+ if (client.isActive()) {
--- End diff --
Here client is userClient and "isActive" check for all the below conditions:
return connection != null
&& connection.getChannel() != null
&& connection.getChannel().isActive();
and close --> closes the corresponding channel.
Since we are retrying so other resources will be reused for next connection
attempt. After all retries are exhausted those will be closed by the
DrillClient.close method.
> Improving fault tolerance for connection between client and foreman node.
> -------------------------------------------------------------------------
>
> Key: DRILL-5098
> URL: https://issues.apache.org/jira/browse/DRILL-5098
> Project: Apache Drill
> Issue Type: Improvement
> Components: Client - JDBC
> Reporter: Sorabh Hamirwasia
> Assignee: Sorabh Hamirwasia
> Labels: doc-impacting, ready-to-commit
> Fix For: 1.10
>
>
> With DRILL-5015 we allowed support for specifying multiple Drillbits in
> connection string and randomly choosing one out of it. Over time some of the
> Drillbits specified in the connection string may die and the client can fail
> to connect to Foreman node if random selection happens to be of dead Drillbit.
> Even if ZooKeeper is used for selecting a random Drillbit from the registered
> one there is a small window when client selects one Drillbit and then that
> Drillbit went down. The client will fail to connect to this Drillbit and
> error out.
> Instead if we try multiple Drillbits (configurable tries count through
> connection string) then the probability of hitting this error window will
> reduce in both the cases improving fault tolerance. During further
> investigation it was also found that if there is Authentication failure then
> we throw that error as generic RpcException. We need to improve that as well
> to capture this case explicitly since in case of Auth failure we don't want
> to try multiple Drillbits.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)