[ https://issues.apache.org/jira/browse/DRILL-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736881#comment-15736881 ]
ASF GitHub Bot commented on DRILL-5098: --------------------------------------- Github user sudheeshkatkam commented on a diff in the pull request: https://github.com/apache/drill/pull/679#discussion_r91823124 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java --- @@ -357,10 +357,54 @@ protected void afterExecute(final Runnable r, final Throwable t) { super.afterExecute(r, t); } }; - client = new UserClient(clientName, config, supportComplexTypes, allocator, eventLoopGroup, executor); - logger.debug("Connecting to server {}:{}", endpoint.getAddress(), endpoint.getUserPort()); - connect(endpoint); - connected = true; + + // "tries" is max number of unique drillbit to try connecting until successfully connected to one of them + final String connectTriesConf = (props != null) ? props.getProperty("tries", "5") : "5"; + + int connectTriesVal; + try { + connectTriesVal = Math.min(endpoints.size(), Integer.parseInt(connectTriesConf)); + } catch (NumberFormatException e) { + throw new InvalidConnectionInfoException("Invalid tries value: " + connectTriesConf + " specified in " + + "connection string"); + } + + // If the value provided in the connection string is <=0 then override with 1 since we want to try connecting + // at least once + connectTriesVal = Math.max(1, connectTriesVal); + + int triedEndpointIndex = 0; + DrillbitEndpoint endpoint; + + while (triedEndpointIndex < connectTriesVal) { + client = new UserClient(clientName, config, supportComplexTypes, allocator, eventLoopGroup, executor); + endpoint = endpoints.get(triedEndpointIndex); + logger.debug("Connecting to server {}:{}", endpoint.getAddress(), endpoint.getUserPort()); + + try { + connect(endpoint); + connected = true; + logger.info("Successfully connected to server {}:{}", endpoint.getAddress(), endpoint.getUserPort()); + break; + } catch (InvalidConnectionInfoException ex) { + logger.error("Connection to {}:{} failed with error {}. Not retrying anymore", endpoint.getAddress(), + endpoint.getUserPort(), ex.getMessage()); + throw ex; + } catch (RpcException ex) { + ++triedEndpointIndex; + logger.error("Attempt {}: Failed to connect to server {}:{}", triedEndpointIndex, endpoint.getAddress(), + endpoint.getUserPort()); + + // Close the connection + if (client.isActive()) { --- End diff -- The loop recreates a new client anyway. So this should be closed, regardless of being active. > Improving fault tolerance for connection between client and foreman node. > ------------------------------------------------------------------------- > > Key: DRILL-5098 > URL: https://issues.apache.org/jira/browse/DRILL-5098 > Project: Apache Drill > Issue Type: Improvement > Components: Client - JDBC > Reporter: Sorabh Hamirwasia > Assignee: Sorabh Hamirwasia > Labels: doc-impacting, ready-to-commit > Fix For: 1.10 > > > With DRILL-5015 we allowed support for specifying multiple Drillbits in > connection string and randomly choosing one out of it. Over time some of the > Drillbits specified in the connection string may die and the client can fail > to connect to Foreman node if random selection happens to be of dead Drillbit. > Even if ZooKeeper is used for selecting a random Drillbit from the registered > one there is a small window when client selects one Drillbit and then that > Drillbit went down. The client will fail to connect to this Drillbit and > error out. > Instead if we try multiple Drillbits (configurable tries count through > connection string) then the probability of hitting this error window will > reduce in both the cases improving fault tolerance. During further > investigation it was also found that if there is Authentication failure then > we throw that error as generic RpcException. We need to improve that as well > to capture this case explicitly since in case of Auth failure we don't want > to try multiple Drillbits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)