[ 
https://issues.apache.org/jira/browse/DRILL-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15736881#comment-15736881
 ] 

ASF GitHub Bot commented on DRILL-5098:
---------------------------------------

Github user sudheeshkatkam commented on a diff in the pull request:

    https://github.com/apache/drill/pull/679#discussion_r91823124
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/client/DrillClient.java ---
    @@ -357,10 +357,54 @@ protected void afterExecute(final Runnable r, final 
Throwable t) {
             super.afterExecute(r, t);
           }
         };
    -    client = new UserClient(clientName, config, supportComplexTypes, 
allocator, eventLoopGroup, executor);
    -    logger.debug("Connecting to server {}:{}", endpoint.getAddress(), 
endpoint.getUserPort());
    -    connect(endpoint);
    -    connected = true;
    +
    +    // "tries" is max number of unique drillbit to try connecting until 
successfully connected to one of them
    +    final String connectTriesConf = (props != null) ? 
props.getProperty("tries", "5") : "5";
    +
    +    int connectTriesVal;
    +    try {
    +      connectTriesVal = Math.min(endpoints.size(), 
Integer.parseInt(connectTriesConf));
    +    } catch (NumberFormatException e) {
    +      throw new InvalidConnectionInfoException("Invalid tries value: " + 
connectTriesConf + " specified in " +
    +                                               "connection string");
    +    }
    +
    +    // If the value provided in the connection string is <=0 then override 
with 1 since we want to try connecting
    +    // at least once
    +    connectTriesVal = Math.max(1, connectTriesVal);
    +
    +    int triedEndpointIndex = 0;
    +    DrillbitEndpoint endpoint;
    +
    +    while (triedEndpointIndex < connectTriesVal) {
    +      client = new UserClient(clientName, config, supportComplexTypes, 
allocator, eventLoopGroup, executor);
    +      endpoint = endpoints.get(triedEndpointIndex);
    +      logger.debug("Connecting to server {}:{}", endpoint.getAddress(), 
endpoint.getUserPort());
    +
    +      try {
    +        connect(endpoint);
    +        connected = true;
    +        logger.info("Successfully connected to server {}:{}", 
endpoint.getAddress(), endpoint.getUserPort());
    +        break;
    +      } catch (InvalidConnectionInfoException ex) {
    +        logger.error("Connection to {}:{} failed with error {}. Not 
retrying anymore", endpoint.getAddress(),
    +                     endpoint.getUserPort(), ex.getMessage());
    +        throw ex;
    +      } catch (RpcException ex) {
    +        ++triedEndpointIndex;
    +        logger.error("Attempt {}: Failed to connect to server {}:{}", 
triedEndpointIndex, endpoint.getAddress(),
    +                     endpoint.getUserPort());
    +
    +        // Close the connection
    +        if (client.isActive()) {
    --- End diff --
    
    The loop recreates a new client anyway. So this should be closed, 
regardless of being active.


> Improving fault tolerance for connection between client and foreman node.
> -------------------------------------------------------------------------
>
>                 Key: DRILL-5098
>                 URL: https://issues.apache.org/jira/browse/DRILL-5098
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Client - JDBC
>            Reporter: Sorabh Hamirwasia
>            Assignee: Sorabh Hamirwasia
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.10
>
>
> With DRILL-5015 we allowed support for specifying multiple Drillbits in 
> connection string and randomly choosing one out of it. Over time some of the 
> Drillbits specified in the connection string may die and the client can fail 
> to connect to Foreman node if random selection happens to be of dead Drillbit.
> Even if ZooKeeper is used for selecting a random Drillbit from the registered 
> one there is a small window when client selects one Drillbit and then that 
> Drillbit went down. The client will fail to connect to this Drillbit and 
> error out. 
> Instead if we try multiple Drillbits (configurable tries count through 
> connection string) then the probability of hitting this error window will 
> reduce in both the cases improving fault tolerance. During further 
> investigation it was also found that if there is Authentication failure then 
> we throw that error as generic RpcException. We need to improve that as well 
> to capture this case explicitly since in case of Auth failure we don't want 
> to try multiple Drillbits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to