[ 
https://issues.apache.org/jira/browse/CURATOR-229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138245#comment-16138245
 ] 

Ionuț G. Stan edited comment on CURATOR-229 at 8/23/17 11:37 AM:
-----------------------------------------------------------------

We've bumped into the same issue. Our DNS server was temporarily down and 
Curator stopped retrying to connect because ZooKeeper threw a non-retryable 
exception ({{UnknownHostException}}). In ZooKeeper >= 3.4.11 it throws an 
{{IllegalArgumentException}} instead. This behaviour changed as a result of these:

  - https://issues.apache.org/jira/browse/ZOOKEEPER-1576
  - https://issues.apache.org/jira/browse/ZOOKEEPER-2614

What we ended up doing was registering a custom {{ZookeeperFactory}} with
{{CuratorFrameworkFactory.builder()}}. That factory is responsible for creating 
new {{ZooKeeper}} instances when retrying. So we just catch 
{{UnknownHostException}} and {{IllegalArgumentException}} there and throw 
a {{ConnectionLossException}} instead, which is retryable as far as Curator is 
concerned.

In case anyone's interested, here's the code, in Scala:

{code:title=ZookeeperFactory.scala|borderStyle=solid}
import java.net.UnknownHostException
import com.typesafe.scalalogging.LazyLogging
import org.apache.zookeeper.KeeperException.ConnectionLossException
import org.apache.zookeeper.{Watcher, ZooKeeper}

/** ZooKeeper client factory that's resilient to hostname lookup errors.
  *
  * The purpose of this wrapper is to handle hostname errors encountered
  * while creating ZooKeeper client instances. It works around these issues:
  *
  *   - https://issues.apache.org/jira/browse/ZOOKEEPER-1576
  *   - https://issues.apache.org/jira/browse/ZOOKEEPER-2614
  *   - https://issues.apache.org/jira/browse/CURATOR-229
  *
  * Curator knows how to retry a finite and predefined set of exceptions. What
  * this custom factory does is to map hostname-related exceptions into one
  * that Curator interprets as a retry-able exception. So it will keep trying
  * to establish a connection to ZooKeeper even in the face of such errors.
  *
  * @param servers The list of ZooKeeper hostnames or addresses.
  */
class ZookeeperFactory(servers: Seq[String])
  extends org.apache.curator.utils.ZookeeperFactory
    with LazyLogging {

  override def newZooKeeper(
    connectString: String,
    sessionTimeout: Int,
    watcher: Watcher,
    canBeReadOnly: Boolean
  ): ZooKeeper = {
    def retry(servers: Seq[String]): ZooKeeper = {
      servers match {
        case Nil =>
          // All server hostnames have failed. Tell Curator to retry later.
          throw new ConnectionLossException()
        case remainingServers =>
          val connectString = remainingServers.mkString(",")

          try {

            new ZooKeeper(connectString, sessionTimeout, watcher, canBeReadOnly)

          } catch {
            // Apache ZooKeeper <= 3.4.10 will throw an UnknownHostException at
            // the first hostname which it can't resolve, instead of trying the
            // following hostnames in the list. So we drop the first hostname
            // from the servers list and try again with the rest.
            case e: UnknownHostException =>
              logger.warn(s"ZooKeeper client creation failed for server list: $connectString", e)
              retry(remainingServers.drop(1))

            // Apache ZooKeeper >= 3.4.11 will try all hostnames, but we still
            // want to retry if all of them fail right now.
            case EmptyHostProvider(e) =>
              logger.warn(s"ZooKeeper client creation failed for server list: $connectString", e)
              throw new ConnectionLossException()
          }
      }
    }

    retry(servers)
  }
}

object EmptyHostProvider {
  private final val MESSAGE = "A HostProvider may not be empty!"

  def unapply(e: Throwable): Option[IllegalArgumentException] = {
    e match {
      case e: IllegalArgumentException if e.getMessage == MESSAGE => Some(e)
      case _ => None
    }
  }
}
{code}

And its usage:

{code}
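import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.RetryForever

val zk = CuratorFrameworkFactory.builder()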
    .connectString(config.servers)
    .sessionTimeoutMs(...)
    .connectionTimeoutMs(...)
    .zookeeperFactory(new ZookeeperFactory(config.servers.split(',')))
    .retryPolicy(new RetryForever(1000))
    .build()
{code}
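
The decision about which exceptions get mapped to a retryable one can also be checked on its own, without ZooKeeper or Curator on the classpath. Below is a minimal sketch of just that classification step, using only the JDK. The names {{HostnameFailure}} and {{isHostnameFailure}} are hypothetical and are not part of Curator or ZooKeeper. The message string is the one matched by {{EmptyHostProvider}} above, and matching on it is an assumption that may not hold across ZooKeeper versions.

```scala
import java.net.UnknownHostException

// Hypothetical helper, not part of Curator or ZooKeeper.
object HostnameFailure {
  // Same message the EmptyHostProvider extractor above matches on.
  // Assumption: the text may differ between ZooKeeper releases.
  private val EmptyProviderMessage = "A HostProvider may not be empty!"

  /** True when `e` is one of the two hostname-related failures
    * that the custom factory maps to a ConnectionLossException.
    */
  def isHostnameFailure(e: Throwable): Boolean = e match {
    case _: UnknownHostException       => true
    // Scala's == is null-safe, so a missing message simply yields false.
    case iae: IllegalArgumentException => iae.getMessage == EmptyProviderMessage
    case _                             => false
  }
}
```

Any other {{IllegalArgumentException}} is deliberately left alone, so genuine configuration mistakes still surface instead of being retried forever.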




> No retry on DNS lookup failure
> ------------------------------
>
>                 Key: CURATOR-229
>                 URL: https://issues.apache.org/jira/browse/CURATOR-229
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.7.0
>            Reporter: Michael Putters
>
> Our environment is set up so that host names (rather than IP addresses) are 
> used when registering services.
> When disconnecting a node from the network, it will attempt to reconnect and 
> - in order to do this - attempts to resolve a host name, which fails (since 
> we have no network connectivity and a DNS server is used).
> It appears this type of exception is not retryable, and the node simply gives 
> up and never reconnects, even when the network connectivity is back.
> Is this the expected behavior? Is there any way to configure Curator so that 
> this type of exception is retryable? I had a look at 
> {{CuratorFrameworkImpl.java}} around line 768 but there doesn't seem to be 
> anything configurable.
> If this is not the expected behavior (or if it is but you don't mind making 
> it configurable), I should be able to provide a patch via a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
