Recently we upgraded from storm 0.9.1 to 0.9.3. One observation is that,
when a downstream bolt is reassigned to a different server, the upstream
spout will bail out after reconnecting for 50 minutes -  (1) default max
retries is 300; (2) every retry times out in 10 seconds. see code pointer
and log below. We end up setting the max retries to 6 (so max time in retry
is 6 x 10seconds = 1 minute).

*One question we have is regarding the reconnect to downstream bolt. *

When a downstream bolt is reassigned to a different node, we expect the
upstream spout/bolt reload topology allocation from Zookeeper after
reconnect fails to reach downstream bolt. However, current behavior is
that, the upstream spout/bolt just exits when reconnect to downstream bolt
fails; after it is restarted, it loads the right topology allocation.

Comments?

Code pointers and logs -

https://github.com/apache/storm/blob/v0.9.3/conf/defaults.yaml



storm.messaging.netty.max_retries: 300


https://raw.githubusercontent.com/apache/storm/v0.9.3/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java



    /**

     * We will retry connection with exponential back-off policy

     */

    private synchronized void connect() {

        try {



            Channel channel = channelRef.get();

            if (channel != null && channel.isConnected()) {

                return;

            }



            int tried = 0;

            //setting channel to null to make sure we throw an exception
when reconnection fails

            channel = null;

            while (tried <= max_retries) {



                LOG.info("Reconnect started for {}... [{}]", name(), tried);

                LOG.debug("connection started...");



                ChannelFuture future = bootstrap.connect(remote_addr);

                future.awaitUninterruptibly();

                Channel current = future.getChannel();

                if (!future.isSuccess()) {

                    if (null != current) {

                        current.close();

                    }

                } else {

                    channel = current;

                    break;

                }

                Thread.sleep(retryPolicy.getSleepTimeMs(tried, 0));

                tried++;

            }

            if (null != channel) {

                LOG.info("connection established to a remote host " +
name() + ", " + channel.toString());

                channelRef.set(channel);

            } else {

                close();

                throw new RuntimeException("Remote address is not
reachable. We will close this client " + name());

            }

        } catch (InterruptedException e) {

            throw new RuntimeException("connection failed " + name(), e);

        }

    }



The log for reference -


2015-07-20 15:55:48.288 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-CH1SCH060020712/10.153.3.160:6711... [0]

2015-07-20 15:55:48.289 b.s.m.n.Client [DEBUG] connection started...

2015-07-20 15:55:48.288 b.s.m.n.Client [DEBUG] 1 request(s) sent

2015-07-20 15:55:48.289 b.s.m.n.Client [DEBUG] 1 request(s) sent



2015-07-20 15:55:58.394 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-CH1SCH060020712/10.153.3.160:6711... [1]

2015-07-20 15:55:58.394 b.s.m.n.Client [DEBUG] connection started...



2015-07-20 15:56:08.501 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-CH1SCH060020712/10.153.3.160:6711... [2]

2015-07-20 15:56:08.501 b.s.m.n.Client [DEBUG] connection started...

…

2015-07-20 16:46:04.652 b.s.d.worker [DEBUG] Doing heartbeat
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1437435964,
:storm-id "BingLayercakeAllLayersCh1dTest2-48-1437429062", :executors
#{[328 328] [-1 -1]}, :port 6706}

2015-07-20 16:46:04.951 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-CH1SCH060020712/10.153.3.160:6711... [284]

2015-07-20 16:46:04.951 b.s.m.n.Client [DEBUG] connection started...

...

2015-07-20 16:48:59.798 b.s.m.n.Client [INFO] Reconnect started for
Netty-Client-CH1SCH060020712/10.153.3.160:6711... [300]

2015-07-20 16:48:59.798 b.s.m.n.Client [DEBUG] connection started...

..

2015-07-20 16:49:10.743 b.s.m.n.Client [INFO] Closing Netty Client
Netty-Client-CH1SCH060020712/10.153.3.160:6711

2015-07-20 16:49:10.743 b.s.m.n.Client [INFO] Waiting for pending batchs to
be sent with Netty-Client-CH1SCH060020712/10.153.3.160:6711..., timeout:
600000ms, pendings: 0

2015-07-20 16:49:10.743 b.s.m.n.Client [DEBUG] channel CH1SCH060020712/
10.153.3.160:6711 closed

2015-07-20 16:49:10.744 b.s.util [ERROR] Async loop died!

java.lang.RuntimeException: java.lang.RuntimeException: Remote address is
not reachable. We will close this client Netty-Client-CH1SCH060020712/
10.153.3.160:6711

Reply via email to