[ https://issues.apache.org/jira/browse/STORM-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rick Kellogg updated STORM-128:
-------------------------------
Component/s: storm-core
> Topology fails to start if a configured DRPC server is down
> -----------------------------------------------------------
>
> Key: STORM-128
> URL: https://issues.apache.org/jira/browse/STORM-128
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Reporter: James Xu
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/696
> In our environment we have 3 DRPC servers running. This was done mainly for
> availability and capacity. However, we noticed that when even one of these
> servers is down, topologies fail to start with the following exception:
> java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.NoRouteToHostException: No route to host
> at backtype.storm.drpc.DRPCInvocationsClient.<init>(DRPCInvocationsClient.java:23)
> at backtype.storm.drpc.DRPCSpout.open(DRPCSpout.java:65)
> at storm.trident.spout.RichSpoutBatchTriggerer.open(RichSpoutBatchTriggerer.java:41)
> at backtype.storm.daemon.executor$fn__3985$fn__3997.invoke(executor.clj:460)
> at backtype.storm.util$async_loop$fn__465.invoke(util.clj:375)
> at clojure.lang.AFn.run(AFn.java:24)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.thrift7.transport.TTransportException: java.net.NoRouteToHostException: No route to host
> at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
> at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
> at backtype.storm.drpc.DRPCInvocationsClient.connect(DRPCInvocationsClient.java:30)
> at backtype.storm.drpc.DRPCInvocationsClient.<init>(DRPCInvocationsClient.java:21)
> ... 6 more
> Caused by: java.net.NoRouteToHostException: No route to host
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
> at java.net.Socket.connect(Socket.java:579)
> at org.apache.thrift7.transport.TSocket.open(TSocket.java:178)
> ... 9 more
> I was wondering if it makes sense to make Storm handle this gracefully
> instead of failing fast. Otherwise, the DRPC servers become a SPOF.
> If a topology is already running, it usually just logs an error message and
> continues.
> ----------
> dkador: +1 on figuring out how to make the DRPC stuff not a SPOF. I'd be happy
> to look into it myself, but I'm not sure where to start. Any guidance?
> ----------
> rijuk: For reference, the stack trace I see when a DRPC server goes down
> while a topology is running is the following. In this case, the topology
> continues to function normally.
> [backtype.storm.drpc.DRPCSpout Thread-65]: Failed to fetch DRPC result from DRPC server
> org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused
> at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
> at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
> at backtype.storm.drpc.DRPCInvocationsClient.connect(DRPCInvocationsClient.java:30)
> at backtype.storm.drpc.DRPCInvocationsClient.fetchRequest(DRPCInvocationsClient.java:53)
> at backtype.storm.drpc.DRPCSpout.nextTuple(DRPCSpout.java:89)
> at storm.trident.spout.RichSpoutBatchTriggerer.nextTuple(RichSpoutBatchTriggerer.java:68)
> at backtype.storm.daemon.executor$fn__3985$fn__3997$fn__4026.invoke(executor.clj:502)
> at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
> at clojure.lang.AFn.run(AFn.java:24)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
> at java.net.Socket.connect(Socket.java:579)
> at org.apache.thrift7.transport.TSocket.open(TSocket.java:178)
> ... 9 more
> In this case the host was up, but the DRPC server process was down, hence the
> ConnectException. The behavior is the same when the host is unreachable,
> except for the exception type.
> @dkador, I'm not sure what the right solution is. One naive solution I can
> think of is to make DRPCInvocationsClient constructor rethrow TException
> instead of throwing a RuntimeException. Obviously, you'll have to make sure
> that all callers of this higher up in the stack handle this exception
> properly.
> Actually, on second thoughts that's not a good idea. You probably still want
> the DRPCInvocationsClient object to be constructed. So, maybe you can log an
> error and just eat that exception. All other methods in that class call
> "connect" if necessary anyway.
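
The "log and eat the exception" idea from the last comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Storm code or the fix that was eventually applied: Transport, LazyClient and FlakyTransport are hypothetical stand-ins (Transport plays the role of the Thrift transport, LazyClient the role of DRPCInvocationsClient), and IOException stands in for TException so the example compiles without Storm or Thrift on the classpath.

```java
import java.io.IOException;

public class LazyDrpcClientSketch {

    /** Hypothetical stand-in for a Thrift transport; not a Storm or Thrift API. */
    public interface Transport {
        void open() throws IOException;
        String fetch() throws IOException;
    }

    /** Client whose constructor tolerates a failed first connection attempt. */
    public static class LazyClient {
        private final Transport transport;
        private boolean connected = false;

        public LazyClient(Transport transport) {
            this.transport = transport;
            try {
                connect();
            } catch (IOException e) {
                // Log and swallow, so the object is still constructed
                // and the caller (e.g. a spout's open()) does not fail.
                System.err.println("Initial connect failed, will retry later: " + e.getMessage());
            }
        }

        private void connect() throws IOException {
            transport.open();
            connected = true;
        }

        public boolean isConnected() {
            return connected;
        }

        /** Reconnects first if needed; a failure here propagates to the caller. */
        public String fetchRequest() throws IOException {
            if (!connected) {
                connect();
            }
            try {
                return transport.fetch();
            } catch (IOException e) {
                connected = false; // force a reconnect on the next call
                throw e;
            }
        }
    }

    /** Test transport that refuses the first `failures` calls to open(). */
    public static class FlakyTransport implements Transport {
        private int failuresLeft;

        public FlakyTransport(int failures) {
            this.failuresLeft = failures;
        }

        @Override
        public void open() throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("No route to host");
            }
        }

        @Override
        public String fetch() {
            return "request";
        }
    }

    public static void main(String[] args) throws IOException {
        // The server is "down" at construction time and "up" on the next attempt.
        LazyClient client = new LazyClient(new FlakyTransport(1));
        System.out.println(client.isConnected());  // false
        System.out.println(client.fetchRequest()); // request
        System.out.println(client.isConnected());  // true
    }
}
```

With this shape, a server that is down at startup only costs a logged error. Later calls retry the connection, which matches the behavior already described for topologies that are running when a DRPC server goes down.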