Nic Eggert created GIRAPH-1145:
----------------------------------
Summary: nextChannel: No channels exist! error when channel is
trying to reconnect in another thread
Key: GIRAPH-1145
URL: https://issues.apache.org/jira/browse/GIRAPH-1145
Project: Giraph
Issue Type: Bug
Components: bsp
Affects Versions: 1.2.0
Reporter: Nic Eggert
The method {{NettyClient.getNextChannel}} has a mechanism to detect when a
channel is no longer active. In this case, it removes the channel from the
{{ChannelRotator}} while it tries to reconnect, then re-adds it once the
reconnection succeeds.
When there are more client threads than channels, a client thread can call
{{ChannelRotator.nextChannel}} while the rotator is empty because all channels
are trying to reconnect. This throws {{IllegalArgumentException("nextChannel:
No channels exist!")}}, which kills the worker.
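The interleaving can be modeled with a stripped-down rotator. This is only an illustrative sketch: {{SimpleRotator}} and {{ReconnectRaceDemo}} are hypothetical names, the channels are plain strings, and the two threads are replayed sequentially in the problematic order rather than run concurrently. It is not the actual Giraph code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a channel rotator (not Giraph's class). */
class SimpleRotator {
  private final List<String> channels = new ArrayList<>();
  private int index = 0;

  synchronized void addChannel(String channel) {
    channels.add(channel);
  }

  synchronized void removeChannel(String channel) {
    channels.remove(channel);
  }

  /** Round-robin over the channels; fails immediately if none are present. */
  synchronized String nextChannel() {
    if (channels.isEmpty()) {
      throw new IllegalArgumentException("nextChannel: No channels exist!");
    }
    if (index >= channels.size()) {
      index = 0;
    }
    return channels.get(index++);
  }
}

/** Replays the problematic ordering of two client threads on one channel. */
class ReconnectRaceDemo {
  static String run() {
    SimpleRotator rotator = new SimpleRotator();
    rotator.addChannel("channel-0");

    // "Thread A" sees that channel-0 is inactive and removes it
    // while it attempts to reconnect.
    rotator.removeChannel("channel-0");

    // "Thread B" asks for a channel before thread A has re-added it.
    try {
      rotator.nextChannel();
      return "ok";
    } catch (IllegalArgumentException e) {
      return e.getMessage();
    }
  }
}
```

With one channel and two client threads, the second caller sees an empty rotator even though the first is still working on the reconnect.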
Instead, the calling thread should have some way of knowing that a channel is
currently reconnecting so that it can wait for it. If the reconnection fails
after the specified number of retries, the reconnecting thread will throw an
exception and fail the worker, so there's no concern about the waiting thread
hanging here.
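One possible shape for this, as a minimal sketch only: the rotator counts in-flight reconnects, and {{nextChannel}} blocks while the rotator is empty and at least one reconnect is pending. The class {{WaitingChannelRotator}} and its methods {{removeForReconnect}}, {{reconnectSucceeded}} and {{reconnectFailed}} are hypothetical names invented for illustration; they are not part of Giraph, and a real patch would have to fit the existing {{NettyClient}} retry logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a rotator that waits for pending reconnects. */
class WaitingChannelRotator<C> {
  private final List<C> channels = new ArrayList<>();
  private int index = 0;
  /** Number of channels removed and currently being reconnected. */
  private int reconnecting = 0;

  public synchronized void addChannel(C channel) {
    channels.add(channel);
    notifyAll();
  }

  /** Remove an inactive channel and record that a reconnect is in flight. */
  public synchronized boolean removeForReconnect(C channel) {
    boolean removed = channels.remove(channel);
    if (removed) {
      reconnecting++;
    }
    return removed;
  }

  /** Called by the reconnecting thread once it has a working channel. */
  public synchronized void reconnectSucceeded(C channel) {
    reconnecting--;
    channels.add(channel);
    notifyAll();
  }

  /** Called by the reconnecting thread when it has exhausted its retries. */
  public synchronized void reconnectFailed() {
    reconnecting--;
    notifyAll();
  }

  /** Round-robin; waits only while empty AND a reconnect is pending. */
  public synchronized C nextChannel() {
    while (channels.isEmpty() && reconnecting > 0) {
      try {
        wait();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("nextChannel: interrupted", e);
      }
    }
    if (channels.isEmpty()) {
      // Nothing present and nothing pending: same failure as before.
      throw new IllegalArgumentException("nextChannel: No channels exist!");
    }
    if (index >= channels.size()) {
      index = 0;
    }
    return channels.get(index++);
  }
}
```

In this sketch the reconnecting thread would call {{removeForReconnect}} before retrying and then exactly one of {{reconnectSucceeded}} or {{reconnectFailed}}. Waiters are woken in both cases, so a failed reconnect still surfaces as an exception rather than a hang.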
A workaround is to ensure that {{giraph.channelsPerServer}} >=
{{giraph.nettyClientThreads}}, but this is often not desirable in cases with
many workers.