Nic Eggert created GIRAPH-1145:
----------------------------------

             Summary: nextChannel: No channels exist! error when channel is 
trying to reconnect in another thread
                 Key: GIRAPH-1145
                 URL: https://issues.apache.org/jira/browse/GIRAPH-1145
             Project: Giraph
          Issue Type: Bug
          Components: bsp
    Affects Versions: 1.2.0
            Reporter: Nic Eggert


The method {{NettyClient.getNextChannel}} has a mechanism to detect when a 
channel is no longer active. In this case, it removes it from the 
{{ChannelRotator}} while it tries to reconnect, then re-adds it once successful.

When there are more client threads than channels, it is possible for a client 
thread to call {{ChannelRotator.nextChannel}} while the rotator is empty because 
all channels are trying to reconnect. This throws 
{{IllegalArgumentException("nextChannel: No channels exist!")}}, which kills 
the worker.

Instead, the calling thread should have some way of knowing that a channel is 
currently reconnecting so that it can wait for it. If the reconnection fails 
after the specified number of retries, the thread performing the reconnect 
will throw an exception and fail the worker, so there's no concern about the 
waiting thread hanging here.
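
A minimal sketch of that idea, for discussion only. This is not the actual 
Giraph code: {{WaitingRotator}}, {{beginReconnect}} and {{endReconnect}} are 
hypothetical names, and the real {{ChannelRotator}} would need an equivalent 
hook from {{NettyClient.getNextChannel}}. The rotator counts channels that are 
mid-reconnect, and {{nextChannel}} blocks while the list is empty but a 
reconnect is still pending.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for ChannelRotator; names are illustrative only. */
class WaitingRotator<C> {
  private final List<C> channels = new ArrayList<>();
  /** Number of channels removed and currently being reconnected. */
  private int reconnecting = 0;
  private int index = 0;

  synchronized void addChannel(C channel) {
    channels.add(channel);
    notifyAll();
  }

  /** Remove an inactive channel and record that a reconnect is pending. */
  synchronized void beginReconnect(C channel) {
    if (channels.remove(channel)) {
      reconnecting++;
    }
  }

  /**
   * Called by the reconnecting thread when it finishes. Pass the new channel
   * on success, or null once all retries have failed.
   */
  synchronized void endReconnect(C replacement) {
    reconnecting--;
    if (replacement != null) {
      channels.add(replacement);
    }
    notifyAll(); // wake any threads blocked in nextChannel()
  }

  /** Round-robin over channels, waiting if all of them are reconnecting. */
  synchronized C nextChannel() {
    while (channels.isEmpty() && reconnecting > 0) {
      try {
        wait();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("nextChannel: interrupted", e);
      }
    }
    if (channels.isEmpty()) {
      // Nothing available and nothing pending: same failure as today.
      throw new IllegalArgumentException("nextChannel: No channels exist!");
    }
    C channel = channels.get(index % channels.size());
    index = (index + 1) % channels.size();
    return channel;
  }
}
```

With this shape, the existing exception is still thrown when the rotator is 
empty and no reconnect is in flight, so a failed reconnect 
({{endReconnect(null)}}) releases the waiters instead of leaving them blocked.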

A workaround is to ensure that {{giraph.channelsPerServer}} >= 
{{giraph.nettyClientThreads}}, but this is often not desirable in cases with 
many workers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
