sankalp kohli created CASSANDRA-6747:
----------------------------------------

             Summary: MessagingService should handle failures on remote nodes.
                 Key: CASSANDRA-6747
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
             Project: Cassandra
          Issue Type: Improvement
            Reporter: sankalp kohli
            Priority: Minor


While going through the MessagingService code, I discovered that we don't 
handle callbacks on failure very well. If a verb handler on the remote machine 
throws an exception, it goes straight to the uncaught exception handler. The 
machine that sent the message keeps waiting and eventually times out. On 
timeout, it does some things that are hard-coded in the MS, like hints and 
adding to latency. There is no way in IAsyncCallback to specify what to do on 
timeouts or on failures. 

Here are some examples I found that would benefit if we enhance this system to 
also propagate failures back, so that IAsyncCallback has methods like 
onFailure.
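
As a rough sketch of what such a callback could look like: the interface name
FailureAwareCallback, the onTimeout method, and the RecordingCallback class are
illustrative assumptions, not existing Cassandra code, and the default method
assumes Java 8 or later.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

// Hypothetical callback interface: the names here are illustrative
// assumptions, not the existing IAsyncCallback API.
interface FailureAwareCallback<T>
{
    // Called when the remote node replies successfully.
    void response(T msg);

    // Called when the remote verb handler reports an error.
    void onFailure(InetAddress from, Throwable cause);

    // Called when no reply arrives in time; treated as a failure by default.
    default void onTimeout(InetAddress from)
    {
        onFailure(from, new TimeoutException("No reply from " + from));
    }
}

// Small implementation that records what happened, to show the call flow.
class RecordingCallback implements FailureAwareCallback<String>
{
    final List<String> events = new ArrayList<>();

    @Override
    public void response(String msg)
    {
        events.add("response:" + msg);
    }

    @Override
    public void onFailure(InetAddress from, Throwable cause)
    {
        events.add("failure:" + cause.getClass().getSimpleName());
    }
}
```

With something along these lines, each caller could decide what a failure or
timeout means for it, instead of relying on behaviour hard-coded in the MS.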

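1) From ActiveRepairService.prepareForRepair. If a neighbour fails to handle 
the prepare message, the latch is never counted down for it and we wait for 
the full hour.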

   IAsyncCallback callback = new IAsyncCallback()
       {
           @Override
           public void response(MessageIn msg)
           {
               prepareLatch.countDown();
           }

           @Override
           public boolean isLatencyForSnitch()
           {
               return false;
           }
       };

       List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
       for (ColumnFamilyStore cfs : columnFamilyStores)
           cfIds.add(cfs.metadata.cfId);

       for(InetAddress neighbour : endpoints)
       {
           PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
           MessageOut<RepairMessage> msg = message.createMessage();
           MessagingService.instance().sendRR(msg, neighbour, callback);
       }
       try
       {
           prepareLatch.await(1, TimeUnit.HOURS);
       }
       catch (InterruptedException e)
       {
           parentRepairSessions.remove(parentRepairSession);
           throw new RuntimeException("Did not get replies from all endpoints.", e);
       }

2) During snapshot phase in repair, if SnapshotVerbHandler throws an exception, 
we will wait forever. 
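
To illustrate how a failure notification could shorten these waits, here is a
minimal sketch of a latch wrapper that releases the waiting thread as soon as
any endpoint reports a failure. PrepareTracker and its methods are hypothetical
names used only for this example; it assumes the messaging layer would call
onFailure() when a remote handler throws.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: tracks replies from a fixed number of endpoints and
// lets the waiter stop early when one of them fails.
final class PrepareTracker
{
    private final CountDownLatch latch;
    private final AtomicBoolean failed = new AtomicBoolean(false);

    PrepareTracker(int expectedReplies)
    {
        latch = new CountDownLatch(expectedReplies);
    }

    // A successful reply from one endpoint.
    void response()
    {
        latch.countDown();
    }

    // A failure from any endpoint: mark the attempt failed and release the waiter.
    void onFailure()
    {
        failed.set(true);
        while (latch.getCount() > 0)
            latch.countDown();
    }

    // Returns true only if every endpoint replied and none failed.
    boolean awaitSuccess(long timeout, TimeUnit unit) throws InterruptedException
    {
        boolean completed = latch.await(timeout, unit);
        return completed && !failed.get();
    }
}
```

The caller would then check the return value of awaitSuccess and clean up the 
session right away, rather than blocking until the timeout expires.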



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
