[
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sankalp kohli reassigned CASSANDRA-6747:
----------------------------------------
Assignee: sankalp kohli
> MessagingService should handle failures on remote nodes.
> --------------------------------------------------------
>
> Key: CASSANDRA-6747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
> Project: Cassandra
> Issue Type: Improvement
> Reporter: sankalp kohli
> Assignee: sankalp kohli
> Priority: Minor
> Labels: Core
> Attachments: CASSANDRA-6747.diff
>
>
> While going through the code of MessagingService, I discovered that we don't
> handle callbacks on failure very well. If a verb handler on the remote
> machine throws an exception, it goes straight to the uncaught exception
> handler. The machine that sent the message keeps waiting and eventually
> times out. On timeout, it does some things that are hard-coded in the MS,
> such as hints and adding to latency. There is no way in IAsyncCallback to
> specify what to do on timeouts or on failures.
> Here are some examples I found where it would help to enhance this system
> to also propagate failures back, so that IAsyncCallback has methods like
> onFailure.
> 1) From ActiveRepairService.prepareForRepair
> IAsyncCallback callback = new IAsyncCallback()
> {
>     @Override
>     public void response(MessageIn msg)
>     {
>         prepareLatch.countDown();
>     }
>
>     @Override
>     public boolean isLatencyForSnitch()
>     {
>         return false;
>     }
> };
>
> List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
> for (ColumnFamilyStore cfs : columnFamilyStores)
>     cfIds.add(cfs.metadata.cfId);
>
> for (InetAddress neighbour : endpoints)
> {
>     PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
>     MessageOut<RepairMessage> msg = message.createMessage();
>     MessagingService.instance().sendRR(msg, neighbour, callback);
> }
>
> try
> {
>     prepareLatch.await(1, TimeUnit.HOURS);
> }
> catch (InterruptedException e)
> {
>     parentRepairSessions.remove(parentRepairSession);
>     throw new RuntimeException("Did not get replies from all endpoints.", e);
> }
> 2) During snapshot phase in repair, if SnapshotVerbHandler throws an
> exception, we will wait forever.
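
To make the proposal concrete, here is a minimal, self-contained sketch of how a
failure-aware callback could let the caller stop waiting as soon as one remote
node reports an error. Everything in it is an assumption made for illustration:
FailureAwareCallback, Sender and awaitAll are hypothetical stand-ins, not the
actual IAsyncCallback or MessagingService API, and endpoints are plain strings
rather than InetAddress. It is not taken from the attached diff.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class FailureAwareCallbackSketch
{
    /** Hypothetical stand-in for IAsyncCallback with an added failure hook. */
    interface FailureAwareCallback<T>
    {
        void response(T msg);

        void onFailure(String endpoint);
    }

    /** Hypothetical stand-in for a request/response send such as sendRR. */
    interface Sender<T>
    {
        void send(String endpoint, FailureAwareCallback<T> callback);
    }

    /**
     * Sends to every endpoint and waits for all replies.
     * Returns true only if every endpoint replied successfully in time;
     * returns false on the first reported failure, on timeout, or on interrupt.
     */
    static <T> boolean awaitAll(List<String> endpoints, Sender<T> sender, long timeout, TimeUnit unit)
    {
        final CountDownLatch latch = new CountDownLatch(endpoints.size());
        final AtomicBoolean failed = new AtomicBoolean(false);

        FailureAwareCallback<T> callback = new FailureAwareCallback<T>()
        {
            @Override
            public void response(T msg)
            {
                latch.countDown();
            }

            @Override
            public void onFailure(String endpoint)
            {
                failed.set(true);
                // Drain the latch so the waiter is released immediately
                // instead of sitting out the full timeout.
                while (latch.getCount() > 0)
                    latch.countDown();
            }
        };

        for (String endpoint : endpoints)
            sender.send(endpoint, callback);

        try
        {
            // await returns false on timeout, so its result must be checked.
            boolean completed = latch.await(timeout, unit);
            return completed && !failed.get();
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With something along these lines, a caller like prepareForRepair could tell
"all neighbours replied" apart from "a neighbour failed" and "the wait timed
out", and clean up the repair session in the latter two cases instead of
blocking for the full timeout.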