[
https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749378#comment-15749378
]
Paulo Motta commented on CASSANDRA-12905:
-----------------------------------------
bq. But by making hint delivery async, you made it droppable again. I guess
this is not intentional?
Sorry, should have made it a bit clearer since this was a change from the
original idea. Basically, the hint sender will consider the request failed
after {{write_request_timeout_in_ms}} and retry, so it does not make sense for
the receiver to retry applying hints for longer than that since this can cause
the node to become overloaded.
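To illustrate the bounded retry described above, here is a minimal sketch, not the actual patch. The class name {{BoundedHintApply}}, the method {{applyWithDeadline}}, and the {{attemptMs}} parameter are hypothetical; {{timeoutMs}} stands in for {{write_request_timeout_in_ms}}. The receiver keeps trying to take a lock until the deadline passes, then gives up and leaves the retry to the sender.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

// Hypothetical sketch: names and structure are assumptions, not Cassandra code.
final class BoundedHintApply {
    /**
     * Retries acquiring the lock until timeoutMs has elapsed, waiting at most
     * attemptMs per attempt. Returns true if the task ran, false if the
     * deadline passed (or the thread was interrupted) before the lock was taken.
     */
    static boolean applyWithDeadline(Lock lock, Runnable apply, long timeoutMs, long attemptMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false; // the sender has already given up; stop retrying
            }
            long wait = Math.min(remaining, TimeUnit.MILLISECONDS.toNanos(attemptMs));
            try {
                if (lock.tryLock(wait, TimeUnit.NANOSECONDS)) {
                    try {
                        apply.run();
                        return true;
                    } finally {
                        lock.unlock();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

Capping the retry window at the sender's timeout means the receiver never spends work on a hint that the sender already considers failed.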
I believe that making hints deferrable will free up some resources and reduce
contention, which should reduce the number of WTEs. Furthermore, things like
CASSANDRA-10981 should also help in this case. The issue with MV hints timing
out can be further improved by reducing the {{hinted_handoff_throttle_in_kb}}.
If this is not sufficient, we can make a new ticket to try to improve this or
use a lower throttle for MV tables.
bq. You also do not reply on an exception. I am not familiar with
request/response handling but I guess an exception (like WTE) will just drop
the hint and let the hint-sender wait for a reply infinitely or until it times
out?
I just kept the old behavior of responding only in case of success. If there
is a failure, the sender will retry after the timeout anyway.
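The sender side of this can be sketched roughly as follows. This is an assumption-laden illustration, not the real hint dispatch code: {{HintSenderSketch}}, {{sendWithRetry}}, and {{maxAttempts}} are hypothetical names, and a future that never completes models a receiver that failed and sent no reply.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch: names and structure are assumptions, not Cassandra code.
final class HintSenderSketch {
    /**
     * Sends a hint and waits up to timeoutMs for a success reply. A missing
     * reply is treated as a failure and the hint is sent again. Returns the
     * attempt number that succeeded, or -1 if every attempt failed.
     */
    static int sendWithRetry(Supplier<CompletableFuture<Void>> send, long timeoutMs, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                send.get().get(timeoutMs, TimeUnit.MILLISECONDS);
                return attempt; // receiver replied, so the hint was applied
            } catch (TimeoutException e) {
                // No reply within the timeout: the receiver only answers on success, so retry.
            } catch (ExecutionException e) {
                // The send itself failed: also retry.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return -1;
            }
        }
        return -1;
    }
}
```

Because the timeout already covers the failure case, the receiver does not need a separate failure response for the sender to make progress.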
Test results look good so far. I added a regression dtest to make sure we test
against this in the future ([pull
request|https://github.com/riptano/cassandra-dtest/pull/1408]), and submitted a
final CI round. If everything looks good I will squash and commit.
Thanks!
> Retry acquire MV lock on failure instead of throwing WTE on streaming
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-12905
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12905
> Project: Cassandra
> Issue Type: Bug
> Components: Streaming and Messaging
> Environment: centos 6.7 x86_64
> Reporter: Nir Zilka
> Assignee: Benjamin Roth
> Priority: Critical
> Fix For: 3.10
>
>
> Hello,
> I performed two upgrades to the current cluster (currently 15 nodes, 1 DC,
> private VLAN),
> first it was 2.2.5.1 and repair worked flawlessly,
> second upgrade was to 3.0.9 (with upgradesstables) and also repair worked
> well,
> then i upgraded 2 weeks ago to 3.9 - and the repair problems started.
> there are several error types in the system.log (different nodes):
> - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx
> - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation
> timed out - received only 0 responses
> - Remote peer xxx.xxx.xxx.xxx failed stream session
> - Session completed with the following error
> org.apache.cassandra.streaming.StreamException: Stream failed
> ----
> i use 3.9 default configuration with the cluster settings adjustments (3
> seeds, GossipingPropertyFileSnitch).
> streaming_socket_timeout_in_ms is the default (86400000).
> i'm afraid of consistency problems while i'm not performing repair.
> Any ideas?
> Thanks,
> Nir.