[ https://issues.apache.org/jira/browse/CASSANDRA-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008899#comment-14008899 ]

Joshua McKenzie commented on CASSANDRA-3569:
--------------------------------------------

I'm seeing similar output on the receiving side with a check for skip < 0 in drain:

{code:title=receiving_netstats}
Mode: NORMAL
Repair 78e66860-e4e0-11e3-8b10-0195b332f618
    /192.168.1.31
Repair 7aadbae0-e4e0-11e3-8b10-0195b332f618
    /192.168.1.31
        Receiving 4 files, 2383442 bytes total
Repair 79be51d0-e4e0-11e3-8b10-0195b332f618
    /192.168.1.31
        Receiving 5 files, 866604 bytes total
Repair 7a0a4ef0-e4e0-11e3-8b10-0195b332f618
    /192.168.1.31
        Receiving 5 files, 477981 bytes total
Repair 79673120-e4e0-11e3-8b10-0195b332f618
    /192.168.1.31
        Receiving 5 files, 1014129 bytes total
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed
Commands                        n/a         1             25
Responses                       n/a        76            136
{code}
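
For context, the check in question is just a guard on how many bytes are left to drain; roughly the following (a sketch with approximate names, not the exact source of the streaming read path):
{code:title=drain_skip_check_sketch}
import java.io.IOException;
import java.io.InputStream;

// Sketch only, with approximate names; not the exact Cassandra code.
public abstract class DrainSketch
{
    protected abstract long totalSize();

    protected void drain(InputStream dis, long bytesRead) throws IOException
    {
        long toSkip = totalSize() - bytesRead;

        // new check: if we've already read at least the advertised total,
        // there is nothing left to drain, so don't try to skip a negative amount
        if (toSkip < 0)
            return;

        while (toSkip > 0)
        {
            long skipped = dis.skip(toSkip);
            if (skipped <= 0)   // stream exhausted; don't spin forever
                break;
            toSkip -= skipped;
        }
    }
}
{code}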

That new logic, however, generates the following exception(s):
{code:title=receiving_exception}
ERROR 14:18:11 Exception in thread Thread[NonPeriodicTasks:1,5,main]
java.lang.AssertionError: null
   at org.apache.cassandra.io.util.Memory.free(Memory.java:299) ~[main/:na]
   at org.apache.cassandra.utils.obs.OffHeapBitSet.close(OffHeapBitSet.java:143) ~[main/:na]
   at org.apache.cassandra.utils.BloomFilter.close(BloomFilter.java:116) ~[main/:na]
   at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:341) ~[main/:na]
   at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:326) ~[main/:na]
   at org.apache.cassandra.streaming.StreamReceiveTask$1.run(StreamReceiveTask.java:132) ~[main/:na]
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_55]
   at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_55]
   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_55]
   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_55]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_55]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
{code}
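
For what it's worth, that AssertionError reads like a double free: Memory.free asserts the backing pointer hasn't already been released, so it would fire if abort() ends up closing the writer's bloom filter after it was already closed. Something along these lines would make the close idempotent (a hypothetical sketch, not the actual BloomFilter/OffHeapBitSet code or a proposed patch):
{code:title=idempotent_close_sketch}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: make close() a no-op the second time around so that
// abort() running after a normal close doesn't free the same off-heap
// memory twice, which is what the AssertionError above points at.
public class IdempotentCloseSketch implements AutoCloseable
{
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final AutoCloseable offHeap;   // stand-in for the backing bitset/Memory

    public IdempotentCloseSketch(AutoCloseable offHeap)
    {
        this.offHeap = offHeap;
    }

    @Override
    public void close() throws Exception
    {
        if (closed.compareAndSet(false, true))
            offHeap.close();   // free the backing memory exactly once
    }
}
{code}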

It looks like the SessionInfo entries for these plans aren't getting cleared out 
for some reason.  While I can't reproduce that behavior on the sending side, 
hopefully cleaning that up on the receiving side will shed some light on why 
you're seeing that output on the sender.

> Failure detector downs should not break streams
> -----------------------------------------------
>
>                 Key: CASSANDRA-3569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3569
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Peter Schuller
>            Assignee: Joshua McKenzie
>             Fix For: 2.1.1
>
>         Attachments: 3569-2.0.txt, 3569_v1.txt
>
>
> CASSANDRA-2433 introduced this behavior just to keep repairs from sitting 
> there waiting forever. In my opinion the correct fix to that problem is to 
> use TCP keep alive. Unfortunately the TCP keep alive period is insanely high 
> by default on a modern Linux, so just doing that is not entirely good either.
> But using the failure detector seems nonsensical to me. We have a 
> communication method, the TCP transport, that we know is used for 
> long-running processes that you don't want incorrectly killed for no good 
> reason, and yet we are using a failure detector, one tuned for deciding when 
> not to send latency-sensitive requests to nodes, to actively kill a working 
> connection.
> So, rather than add complexity with protocol-based ping/pongs and such, I 
> propose that we simply use TCP keep alive for streaming connections and 
> instruct operators of production clusters to tweak 
> net.ipv4.tcp_keepalive_{probes,intvl} as appropriate (or whatever the 
> equivalent is on their OS).
> I can submit the patch. Awaiting opinions.
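
For reference, the change proposed in the description above is small on the Cassandra side: enable SO_KEEPALIVE on the streaming socket and let the OS settings govern how aggressively dead peers are probed. A minimal sketch (assumed helper, not the ticket's actual patch):
{code:title=streaming_keepalive_sketch}
import java.net.Socket;
import java.net.SocketException;

// Minimal sketch of the proposal: turn on SO_KEEPALIVE for a streaming
// socket. How quickly a dead peer is then detected is controlled by the
// OS keep alive settings, not by Cassandra. Assumed helper, not the patch.
public class StreamingKeepAliveSketch
{
    public static void enableKeepAlive(Socket streamingSocket) throws SocketException
    {
        streamingSocket.setKeepAlive(true);
    }
}
{code}
On Linux the relevant sysctls are net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes; the default idle time before the first probe is two hours, which is the "insanely high" default period the description refers to.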



