Hi Daniel,

In my experience, when a node shows DN and then comes back up by itself, that
sounds like some sort of GC pause (especially if nodetool status, when run
from the "DN" node itself, shows it is up, assuming there isn't a spotty
network issue). Perhaps I missed this due to the length of the thread, but
have you shared info about the following?

   - cpu/memory usage of affected nodes (are all nodes affected comparably,
   or some more than others?)
   - nodetool compactionstats and tpstats output (especially as the )
   - what is your heap size set to?
   - system.log and gc.logs: for investigating node "DN" symptoms I will
   usually start by noting the timestamp of the "123.56.78.901 is now DOWN"
   entries in system.log of other nodes to tell me where to look in the
   system.log of the node in question. Then it's a question of answering
   "what was this node doing up to that point?"
   - mtime/size of files in data directory-- which files are growing in
   size?
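
For the system.log item above, here is a minimal Python sketch of pulling out the "is now DOWN" timestamps. It assumes each log line carries a timestamp like "2017-06-01 11:32:05,123" and the address immediately before "is now DOWN"; the function name and the sample line are made up for illustration, so adjust the patterns to whatever your logs actually look like.

```python
import re
from datetime import datetime

# Assumed layout: "... 2017-06-01 11:32:05,123 ... <address> is now DOWN"
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")
DOWN = re.compile(r"(\S+) is now DOWN")


def find_down_events(lines):
    """Return (timestamp, address) for every line reporting a node as DOWN."""
    events = []
    for line in lines:
        down = DOWN.search(line)
        stamp = TIMESTAMP.search(line)
        if down and stamp:
            when = datetime.strptime(stamp.group(0), "%Y-%m-%d %H:%M:%S,%f")
            # Some log lines prefix the address with "/", so strip it.
            events.append((when, down.group(1).lstrip("/")))
    return events


# Hypothetical sample line, not taken from a real cluster.
sample = [
    "INFO  [GossipStage:1] 2017-06-01 11:32:05,123 Gossiper.java:1011 "
    "- InetAddress /10.0.0.5 is now DOWN",
]
for when, address in find_down_events(sample):
    print(when, address)
```

To use it on a real log, pass an open file instead of the sample list, e.g. `find_down_events(open("system.log"))`, then look at the flagged node's own system.log and gc.log just before each timestamp.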

That will help reduce how much we need to speculate. I don't think you
should need to restart Cassandra every X days if things are optimally
configured for your read/write pattern. At least I would not want to use
something where that is the normal expected behavior (and I don't believe
Cassandra is one of those sorts of things).
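
Going back to the mtime/size item in the list above, a rough Python sketch of that check follows. The function names are my own, and the data directory path is something you pass in (commonly /var/lib/cassandra/data, but that is an assumption about your install). It only lists the most recently modified files with their sizes; running it a few times between restarts should show which files keep growing.

```python
from datetime import datetime
from pathlib import Path


def newest_files(data_dir, limit=20):
    """Return (mtime, size_bytes, path) for the most recently modified files."""
    rows = []
    for path in Path(data_dir).rglob("*"):
        try:
            if path.is_file():
                info = path.stat()
                rows.append((info.st_mtime, info.st_size, str(path)))
        except OSError:
            # A file can disappear mid-scan (for example after a compaction).
            continue
    rows.sort(reverse=True)  # newest first
    return rows[:limit]


def report(data_dir, limit=20):
    """Print one line per file: modification time, size in MiB, path."""
    for mtime, size, path in newest_files(data_dir, limit):
        stamp = datetime.fromtimestamp(mtime).strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp}  {size / 2**20:10.1f} MiB  {path}")
```

Calling `report("/var/lib/cassandra/data")` (path assumed) prints the twenty newest files; comparing two runs taken some hours apart is usually enough to spot the ones that are growing.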

On Thu, Jun 1, 2017 at 11:40 AM, daemeon reiydelle <daeme...@gmail.com>
wrote:

> Some random thoughts; I would like to thank you for giving us an
> interesting problem. Cassandra can get boring sometimes, it is too stable.
>
> - Do you have a way to monitor the network traffic to see if it is
> increasing between restarts or does it seem relatively flat?
> - What activities are happening when you observe the (increasing)
> latencies? Something must be writing to keyspaces, something I presume is
> reading. What is the workload?
> - When using SSD, there are some /devices optimizations for SSDs. I
> wonder if those were done (they will cause some IO latency, but not like
> this)
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
>
> On Thu, Jun 1, 2017 at 7:18 AM, Daniel Steuernol <dan...@sendwithus.com>
> wrote:
>
>> I am just restarting cassandra. I'm not having any disk space issues I
>> think, but we're having issues where operations have increased latency, and
>> these are fixed by a restart. It seemed like the load reported by nodetool
>> status might be helpful in understanding what is going wrong but I'm not
>> sure. Another symptom is that nodes will report as DN in nodetool status
>> and then come back up again just a minute later.
>>
>> I'm not really sure what to track to find out what exactly is going wrong
>> on the cluster, so any insight or debugging techniques would be super
>> helpful
>>
>>
>> On May 31 2017, at 5:07 pm, Anthony Grasso <anthony.gra...@gmail.com>
>> wrote:
>>
>>> Hi Daniel,
>>>
>>> When you say that the nodes have to be restarted, are you just
>>> restarting the Cassandra service or are you restarting the machine?
>>> How are you reclaiming disk space at the moment? Does disk space free up
>>> after the restart?
>>>
>>> Regarding storage on nodes, keep in mind the more data stored on a node,
>>> the longer some operations to maintain that data will take to complete. In
>>> addition, the more data that is on each node, the longer it will take to
>>> stream data to other nodes. Whether it is replacing a down node or
>>> inserting a new node, having a large amount of data on each node will mean
>>> that it takes longer for a node to join the cluster if it is streaming the
>>> data.
>>>
>>> Kind regards,
>>> Anthony
>>>
>>> On 30 May 2017 at 02:43, Daniel Steuernol <dan...@sendwithus.com> wrote:
>>>
>>> The cluster is running with RF=3; right now each node is storing about
>>> 3-4 TB of data. I'm using r4.2xlarge EC2 instances; these have 8 vCPUs, 61
>>> GB of RAM, and the disks attached for the data drive are gp2 SSD EBS
>>> volumes with 10k IOPS. I guess this brings up the question of what's a good
>>> marker to decide on whether to increase disk space vs provisioning a new
>>> node?
>>>
>>>
>>>
>>> On May 29 2017, at 9:35 am, tommaso barbugli <tbarbu...@gmail.com>
>>> wrote:
>>>
>>> Hi Daniel,
>>>
>>> This is not normal. Possibly a capacity problem. What's the RF, how much
>>> data do you store per node and what kind of servers do you use (core count,
>>> RAM, disk, ...)?
>>>
>>> Cheers,
>>> Tommaso
>>>
>>> On Mon, May 29, 2017 at 6:22 PM, Daniel Steuernol <dan...@sendwithus.com
>>> > wrote:
>>>
>>>
>>> I am running a 6 node cluster, and I have noticed that the reported load
>>> on each node rises throughout the week and grows way past the actual disk
>>> space used and available on each node. Also, eventually latency for
>>> operations suffers and the nodes have to be restarted. A couple of questions
>>> on this: is this normal? Also, does Cassandra need to be restarted every few
>>> days for best performance? Any insight on this behaviour would be helpful.
>>>
>>> Cheers,
>>> Daniel
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>>> additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>
>
>
