Re: Restarting nodes and reported load

2017-06-02 Thread Daniel Steuernol
Thanks for the info, this gives me a lot to go through, especially Al Tobey's guide. I'm running java version "1.8.0_121" and using G1GC as the GC type.
  

On Jun 1 2017, at 2:32 pm, Victor Chen <victor.h.c...@gmail.com> wrote:


  Regarding mtime, I'm just talking about using something like the following (assuming you are on linux): "find pathtoyourdatadir -mtime -1 -ls", which will find all files in your data directory last modified within the past 24h. You can compare the increase in your reported nodetool load over the past N days, and then look for files modified over that same period that could account for that size. Not really sure what sort of load, or how long that would take, on 3-4T of data though.

Regarding compactionstats and tpstats, I would just be interested in whether there are increasing "pending" tasks for either.

Did you say you observed latency issues or degraded performance, or not? What version of java/cassandra did you say you were running, and what type of GC are you using?

Regarding a node not creating a "DOWN" entry in its log: if a node experiences a sufficiently long GC pause (I'm not sure what the threshold is, maybe somebody more knowledgeable can chime in?), then even though the node itself still "thinks" it's up, other nodes will mark it as DN. Thus you wouldn't see an "is now DOWN" entry in the system.log of the gc-ing node, but you would see an "is now DOWN" entry in the system.log of the remote nodes (and a corresponding "is now UP" entry when the node comes out of its GC pause). Assuming the logs have not been rotated off, if you just grep system.log for "DOWN" on your nodes, that usually reveals a useful timestamp from which to start looking in the problematic node's system.log or gc.log.

Do you have persistent cpu/memory/disk io/disk space monitoring mechanisms? You should think about putting something in place to gather that info if you don't ... I find myself coming back to Al Tobey's tuning guide frequently, if nothing else for the tools he mentions and his notes on the java GC. I want to say a heap size of 15G sounds a little high, but I am starting to talk a bit out of my depth when it comes to java tuning. See datastax's official cassandra 2.1 jvm tuning doc and also this stackoverflow thread. Good luck!
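To make both of those checks concrete, here is a minimal sketch (the data directory and log paths are assumptions for a package install; adjust to yours):

    # Files modified in the last 3 days, with sizes (adjust -mtime to your window)
    find /var/lib/cassandra/data -mtime -3 -type f -ls

    # Total size of recently modified files, to compare against the growth in nodetool load
    find /var/lib/cassandra/data -mtime -3 -type f -printf '%s\n' \
        | awk '{s += $1} END {print s/1024/1024/1024 " GiB"}'

    # Timestamps of DOWN/UP transitions as seen by the other nodes
    grep -E 'is now (DOWN|UP)' /var/log/cassandra/system.log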
  


Re: Restarting nodes and reported load

2017-06-01 Thread Daniel Steuernol
I'll try to capture answers to the questions in the last 2 messages.

Network traffic looks pretty steady overall, about 0.5 up to 2 megabytes/s. The cluster handles about 100k to 500k operations per minute. Right now the read/write split is about 50/50; eventually though it will probably be 70% writes and 30% reads.

There do seem to be some nodes that are affected more frequently than others. I haven't captured cpu/memory stats versus other nodes at the time the problem is occurring; I will do that next time it happens. I will also look at compaction stats and tpstats. What are some things that I should be looking for in tpstats in particular? I'm not exactly sure how to read the output from that command.

The heap size is set to 15GB on each node, and each node has 60GB of RAM available.

In regards to the "... is now DOWN" messages: I'm unable to find one in the system.log for a time when I know a node was having problems. I've built a system that polls nodetool status and parses the output, and if it sees a node reporting as DN it sends a message to a slack channel. Is it possible for a node to report as DN but not have the message show up in the log? The system polling nodetool status is not on the node that was reported as DN.

I'm a bit unclear about the last point about mtime/size of files and how to check; can you provide more information there?

Thanks for all the help, I really appreciate it.
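A minimal sketch of a poller along the lines described above (the Slack webhook URL is a placeholder, and this assumes nodetool is on the PATH of the polling host):

    #!/bin/sh
    # Watch for DN nodes and post to Slack; run from cron on a host outside the cluster.
    WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
    down=$(nodetool status | awk '$1 == "DN" {print $2}')
    if [ -n "$down" ]; then
      curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"Cassandra node(s) reporting DN: $down\"}" \
        "$WEBHOOK"
    fi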
  

On Jun 1 2017, at 10:33 am, Victor Chen <victor.h.c...@gmail.com> wrote:


  Hi Daniel,

In my experience, when a node shows DN and then comes back up by itself, that sounds like some sort of GC pause (especially if nodetool status, when run from the "DN" node itself, shows it is up -- assuming there isn't a spotty network issue). Perhaps I missed this info due to the length of the thread, but have you shared info about the following?

- cpu/memory usage of affected nodes (are all nodes affected comparably, or some more than others?)
- nodetool compactionstats and tpstats output
- what is your heap size set to?
- system.log and gc.logs: for investigating node "DN" symptoms I will usually start by noting the timestamp of the "123.56.78.901 is now DOWN" entries in the system.log of other nodes to tell me where to look in the system.log of the node in question. Then it's a question of answering "what was this node doing up to that point?"
- mtime/size of files in the data directory -- which files are growing in size? That will help reduce how much we need to speculate.

I don't think you should need to restart cassandra every X days if things are optimally configured for your read/write pattern -- at least I would not want to use something where that is the normal expected behavior (and I don't believe cassandra is one of those sorts of things).

On Thu, Jun 1, 2017 at 11:40 AM, daemeon reiydelle <daeme...@gmail.com> wrote:

Some random thoughts; I would like to thank you for giving us an interesting problem. Cassandra can get boring sometimes, it is too stable.

- Do you have a way to monitor the network traffic, to see whether it is increasing between restarts or seems relatively flat?
- What activities are happening when you observe the (increasing) latencies? Something must be writing to keyspaces, something I presume is reading. What is the workload?
- When using SSDs, there are some device optimizations for SSDs. I wonder if those were done (they will cause some IO latency, but not like this).

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872
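On Victor's tpstats point above, a quick way to surface only the thread pools with work backing up (a sketch; the column positions assume the stock tpstats layout, where Pending is the third column and Blocked the fifth):

    # Print the header plus any thread pool with nonzero Pending or Blocked tasks
    nodetool tpstats | awk 'NR == 1 || $3 + 0 > 0 || $5 + 0 > 0'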

Re: Restarting nodes and reported load

2017-06-01 Thread Daniel Steuernol
I am just restarting cassandra. I don't think I'm having any disk space issues, but we're having issues where operations have increased latency, and these are fixed by a restart. It seemed like the load reported by nodetool status might be helpful in understanding what is going wrong, but I'm not sure. Another symptom is that nodes will report as DN in nodetool status and then come back up again just a minute later.

I'm not really sure what to track to find out what exactly is going wrong on the cluster, so any insight or debugging techniques would be super helpful.
  

On May 31 2017, at 5:07 pm, Anthony Grasso <anthony.gra...@gmail.com> wrote:


  Hi Daniel,

When you say that the nodes have to be restarted, are you just restarting the Cassandra service or are you restarting the machine?

How are you reclaiming disk space at the moment? Does disk space free up after the restart?

Regarding storage on nodes, keep in mind that the more data stored on a node, the longer some operations to maintain that data will take to complete. In addition, the more data that is on each node, the longer it will take to stream data to other nodes. Whether it is replacing a down node or inserting a new node, having a large amount of data on each node will mean that it takes longer for a node to join the cluster if it is streaming the data.

Kind regards,
Anthony
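Since the reported load and the actual disk usage seem to have diverged here, a side-by-side check is a useful first step (a sketch, assuming the default data directory):

    # Load as Cassandra reports it (live sstable bytes, snapshots excluded)
    nodetool status

    # What is actually on disk, snapshots and all
    du -sh /var/lib/cassandra/data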
  


Re: Restarting nodes and reported load

2017-05-30 Thread Daniel Steuernol
My question is about cassandra. Ultimately I'm trying to figure out why our cluster's performance degrades approximately every 6 days. I noticed that the load as reported by nodetool status was very high, but that might be unrelated to the problem. A restart solves the performance problem.

I've attached a latency graph for inserts into the cluster; as you can see, over the weekend there was a massive latency spike, and it was fixed by a restart of all the nodes.
  

On May 30 2017, at 2:18 pm, Jonathan Haddad <j...@jonhaddad.com> wrote:


  This isn't an HDFS mailing list.

On Tue, May 30, 2017 at 2:14 PM daemeon reiydelle <daeme...@gmail.com> wrote:

No, 3tb is small. 30-50tb of HDFS space is typical these days per HDFS node. Depends somewhat on whether there is a mix of more and less frequently accessed data. But even storing only hot data, I never saw anything less than 20tb of HDFS per node.

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Tue, May 30, 2017 at 2:00 PM, tommaso barbugli <tbarbu...@gmail.com> wrote:

Am I the only one thinking 3TB is way too much data for a single node on a VM?
  

Re: Restarting nodes and reported load

2017-05-30 Thread Daniel Steuernol
That does sound like what's happening. Did performance degrade as the reported load increased?
  

On May 30 2017, at 1:52 pm, daemeon reiydelle <daeme...@gmail.com> wrote:


  OK, thanks. So there was a bug in a prior version of C*; the symptoms were: nodetool would show increasing load utilization over time. Stopping and restarting C* nodes would reset the storage back to what one would expect on that node for a while, then it would creep upwards again until the node(s) were restarted, etc. FYI it ONLY occurred on an in-use system.

I know (double checked) that the problem was fixed a while back. Wondering if it resurfaced?

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

Re: Restarting nodes and reported load

2017-05-30 Thread Daniel Steuernol
I don't believe incremental repair is enabled; I have never enabled it on the cluster, and unless it's the default, it is off. Also, I don't see a setting for it in cassandra.yaml.
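For what it's worth, incremental repair has no cassandra.yaml switch; it is chosen per repair invocation via nodetool repair flags. Incremental backup, on the other hand, is a yaml setting. A quick check of the yaml side (the config path is an assumption for a package install):

    # Incremental backup (hard links of flushed sstables) and snapshot behavior
    # are yaml settings:
    grep -E 'incremental_backups|auto_snapshot|snapshot_before_compaction' \
        /etc/cassandra/cassandra.yaml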
  

On May 30 2017, at 1:10 pm, daemeon reiydelle <daeme...@gmail.com> wrote:


  Unless there is a bug, snapshots are excluded (they are not HDFS anyway!) from nodetool status. Out of curiosity, is incremental repair enabled? This is almost certainly a rat hole, but there was an issue a few releases back where load would only increase until the node was restarted. It had been fixed ages ago, but I'm wondering what happens if you restart a node, IF you have incremental enabled.

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872
  


Re: Restarting nodes and reported load

2017-05-30 Thread Daniel Steuernol
Incremental backup is set to false in the config file, and I have set snapshot_before_compaction and auto_snapshot to false as well. I ran nodetool clearsnapshot, but before doing that I ran nodetool listsnapshots and it listed a bunch of snapshots. I would have expected that to be empty, because I've disabled auto_snapshot.
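One thing worth knowing here: auto_snapshot only covers truncates and drops, and sequential repairs also create snapshots, so listsnapshots can be non-empty even with those settings off. To see how much space snapshots hold and to reclaim it (the data directory path is an assumption):

    # Snapshot disk usage per table directory
    find /var/lib/cassandra/data -type d -name snapshots -exec du -sh {} +

    # List, then remove, all snapshots on this node
    nodetool listsnapshots
    nodetool clearsnapshot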
  

On May 30 2017, at 12:15 pm, Varun Gupta <var...@uber.com> wrote:


  Can you please check if you have incremental backup enabled and whether snapshots are occupying the space? Run the nodetool clearsnapshot command.
  


Re: Restarting nodes and reported load

2017-05-30 Thread Daniel Steuernol
It's 3-4TB per node, and by "load rises" I'm talking about the load as reported by nodetool status.
  

On May 30 2017, at 10:25 am, daemeon reiydelle <daeme...@gmail.com> wrote:


  When you say "the load rises ...", could you clarify what you mean by "load"? That term has a specific Linux meaning, and another in e.g. Cloudera Manager. But in neither case would that be relevant to transient or persisted disk. Am I missing something?

On Tue, May 30, 2017 at 10:18 AM, tommaso barbugli <tbarbu...@gmail.com> wrote:

3-4 TB per node or in total?
  

Re: Restarting nodes and reported load

2017-05-30 Thread Daniel Steuernol
I should also mention that I am running cassandra 3.10 on the cluster
  

Re: Restarting nodes and reported load

2017-05-29 Thread Daniel Steuernol
The cluster is running with RF=3; right now each node is storing about 3-4 TB of data. I'm using r4.2xlarge EC2 instances; these have 8 vCPUs, 61 GB of RAM, and the disks attached for the data drive are gp2 SSD EBS volumes with 10k IOPS. I guess this brings up the question of what's a good marker for deciding whether to increase disk space vs provisioning a new node?
  

On May 29 2017, at 9:35 am, tommaso barbugli <tbarbu...@gmail.com> wrote:


  Hi Daniel,

This is not normal. Possibly a capacity problem. What's the RF, how much data do you store per node, and what kind of servers do you use (core count, RAM, disk, ...)?

Cheers,
Tommaso


Restarting nodes and reported load

2017-05-29 Thread Daniel Steuernol
I am running a 6 node cluster, and I have noticed that the reported load on each node rises throughout the week and grows way past the actual disk space used and available on each node. Also, eventually latency for operations suffers and the nodes have to be restarted. A couple of questions on this: is this normal? Does cassandra need to be restarted every few days for best performance? Any insight on this behaviour would be helpful.

Cheers,
Daniel


Re: Nodes stopping

2017-05-11 Thread Daniel Steuernol
I'm switching the instances to machines with 61G of RAM; in this case would you still recommend using 8G of heap space? Here is a gist of my heap settings from jvm.options: https://gist.github.com/dlsteuer/40e80280029897e6bb5fd12f2a86cbbe
  

On May 11 2017, at 3:08 pm, Alain RODRIGUEZ <arodr...@gmail.com> wrote:


  For some context, I'm trying to get regular repairs going but am having issues with it.

You're not the only one; repairs are a real concern for many people. For what it is worth, my team is actively working on this project initiated at Spotify: https://github.com/thelastpickle/cassandra-reaper.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2017-05-11 23:04 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

Hi Daniel,

Could you paste the exact GC options in use? Also, 30 GB is not much. I would not use more than 8 GB for the JVM, and probably CMS in those conditions, for what it is worth. The thing is, if memtables, bloom filters, caches, indexes, etc. are off heap, then you probably ran out of native memory. In any case it is good to leave some space for the page cache.

As a reminder, you can try new GC options on a canary node and see how it goes.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com
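To make the 8 GB / CMS suggestion concrete, a sketch of the relevant jvm.options lines (the flag set is the stock CMS configuration shipped with Cassandra 3.x's jvm.options; treat this as a starting point for a canary node, not a tuned config):

    # Heap: fixed 8 GB, per the advice above
    -Xms8G
    -Xmx8G

    # CMS collector (the default GC settings in the 3.x jvm.options)
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+CMSParallelRemarkEnabled
    -XX:SurvivorRatio=8
    -XX:MaxTenuringThreshold=1
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly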
  


Re: Nodes stopping

2017-05-11 Thread Daniel Steuernol
Thank you. It's an out-of-memory crash according to dmesg. I have the heap size set to 15G in jvm.options for cassandra, and there is 30G on the machine.
  

On May 11 2017, at 2:22 pm, Cogumelos Maravilha <cogumelosmaravi...@sapo.pt> wrote:


  
  

  
  
Have a look at dmesg. It has already happened to me with i-type instances at AWS.
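For anyone hitting this later, the kernel OOM killer leaves clear tracks in dmesg; a quick check (a sketch, the exact message wording varies by kernel):

    # Look for OOM-killer activity with human-readable timestamps
    dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'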


Re: Nodes stopping

2017-05-11 Thread Daniel Steuernol
I had 2 nodes go down today; here are the ERRORs from the system log on both nodes: https://gist.github.com/dlsteuer/28c610bc733a2bff22c8d3953ef8c218

For some context, I'm trying to get regular repairs going but am having issues with it.
  

On May 11 2017, at 2:10 pm, Cogumelos Maravilha <cogumelosmaravi...@sapo.pt> wrote:


  
  

  
  
Can you grep for ERROR in system.log?
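That is, something along these lines (the log path assumes a package install):

    # Most recent errors, with a line of context before each
    grep -n -B1 'ERROR' /var/log/cassandra/system.log | tail -40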




Re: Nodes stopping

2017-05-11 Thread Daniel Steuernol
There is nothing in the system log about it being drained or shut down, and I'm not sure how else it would be pre-empted. No one else on the team is on the servers, and I haven't been shutting them down. There is also no java memory dump on the server. It appears that the process just died.
  

On May 11 2017, at 1:36 pm, Varun Gupta <var...@uber.com> wrote:


  What do you mean by "no obvious error in the logs"? Do you see that the node was drained or shut down? Are you sure no other process is calling nodetool drain or shutdown, or pre-empting the cassandra process?




Nodes stopping

2017-05-11 Thread Daniel Steuernol
I have a 6 node cassandra cluster running, and frequently a node will go down with no obvious error in the logs. This is starting to happen quite often, almost daily now. Any suggestions on how to track down what is causing the node to stop?
