Re: Really odd issue (AWS related?)

2013-04-30 Thread Ben Chobot
We've also had issues with ephemeral drives in a single AZ in us-east-1, so 
much so that we no longer use that AZ. Though our issues tended to be obvious 
from instance boot - they wouldn't suddenly degrade.

On Apr 28, 2013, at 2:27 PM, Alex Major wrote:

 Hi Mike,
 
 We had issues with the ephemeral drives when we first got started, although 
 we never got to the bottom of it, so I can't help much with troubleshooting, 
 unfortunately. Contrary to a lot of the comments on the mailing list, we've 
 actually had a lot more success with EBS drives (PIOPS!). I'd definitely 
 suggest striping 4 EBS drives (RAID 0) and using PIOPS.
 
 You could be having a noisy neighbour problem; I don't believe that m1.large 
 or m1.xlarge instances get all of the actual hardware, and virtualisation on 
 EC2 still sucks at isolating resources.
 
 We've also had more success with Ubuntu on EC2; not so much with our 
 Cassandra nodes, but some of our other services didn't run as well on Amazon 
 Linux AMIs.
 
 Alex
 
 
 
 On Sun, Apr 28, 2013 at 7:12 PM, Michael Theroux mthero...@yahoo.com wrote:
 I forgot to mention,
 
 When things go really bad, I'm seeing I/O waits in the 80-95% range.  I 
 restarted Cassandra once when a node was in this situation, and it took 45 
 minutes to start (primarily reading SSTables).  Typically, a node would start 
 in about 5 minutes.
 
 Thanks,
 -Mike
  
 On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:
 
 Hello,
 
 We've done some additional monitoring, and I think we have more information. 
 We've been collecting vmstat information every minute, attempting to catch 
 a node with issues.
 
 So, it appears that the Cassandra node runs fine.  Then suddenly, without 
 any correlation to any event that I can identify, the I/O wait time goes way 
 up and stays up indefinitely.  Even non-Cassandra I/O activities (such as 
 snapshots and backups) start causing large I/O wait times when they 
 typically would not.  Before an issue, we would typically see I/O wait 
 times of 3-4% with very few processes blocked on I/O.  Once this issue 
 manifests itself, I/O wait times for the same activities jump to 30-40% with 
 many blocked processes.  The I/O wait times do go back down when there is 
 literally no activity.
 
 -  Updating the node to the latest Amazon Linux patches and rebooting the 
 instance doesn't correct the issue.
 -  Backing up the node, and replacing the instance does correct the issue.  
 I/O wait times return to normal.
 
 One relatively recent change we've made is that we upgraded to m1.xlarge 
 instances, which have 4 ephemeral drives available.  We create a logical 
 volume from the 4 drives with the idea that we should be able to get 
 increased I/O throughput.  When we ran m1.large instances, we had the same 
 setup, although it was only using 2 ephemeral drives.  We chose to use LVM 
 vs. mdadm because we were having issues getting mdadm to create the RAID 
 volume reliably on restart (and research showed that this was a common 
 problem).  LVM just worked (and had worked for months before this upgrade).
 
 For reference, this is the script we used to create the logical volume:
 
 vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
 lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
 blockdev --setra 65536 /dev/mnt_vg/mnt_lv
 sleep 2
 mkfs.xfs /dev/mnt_vg/mnt_lv
 sleep 3
 mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
 sleep 3
 
 Another tidbit... thus far (and this may be only a coincidence), we've only 
 had to replace DB nodes within a single availability zone within us-east.  
 Other availability zones, in the same region, have yet to show an issue.
 
 It looks like I'm going to need to replace a third DB node today.  Any 
 advice would be appreciated.
 
 Thanks,
 -Mike
 
 
 On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
 
 Thanks.
 
 We weren't monitoring this value when the issue occurred, and this 
 particular issue has not appeared for a couple of days (knock on wood).  
 Will keep an eye out though,
 
 -Mike
 
 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
 
 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to 
 track close to the minute, so I can see the CPU load go to 0... then jump 
 at about the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one 
 node, our 

Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
Hello,

We've done some additional monitoring, and I think we have more information.  
We've been collecting vmstat information every minute, attempting to catch a 
node with issues.
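
For reference, collecting this is just a once-a-minute crontab entry along 
these lines (a sketch; the log path is only an example):

# take a timestamped vmstat sample every minute
* * * * * (date; vmstat 1 5) >> /var/log/vmstat-samples.log 2>&1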

So, it appears that the Cassandra node runs fine.  Then suddenly, without any 
correlation to any event that I can identify, the I/O wait time goes way up 
and stays up indefinitely.  Even non-Cassandra I/O activities (such as 
snapshots and backups) start causing large I/O wait times when they typically 
would not.  Before an issue, we would typically see I/O wait times of 3-4% 
with very few processes blocked on I/O.  Once this issue manifests itself, I/O 
wait times for the same activities jump to 30-40% with many blocked processes.  
The I/O wait times do go back down when there is literally no activity.

-  Updating the node to the latest Amazon Linux patches and rebooting the 
instance doesn't correct the issue.
-  Backing up the node, and replacing the instance does correct the issue.  I/O 
wait times return to normal.
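
One thing worth watching while a node is in the bad state (just a sketch of 
the idea) is per-device utilization, to see whether the wait is spread across 
all four ephemeral drives or pinned to one of them:

# extended per-device stats, 5-second samples; compare %util and await
# across sdb-sde and the device-mapper volume backing /data
iostat -x 5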

One relatively recent change we've made is that we upgraded to m1.xlarge 
instances, which have 4 ephemeral drives available.  We create a logical volume 
from the 4 drives with the idea that we should be able to get increased I/O 
throughput.  When we ran m1.large instances, we had the same setup, although it 
was only using 2 ephemeral drives.  We chose to use LVM vs. mdadm because we 
were having issues getting mdadm to create the RAID volume reliably on restart 
(and research showed that this was a common problem).  LVM just worked (and had 
worked for months before this upgrade).

For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3
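
While a node is degraded, it may also be worth confirming that the striping 
and readahead are still what this script set up (a quick sanity check, nothing 
more):

# show the LV's segments (stripe count) and the current readahead setting
lvs --segments mnt_vg
blockdev --getra /dev/mnt_vg/mnt_lv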

Another tidbit... thus far (and this may be only a coincidence), we've only had 
to replace DB nodes within a single availability zone within us-east.  Other 
availability zones, in the same region, have yet to show an issue.

It looks like I'm going to need to replace a third DB node today.  Any advice 
would be appreciated.

Thanks,
-Mike


On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:

 Thanks.
 
 We weren't monitoring this value when the issue occurred, and this particular 
 issue has not appeared for a couple of days (knock on wood).  Will keep an 
 eye out though,
 
 -Mike
 
 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
 
 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at about 
 the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one 
 node, our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would 
 see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 
 



Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
I forgot to mention,

When things go really bad, I'm seeing I/O waits in the 80-95% range.  I 
restarted Cassandra once when a node was in this situation, and it took 45 
minutes to start (primarily reading SSTables).  Typically, a node would start 
in about 5 minutes.

Thanks,
-Mike
 
On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

 Hello,
 
 We've done some additional monitoring, and I think we have more information.  
 We've been collecting vmstat information every minute, attempting to catch a 
 node with issues.
 
 So, it appears that the Cassandra node runs fine.  Then suddenly, without 
 any correlation to any event that I can identify, the I/O wait time goes way 
 up and stays up indefinitely.  Even non-Cassandra I/O activities (such as 
 snapshots and backups) start causing large I/O wait times when they typically 
 would not.  Before an issue, we would typically see I/O wait times of 3-4% 
 with very few processes blocked on I/O.  Once this issue manifests itself, 
 I/O wait times for the same activities jump to 30-40% with many blocked 
 processes.  The I/O wait times do go back down when there is literally no 
 activity.
 
 -  Updating the node to the latest Amazon Linux patches and rebooting the 
 instance doesn't correct the issue.
 -  Backing up the node, and replacing the instance does correct the issue.  
 I/O wait times return to normal.
 
 One relatively recent change we've made is that we upgraded to m1.xlarge 
 instances, which have 4 ephemeral drives available.  We create a logical 
 volume from the 4 drives with the idea that we should be able to get 
 increased I/O throughput.  When we ran m1.large instances, we had the same 
 setup, although it was only using 2 ephemeral drives.  We chose to use LVM 
 vs. mdadm because we were having issues getting mdadm to create the RAID 
 volume reliably on restart (and research showed that this was a common 
 problem).  LVM just worked (and had worked for months before this upgrade).
 
 For reference, this is the script we used to create the logical volume:
 
 vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
 lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
 blockdev --setra 65536 /dev/mnt_vg/mnt_lv
 sleep 2
 mkfs.xfs /dev/mnt_vg/mnt_lv
 sleep 3
 mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
 sleep 3
 
 Another tidbit... thus far (and this may be only a coincidence), we've only 
 had to replace DB nodes within a single availability zone within us-east.  
 Other availability zones, in the same region, have yet to show an issue.
 
 It looks like I'm going to need to replace a third DB node today.  Any advice 
 would be appreciated.
 
 Thanks,
 -Mike
 
 
 On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
 
 Thanks.
 
 We weren't monitoring this value when the issue occurred, and this 
 particular issue has not appeared for a couple of days (knock on wood).  
 Will keep an eye out though,
 
 -Mike
 
 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
 
 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at 
 about the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one 
 node, our cassandra client appears to see this, reporting errors.  We 
 use LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients 
 would see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 
 
 



Re: Really odd issue (AWS related?)

2013-04-28 Thread Alex Major
Hi Mike,

We had issues with the ephemeral drives when we first got started, although
we never got to the bottom of it, so I can't help much with troubleshooting,
unfortunately. Contrary to a lot of the comments on the mailing list, we've
actually had a lot more success with EBS drives (PIOPS!). I'd definitely
suggest striping 4 EBS drives (RAID 0) and using PIOPS.
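
Roughly what I mean, once you've created and attached 4 provisioned-IOPS
volumes (device names below are just placeholders, and you could equally
stripe them with LVM the way you already do):

# RAID 0 across the four EBS volumes
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mkfs.xfs /dev/md0
mkdir -p /data && mount -t xfs -o noatime /dev/md0 /data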

You could be having a noisy neighbour problem; I don't believe that
m1.large or m1.xlarge instances get all of the actual hardware, and
virtualisation on EC2 still sucks at isolating resources.

We've also had more success with Ubuntu on EC2; not so much with our
Cassandra nodes, but some of our other services didn't run as well on Amazon
Linux AMIs.

Alex



On Sun, Apr 28, 2013 at 7:12 PM, Michael Theroux mthero...@yahoo.com wrote:

 I forgot to mention,

 When things go really bad, I'm seeing I/O waits in the 80-95% range.  I
 restarted Cassandra once when a node was in this situation, and it took 45
 minutes to start (primarily reading SSTables).  Typically, a node would
 start in about 5 minutes.

 Thanks,
 -Mike

 On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

 Hello,

 We've done some additional monitoring, and I think we have more
 information.  We've been collecting vmstat information every minute,
 attempting to catch a node with issues.

 So, it appears that the Cassandra node runs fine.  Then suddenly, without
 any correlation to any event that I can identify, the I/O wait time goes
 way up and stays up indefinitely.  Even non-Cassandra I/O activities
 (such as snapshots and backups) start causing large I/O wait times when
 they typically would not.  Before an issue, we would typically see I/O
 wait times of 3-4% with very few processes blocked on I/O.  Once this issue
 manifests itself, I/O wait times for the same activities jump to 30-40%
 with many blocked processes.  The I/O wait times do go back down when there
 is literally no activity.

 -  Updating the node to the latest Amazon Linux patches and rebooting the
 instance doesn't correct the issue.
 -  Backing up the node, and replacing the instance does correct the issue.
  I/O wait times return to normal.

 One relatively recent change we've made is that we upgraded to m1.xlarge
 instances, which have 4 ephemeral drives available.  We create a logical
 volume from the 4 drives with the idea that we should be able to get
 increased I/O throughput.  When we ran m1.large instances, we had the same
 setup, although it was only using 2 ephemeral drives.  We chose to use LVM
 vs. mdadm because we were having issues getting mdadm to create the RAID
 volume reliably on restart (and research showed that this was a common
 problem).  LVM just worked (and had worked for months before this upgrade).

 For reference, this is the script we used to create the logical volume:

 vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
 lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
 blockdev --setra 65536 /dev/mnt_vg/mnt_lv
 sleep 2
 mkfs.xfs /dev/mnt_vg/mnt_lv
 sleep 3
 mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
 sleep 3

 Another tidbit... thus far (and this may be only a coincidence), we've only
 had to replace DB nodes within a single availability zone within us-east.
 Other availability zones, in the same region, have yet to show an issue.

 It looks like I'm going to need to replace a third DB node today.  Any
 advice would be appreciated.

 Thanks,
 -Mike


 On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:

 Thanks.

 We weren't monitoring this value when the issue occurred, and this
 particular issue has not appeared for a couple of days (knock on wood).
  Will keep an eye out though,

 -Mike

 On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:

 top command? st : time stolen from this vm by the hypervisor

 jason


 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:

 Sorry, Not sure what CPU steal is :)

 I have AWS console with detailed monitoring enabled... things seem to
 track close to the minute, so I can see the CPU load go to 0... then jump
 at about the minute Cassandra reports the dropped messages,

 -Mike

 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:

 The messages appear right after the node wakes up.

 Are you tracking CPU steal ?

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com
 wrote:

 Another related question.  Once we see messages being dropped on one
 node, our cassandra client appears to see this, reporting errors.  We use
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see
 an error?  If only one node reports an error, shouldn't the consistency
 level prevent the client from seeing an issue?


 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to 

Re: Really odd issue (AWS related?)

2013-04-26 Thread Jason Wee
top command? st : time stolen from this vm by the hypervisor
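
e.g. any of these will show it:

# steal shows up as %st in top's Cpu(s) line, the st column in vmstat,
# and %steal in mpstat
vmstat 5
mpstat 5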

jason


On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:

 Sorry, Not sure what CPU steal is :)

 I have AWS console with detailed monitoring enabled... things seem to
 track close to the minute, so I can see the CPU load go to 0... then jump
 at about the minute Cassandra reports the dropped messages,

 -Mike

 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:

 The messages appear right after the node wakes up.

 Are you tracking CPU steal ?

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com
 wrote:

 Another related question.  Once we see messages being dropped on one node,
 our cassandra client appears to see this, reporting errors.  We use
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see
 an error?  If only one node reports an error, shouldn't the consistency
 level prevent the client from seeing an issue?


 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.

 =Rob






Re: Really odd issue (AWS related?)

2013-04-26 Thread Michael Theroux
Thanks.

We weren't monitoring this value when the issue occurred, and this particular 
issue has not appeared for a couple of days (knock on wood).  Will keep an eye 
out though,

-Mike

On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:

 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at about 
 the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one node, 
 our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would 
 see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 



Re: Really odd issue (AWS related?)

2013-04-25 Thread aaron morton
 The messages appear right after the node wakes up.
Are you tracking CPU steal ? 

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote:
 Another related question.  Once we see messages being dropped on one node, 
 our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see 
 an error?  If only one node reports an error, shouldn't the consistency 
 level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob



Re: Really odd issue (AWS related?)

2013-04-25 Thread Michael Theroux
Sorry, not sure what CPU steal is :)

I have the AWS console with detailed monitoring enabled... things seem to track 
close to the minute, so I can see the CPU load go to 0... then jump at about 
the minute Cassandra reports the dropped messages.
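
(If it's useful to pull the same per-minute data outside the console, I think 
the command-line equivalent is something like this; the instance ID and time 
window below are just placeholders:)

aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-xxxxxxxx \
    --start-time 2013-04-25T20:00:00Z --end-time 2013-04-25T22:00:00Z \
    --period 60 --statistics Average Maximum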

-Mike

On Apr 25, 2013, at 9:50 PM, aaron morton wrote:

 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote:
 Another related question.  Once we see messages being dropped on one node, 
 our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see 
 an error?  If only one node reports an error, shouldn't the consistency 
 level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 



Really odd issue (AWS related?)

2013-04-24 Thread Michael Theroux
Hello,

Since Sunday, we've been experiencing a really odd issue in our Cassandra 
cluster.  We recently started receiving errors that messages are being dropped. 
 But here is the odd part...

When looking in the AWS console, instead of seeing elevated statistics during 
this time, we actually see all statistics suddenly drop right before these 
messages appear.  CPU, I/O, and network go way down.  In fact, in one case, 
they went to 0 for about 5 minutes, to the point that other Cassandra nodes saw 
the node in question as down.  The messages appear right after the node wakes 
up.
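
(For what it's worth, the way we see the drops on the node itself is roughly 
this; the log path is wherever your packaging puts system.log:)

# thread pool stats (dropped message counters should show up here too)
nodetool -h localhost tpstats
# the log lines that flag the drops
grep -i dropped /var/log/cassandra/system.log | tail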

We've had this happen on three different nodes on three different days since Sunday.

Other facts:

- We recently upgraded from m1.large to m1.xlarge instances about two weeks ago.
- We are running Cassandra 1.1.9
- We've been doing some memory tuning, although I have seen this happen on 
untuned nodes.

Has anyone seen anything like this before?

Another related question.  Once we see messages being dropped on one node, our 
cassandra client appears to see this, reporting errors.  We use LOCAL_QUORUM 
with a RF of 3 on all queries.  Any idea why clients would see an error?  If 
only one node reports an error, shouldn't the consistency level prevent the 
client from seeing an issue?

Thanks for your help,
-Mike

Re: Really odd issue (AWS related?)

2013-04-24 Thread Robert Coli
On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote:
 Another related question.  Once we see messages being dropped on one node, 
 our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see 
 an error?  If only one node reports an error, shouldn't the consistency level 
 prevent the client from seeing an issue?

If the client is talking to a broken/degraded coordinator node, RF/CL
are unable to protect it from RPCTimeout. If it is unable to
coordinate the request in a timely fashion, your clients will get
errors.

=Rob