Read Latency Doubles After Shrinking Cluster and Never Recovers

2018-06-11 Thread Fred Habash
I have hit dead ends everywhere I turned on this issue. We had a 15-node cluster that was doing 35 ms all along for years. At some point, we decided to shrink it to 13 nodes. Read latency rose to near 70 ms. Shortly after, we decided this was not acceptable, so we added the three nodes back

Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-23 Thread Fred Habash
On Feb 21, 2018 1:29 PM, "Fred Habash" <fmhab...@gmail.com> wrote:
> One node at a time
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com> wrote:
>> What is your replication factor?
>> Single datacenter,

Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Fred Habash
One node at a time

On Feb 21, 2018 10:23 AM, "Carl Mueller" wrote:
> What is your replication factor?
> Single datacenter, three availability zones, is that right?
> You removed one node at a time or three at once?
>
> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash

Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Fred Habash
RF of 3 with three racks (AZs) in a single region.

On Feb 21, 2018 10:23 AM, "Carl Mueller" wrote:
> What is your replication factor?
> Single datacenter, three availability zones, is that right?
> You removed one node at a time or three at once?
>
> On Wed, Feb 21,

Re: Cassandra upgrade from 2.2.8 to 3.10

2018-03-28 Thread Fred Habash
d third DC with
> version 3.10 installed, will nodes in DC3 join the cluster with data
> without issues?
>
> Thanks/Asad

--
Thank you ... Fred Habash, Database Solutions Architect (Oracle OCP 8i,9i,10g,11g)

Long-running Job to Extract Data Times Out After ~ 60 Hours

2018-10-03 Thread Fred Habash
We tried to extract a large volume of data from a 42-node cluster about three times and in all attempts, client sessions abort after ~ 60 hours. Here's what we see in the client logs. I have reviewed the multiple timeout settings in C*, but none seemed to relate to the 60-hour limit. What is

Timestamp of Last Repair

2018-12-11 Thread Fred Habash
We are trying to detect a scenario where some of our smaller clusters go unrepaired for extended periods of time, mostly due to defects in deployment pipelines or human errors. We would like to automate a check for clusters where nodes go unrepaired for more than 7 days, to shoot out an

Re: Timestamp of Last Repair

2018-12-18 Thread Fred Habash
4119c9d8]
> new session:
> RepairSession.java (line 282) [repair #2e7009b0-c03d-11e4-9012-99a64119c9d8]
> session completed successfully
>
> 2. In table you can check: started_at and finished_at field in
> system_distributed.parent_repair_history
>
> regards,
> Laxmikant
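A minimal sketch of the 7-day staleness check discussed in this thread, assuming the rows have already been read from system_distributed.parent_repair_history as (keyspace_name, finished_at) pairs. The function name stale_keyspaces, the grouping by keyspace, and the sample data are all hypothetical; how the rows are fetched and how the alert is sent are left out.

```python
from datetime import datetime, timedelta, timezone


def stale_keyspaces(rows, now, max_age=timedelta(days=7)):
    """Return keyspaces whose latest finished repair is older than max_age.

    rows: iterable of (keyspace_name, finished_at) pairs; finished_at is
    None for a repair session that never recorded a finish time.
    """
    latest = {}
    for keyspace, finished_at in rows:
        # Remember every keyspace seen, even if no repair ever finished.
        latest.setdefault(keyspace, None)
        if finished_at is not None and (
            latest[keyspace] is None or finished_at > latest[keyspace]
        ):
            latest[keyspace] = finished_at
    # A keyspace is stale if it has no finished repair or the last one is too old.
    return sorted(
        ks for ks, ts in latest.items() if ts is None or now - ts > max_age
    )


# Hypothetical sample data, not taken from any real cluster.
now = datetime(2018, 12, 18, tzinfo=timezone.utc)
sample_rows = [
    ("orders", datetime(2018, 12, 15, tzinfo=timezone.utc)),
    ("orders", datetime(2018, 11, 30, tzinfo=timezone.utc)),
    ("audit", datetime(2018, 12, 1, tzinfo=timezone.utc)),
    ("events", None),
]
print(stale_keyspaces(sample_rows, now))
```

Treating a missing finished_at as stale is a design choice here: a repair that started but never completed should probably raise the same alert as one that never ran.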

Re: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

2019-05-01 Thread Fred Habash
Thank you. Range movement is one reason this is enforced when adding a new node. But what about forcing a consistent bootstrap, i.e. bootstrapping from the primary owner of the range and not a secondary replica? How is consistent bootstrap enforced when replacing a dead node? - Thank you.

Re: Bootstrapping to Replace a Dead Node vs. Adding a New Node: Consistency Guarantees

2019-05-01 Thread Fred Habash
I probably should've been clearer in my inquiry ... I'm investigating a scenario where our diagnostic data is telling us that a small portion of application data has been lost. I mean, getsstables for the keys returns zero on all cluster nodes. The Last Pickle article below (which includes a case

Re: CL=LQ, RF=3: Can a Write be Lost If Two Nodes ACK'ing it Die

2019-05-03 Thread Fred Habash
Thank you all. So, please, bear with me for a second. I'm trying to figure out how data can be totally lost under the above circumstances when nodes die in two out of three racks. You stated 'the replica may or may not have made its way to the third node'. Why 'may not'? This is what I
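For reference, the arithmetic behind the 'may or may not' question can be sketched as follows. With RF=3, a quorum is 2 replicas, so a LOCAL_QUORUM write is reported successful once 2 replicas acknowledge it; the write is still sent to the third replica, but the coordinator does not wait for it. The helper names below are hypothetical and only illustrate the counting.

```python
def quorum(replication_factor):
    """Number of replica acknowledgements needed for a quorum."""
    return replication_factor // 2 + 1


def replicas_not_waited_for(replication_factor):
    """Replicas whose acknowledgement the coordinator does not wait for."""
    return replication_factor - quorum(replication_factor)


rf = 3
print(quorum(rf), replicas_not_waited_for(rf))
```

If the two replicas that acknowledged are lost before the third one has applied the write, no surviving copy is guaranteed to exist, which is the scenario the thread is asking about.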

Predicting Read/Write Latency as a Function of Total Requests & Cluster Size

2019-12-10 Thread Fred Habash
I'm looking for an empirical way to answer these two questions: 1. If I increase application workload (read/write requests) by some percentage, how is it going to affect read/write latency? Of course, all other factors remaining constant, e.g. ec2 instance class, ssd specs, number of nodes, etc.

Measuring Cassandra Metrics at the Session/Connection Level

2019-12-12 Thread Fred Habash
Hi all ... We are facing a scenario where we have to measure some metrics on a per-connection or per-client basis. For example, the count of read/write requests by client IP/host/user/program. We want to know the source of C* requests for budgeting, capacity planning, or charge-backs. We are running

Re: Soon After Starting c* Process: CPU 100% for java Process

2021-07-01 Thread Fred Habash
at startup.
>
> On Thu, Jul 1, 2021 at 5:14 AM Fred Habash wrote:
>
>> I have node in cluster when I start c*, the cpu reaches 100% with java
>> process on top. Within a few minutes, jvm crashes (jvm instability)
>> messages in system.log and c* crashes.

Soon After Starting c* Process: CPU 100% for java Process

2021-06-30 Thread Fred Habash
I have a node in the cluster where, when I start c*, the CPU reaches 100% with the java process on top. Within a few minutes, JVM crash (JVM instability) messages appear in system.log and c* crashes. Once c* is up, cluster average read latency reaches multiple seconds and client apps are unhappy. For now, the only way out

C* Java Clients Getting 'java.lang.IllegalStateException: Queue full'

2023-06-16 Thread Fred Habash
A Java service client app reported getting this error message. Initially, I thought C* was emitting the error back to the client. But searching the C* logs (system/gc/debug) for 'queue full' or some variation of it returned zero instances. I have seen some log snippets on the web where

Re: C* Java Clients Getting 'java.lang.IllegalStateException: Queue full'

2023-06-19 Thread Fred Habash
Just wondering if my inquiry requires further details to warrant some interest. Hope someone else out there has had a similar experience.

On Fri, Jun 16, 2023 at 2:55 PM Fred Habash wrote:
> A java service client app reported getting this error message. Initially,
> I thought of it