Any other ideas? If I simply stop the node, there is no latency problem,
but once I start the node the problem appears. This happens consistently
for all nodes in the cluster.

On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra <mto...@salesforce.com> wrote:

> No, I am not
>
> On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Are you using internode ssl?
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 7, 2018, at 8:24 AM, Mike Torra <mto...@salesforce.com> wrote:
>>
>> Thanks for the feedback guys. That example data model was indeed
>> abbreviated - the real queries have the partition key in them. I am using
>> RF 3 on the keyspace, so I don't think a node being down would mean the key
>> I'm looking for would be unavailable. The load balancing policy of the
>> driver seems correct (https://docs.datastax.com/en/developer/nodejs-driver/3.4/features/tuning-policies/#load-balancing-policy,
>> and I am using the default `TokenAware` policy with `DCAwareRoundRobinPolicy`
>> as a child), but I will look more closely at the implementation.
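>>
>> For reference, here is roughly what the policy looks like when configured
>> explicitly, which should be equivalent to the driver's default (the contact
>> points, keyspace, and local DC name below are just placeholders):
>>
>> const cassandra = require('cassandra-driver');
>> const lb = cassandra.policies.loadBalancing;
>>
>> // Token-aware routing on top of DC-aware round robin, scoped to the local DC
>> const client = new cassandra.Client({
>>   contactPoints: ['10.0.0.1', '10.0.0.2'],
>>   keyspace: 'mykeyspace',
>>   policies: {
>>     loadBalancing: new lb.TokenAwarePolicy(new lb.DCAwareRoundRobinPolicy('us-east'))
>>   }
>> });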
>>
>> It was an oversight of mine to not include `nodetool disablebinary`, but
>> I still experience the same issue with that.
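>>
>> The sequence I've been trying now looks roughly like this (the original
>> commands with `disablebinary` added up front):
>>
>> nodetool disablebinary && nodetool disablethrift && nodetool disablegossip && nodetool drain
>> sudo service cassandra restart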
>>
>> One other thing I've noticed is that after restarting a node and seeing
>> application latency, I also see that the node I just restarted sees many
>> other nodes in the same DC as being down (ie status 'DN'). However,
>> checking `nodetool status` on those other nodes shows all nodes as
>> up/normal. To me this could kind of explain the problem - node comes back
>> online, thinks it is healthy but many others are not, so it gets traffic
>> from the client application. But then it gets requests for ranges that
>> belong to a node it thinks is down, so it responds with an error. The
>> latency issue seems to start roughly when the node goes down, but it persists
>> for a while (ie 15-20 mins) after the node is back online and accepting connections. It
>> seems to go away once the bounced node shows the other nodes in the same DC
>> as up again.
>>
>> As for speculative retry, my CF is using the default of '99th
>> percentile'. I could try something different there, but the other nodes being
>> seen as down seems like the more fundamental issue.
>>
>> On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Unless you abbreviated, your data model is questionable (SELECT without
>>> any equality in the WHERE clause on the partition key will always cause a
>>> range scan, which is super inefficient). Since you're doing LOCAL_ONE and a
>>> range scan, timeouts sorta make sense - the owner of at least one range
>>> would be down for a bit.
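>>>
>>> As a rough sketch with the nodejs driver (assuming a hypothetical partition
>>> key column `id`, with `ts` as a clustering column), the difference looks
>>> like this:
>>>
>>> // No partition key restriction: every token range has to be consulted,
>>> // so a single down/bouncing node can make the query fail or time out.
>>> // Cassandra will also insist on ALLOW FILTERING for this form.
>>> client.execute(
>>>   'SELECT * FROM history WHERE ts > ? ALLOW FILTERING',
>>>   [since], { prepare: true });
>>>
>>> // Partition-keyed query: a token-aware driver routes this straight to a
>>> // replica that owns `id`, so it can simply skip the bouncing node.
>>> client.execute(
>>>   'SELECT * FROM history WHERE id = ? AND ts > ?',
>>>   [id, since], { prepare: true });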
>>>
>>> If you actually have a partition key in your where clause, then the next
>>> most likely guess is your clients aren't smart enough to route around the
>>> node as it restarts, or your key cache is getting cold during the bounce.
>>> Double check your driver's load balancing policy.
>>>
>>> It's also likely the case that speculative retry may help other nodes
>>> route around the bouncing instance better - if you're not using it, you
>>> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less
>>> of an issue).
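>>>
>>> For example (keyspace and table names here are just placeholders), it can be
>>> tuned per table with something like:
>>>
>>> ALTER TABLE mykeyspace.history WITH speculative_retry = '95percentile';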
>>>
>>> We need to make bouncing nodes easier (or rather, we need to make drain
>>> do the right thing), but in this case, your data model looks like the
>>> biggest culprit (unless it's an incomplete recreation).
>>>
>>> - Jeff
>>>
>>>
>>> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra <mto...@salesforce.com>
>>> wrote:
>>>
>>>> Hi -
>>>>
>>>> I am running a 29-node cluster spread over 4 DCs in EC2, using C*
>>>> 3.11.1 on Ubuntu. Occasionally I have the need to restart nodes in the
>>>> cluster, but every time I do, I see errors and application (nodejs)
>>>> timeouts.
>>>>
>>>> I restart a node like this:
>>>>
>>>> nodetool disablethrift && nodetool disablegossip && nodetool drain
>>>> sudo service cassandra restart
>>>>
>>>> When I do that, I very often get timeouts and errors like this in my
>>>> nodejs app:
>>>>
>>>> Error: Cannot achieve consistency level LOCAL_ONE
>>>>
>>>> My queries are all pretty much the same, things like: "select * from
>>>> history where ts > {current_time}"
>>>>
>>>> The errors and timeouts seem to go away on their own after a while, but
>>>> it is frustrating because I can't track down what I am doing wrong!
>>>>
>>>> I've tried waiting between steps of shutting down cassandra, and I've
>>>> tried stopping, waiting, then starting the node. One thing I've noticed is
>>>> that even after `nodetool drain`ing the node, there are open connections to
>>>> other nodes in the cluster (ie looking at the output of netstat) until I
>>>> stop cassandra. I don't see any errors or warnings in the logs.
>>>>
>>>> What can I do to prevent this? Is there something else I should be
>>>> doing to gracefully restart the cluster? It could be something to do with
>>>> the nodejs driver, but I can't find anything there to try.
>>>>
>>>> I appreciate any suggestions or advice.
>>>>
>>>> - Mike
>>>>
>>>
>>>
>>
>
