Re: system_auth keyspace replication factor

2018-11-23 Thread Vitali Dyachuk
Attaching the runner log snippet, where we can see that "Rebuilding token
map" took most of the time.
getAllRoles is using QUORUM; I don't know whether it is used during login:
https://github.com/apache/cassandra/blob/cc12665bb7645d17ba70edcf952ee6a1ea63127b/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L260
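
For context, a QUORUM read must be acknowledged by a majority of all replicas
of the row (the per-DC replication factors summed), so raising the system_auth
RF from 3 to 60 greatly increases the number of nodes that have to answer each
such internal auth query. As a rough illustration only (this uses the DataStax
Java driver rather than the server-side code linked above, and "node1" is a
placeholder contact point), a QUORUM read of the same table looks like this:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class RoleQuorumReadSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("node1").build();
            Session session = cluster.connect();
            // Same table CassandraRoleManager reads; at QUORUM a majority of
            // all replicas must acknowledge the read before it completes.
            SimpleStatement stmt =
                    new SimpleStatement("SELECT role FROM system_auth.roles");
            stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {
                System.out.println(row.getString("role"));
            }
            cluster.close();
        }
    }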

Vitali Djatsuk,
On Fri, Nov 23, 2018 at 8:32 PM Jeff Jirsa  wrote:

> I suspect some of the intermediate queries (determining role, etc) happen
> at quorum in 2.2+, but I don’t have time to go read the code and prove it.
>
> In any case, RF > 10 per DC is probably excessive
>
> Also want to crank up the validity times so it uses cached info longer
>
>
> --
> Jeff Jirsa
>
>
> On Nov 23, 2018, at 10:18 AM, Vitali Dyachuk  wrote:
>
> No, it's not the cassandra user, and as I understood all other users log in
> with LOCAL_ONE.
>
On Fri, 23 Nov 2018, 19:30 Jonathan Haddad  wrote:
>> Any chance you’re logging in with the Cassandra user? It uses quorum
>> reads.
>>
>>
>> On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk 
>> wrote:
>>
>>> Hi,
>>> We recently ran into a problem when we added 60 nodes in one region to
>>> the cluster and set RF=60 for the system_auth keyspace, following this
>>> documentation:
>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>>> However, we then started to see login latencies in the cluster about 5x
>>> higher than before changing the RF of the system_auth keyspace.
>>> We have a Cassandra runner written in C# running against the cluster;
>>> when analyzing its logs we noticed that "Rebuilding token map" takes
>>> most of the time, ~20s.
>>> When we changed the RF to 3, the issue was resolved.
>>> We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
>>> version="3.2.1".
>>> I've found a ticket that seems somewhat related to my problem,
>>> https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
>>> tickets say that the issue with the token map rebuild time was fixed in
>>> previous versions of the driver.
>>> So my question is: what is the recommended replication factor for the
>>> system_auth keyspace?
>>>
>>> Regards,
>>> Vitali Djatsuk.
>>>
>>>
>>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade
>>
>
ControlConnection: 11/22/2018 10:30:32.170 +00:00 : Trying to connect the ControlConnection
TcpSocket: 11/22/2018 10:30:32.170 +00:00 Socket connected, starting SSL client authentication
TcpSocket: 11/22/2018 10:30:32.170 +00:00 Starting SSL authentication
TcpSocket: 11/22/2018 10:30:32.217 +00:00 SSL authentication successful
Connection: 11/22/2018 10:30:32.217 +00:00 Sending #0 for StartupRequest to node1:9042
Connection: 11/22/2018 10:30:32.217 +00:00 Received #0 from node1:9042
Connection: 11/22/2018 10:30:32.217 +00:00 Sending #0 for AuthResponseRequest to node1:9042
Connection: 11/22/2018 10:30:32.329 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.329 +00:00 : Connection established to node1:9042
Connection: 11/22/2018 10:30:32.329 +00:00 Sending #0 for RegisterForEventRequest to node1:9042
Connection: 11/22/2018 10:30:32.329 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.329 +00:00 : Refreshing node list
Connection: 11/22/2018 10:30:32.329 +00:00 Sending #0 for QueryRequest to node1:9042
Connection: 11/22/2018 10:30:32.342 +00:00 Received #0 from node1:9042
Connection: 11/22/2018 10:30:32.342 +00:00 Sending #0 for QueryRequest to node1:9042
Connection: 11/22/2018 10:30:32.373 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.389 +00:00 : Node list retrieved successfully
ControlConnection: 11/22/2018 10:30:32.389 +00:00 : Retrieving keyspaces metadata
Connection: 11/22/2018 10:30:32.389 +00:00 Sending #0 for QueryRequest to node1:9042
Connection: 11/22/2018 10:30:32.389 +00:00 Received #0 from node1:9042
ControlConnection: 11/22/2018 10:30:32.389 +00:00 : Rebuilding token map
Cluster: 11/22/2018 10:30:55.233 +00:00 : Cluster Connected using binary protocol version: [4]


Re: system_auth keyspace replication factor

2018-11-23 Thread Jeff Jirsa
I suspect some of the intermediate queries (determining role, etc) happen at 
quorum in 2.2+, but I don’t have time to go read the code and prove it. 

In any case, RF > 10 per DC is probably excessive

Also want to crank up the validity times so it uses cached info longer
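
For reference, the validity times mentioned here are the auth cache settings in
cassandra.yaml. A sketch of what raising them could look like (the values are
purely illustrative, not a recommendation from this thread):

    # cassandra.yaml -- auth cache tuning (illustrative values only)
    # How long fetched role/permission data may be served from the cache.
    roles_validity_in_ms: 120000
    permissions_validity_in_ms: 120000
    # Optionally refresh cached entries in the background before they expire,
    # so logins keep hitting the cache instead of the system_auth replicas.
    roles_update_interval_in_ms: 30000
    permissions_update_interval_in_ms: 30000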


-- 
Jeff Jirsa


> On Nov 23, 2018, at 10:18 AM, Vitali Dyachuk  wrote:
> 
> No, it's not the cassandra user, and as I understood all other users log in
> with LOCAL_ONE.
> 
>> On Fri, 23 Nov 2018, 19:30 Jonathan Haddad  wrote:
>> Any chance you're logging in with the Cassandra user? It uses quorum reads.
>> 
>> 
>>> On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk  wrote:
>>> Hi,
>>> We recently ran into a problem when we added 60 nodes in one region to
>>> the cluster and set RF=60 for the system_auth keyspace, following this
>>> documentation:
>>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>>> However, we then started to see login latencies in the cluster about 5x
>>> higher than before changing the RF of the system_auth keyspace.
>>> We have a Cassandra runner written in C# running against the cluster;
>>> when analyzing its logs we noticed that "Rebuilding token map" takes
>>> most of the time, ~20s.
>>> When we changed the RF to 3, the issue was resolved.
>>> We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
>>> version="3.2.1".
>>> I've found a ticket that seems somewhat related to my problem,
>>> https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
>>> tickets say that the issue with the token map rebuild time was fixed in
>>> previous versions of the driver.
>>> So my question is: what is the recommended replication factor for the
>>> system_auth keyspace?
>>> 
>>> Regards,
>>> Vitali Djatsuk.
>>> 
>>> 
>> -- 
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade


Re: system_auth keyspace replication factor

2018-11-23 Thread Vitali Dyachuk
No, it's not the cassandra user, and as I understood all other users log in
with LOCAL_ONE.

On Fri, 23 Nov 2018, 19:30 Jonathan Haddad  wrote:
> Any chance you're logging in with the Cassandra user? It uses quorum
> reads.
>
>
> On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk 
> wrote:
>
>> Hi,
>> We recently ran into a problem when we added 60 nodes in one region to
>> the cluster and set RF=60 for the system_auth keyspace, following this
>> documentation:
>> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
>> However, we then started to see login latencies in the cluster about 5x
>> higher than before changing the RF of the system_auth keyspace.
>> We have a Cassandra runner written in C# running against the cluster;
>> when analyzing its logs we noticed that "Rebuilding token map" takes
>> most of the time, ~20s.
>> When we changed the RF to 3, the issue was resolved.
>> We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
>> version="3.2.1".
>> I've found a ticket that seems somewhat related to my problem,
>> https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
>> tickets say that the issue with the token map rebuild time was fixed in
>> previous versions of the driver.
>> So my question is: what is the recommended replication factor for the
>> system_auth keyspace?
>>
>> Regards,
>> Vitali Djatsuk.
>>
>>
>> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


Re: system_auth keyspace replication factor

2018-11-23 Thread Jonathan Haddad
Any chance you’re logging in with the Cassandra user? It uses quorum reads.


On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk  wrote:

> Hi,
> We recently ran into a problem when we added 60 nodes in one region to
> the cluster and set RF=60 for the system_auth keyspace, following this
> documentation:
> https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
> However, we then started to see login latencies in the cluster about 5x
> higher than before changing the RF of the system_auth keyspace.
> We have a Cassandra runner written in C# running against the cluster;
> when analyzing its logs we noticed that "Rebuilding token map" takes
> most of the time, ~20s.
> When we changed the RF to 3, the issue was resolved.
> We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
> version="3.2.1".
> I've found a ticket that seems somewhat related to my problem,
> https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
> tickets say that the issue with the token map rebuild time was fixed in
> previous versions of the driver.
> So my question is: what is the recommended replication factor for the
> system_auth keyspace?
>
> Regards,
> Vitali Djatsuk.
>
>
> --
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


system_auth keyspace replication factor

2018-11-23 Thread Vitali Dyachuk
Hi,
We recently ran into a problem when we added 60 nodes in one region to
the cluster and set RF=60 for the system_auth keyspace, following this
documentation:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html
However, we then started to see login latencies in the cluster about 5x
higher than before changing the RF of the system_auth keyspace.
We have a Cassandra runner written in C# running against the cluster;
when analyzing its logs we noticed that "Rebuilding token map" takes
most of the time, ~20s.
When we changed the RF to 3, the issue was resolved.
We are using C* 3.0.17, 4 DCs, system_auth RF=3, "CassandraCSharpDriver"
version="3.2.1".
I've found a ticket that seems somewhat related to my problem,
https://datastax-oss.atlassian.net/browse/CSHARP-436, but its related
tickets say that the issue with the token map rebuild time was fixed in
previous versions of the driver.
So my question is: what is the recommended replication factor for the
system_auth keyspace?

Regards,
Vitali Djatsuk.
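
For illustration, bringing system_auth back down to a small per-DC replication
factor (as was eventually done in this thread) is a single ALTER KEYSPACE
statement; the data centre names below are placeholders, and after changing the
RF the keyspace should be repaired (e.g. nodetool repair system_auth on each
node) so the remaining replicas are consistent:

    ALTER KEYSPACE system_auth
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3, 'dc2': 3, 'dc3': 3, 'dc4': 3
      };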


Re: [EXTERNAL] Availability issues for write/update/read workloads (up to 100s downtime) in case of a Cassandra node failure

2018-11-23 Thread Daniel Seybold

Hi Alexander,

thanks a lot for the pointers, I checked the mentioned issue.

While the reported issue seems to match our problem, it only occurs for
reads and not for writes (according to the DataStax Jira). But we
experience downtimes for both writes and reads.



Which version of the Datastax Driver are you using for your tests?

We use version 3.0.0

I have also tried version 3.2.0 to avoid the JAVA-1346 issue you
mentioned, but we still see the same behaviour with respect to the downtime.



How is it configured (load balancing policies, etc...) ?

Besides the write consistency of ONE it uses the default settings.

As we use YCSB as the workload for our experiments, you can have a look
at the driver settings in the base class:
https://github.com/brianfrankcooper/YCSB/blob/master/cassandra/src/main/java/com/yahoo/ycsb/db/CassandraCQLClient.java
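
For reference, a minimal sketch of how the relevant options could be set
explicitly on the DataStax Java driver that the YCSB client builds on; apart
from the ONE consistency mentioned above, this just spells out the driver
defaults, and the contact point is a placeholder:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.SocketOptions;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class DriverConfigSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")  // placeholder contact point
                    // Default policy: token-aware routing over DC-aware round robin.
                    .withLoadBalancingPolicy(
                            new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                    // Consistency level used for the writes in these experiments.
                    .withQueryOptions(
                            new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE))
                    // Default read timeout; requests stuck longer than this fail
                    // on the client side.
                    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(12000))
                    .build();
            System.out.println("Connected to: " + cluster.getMetadata().getClusterName());
            cluster.close();
        }
    }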




Do you have some debug logs on the client side that could help?

On the client side, the logs show no exceptions or any suspicious messages.

I also turned on tracing but didn't find any suspicious messages
(though I did not spend much time on that, and I am no expert on the
Cassandra driver).


If more detailed logs or traces would help to further investigate
the issue, let me know and I will rerun the experiments to produce
them.


Many thanks again for your help.

Cheers,

Daniel


On 16.11.2018 15:08, Alexander Dejanovski wrote:

Hi Daniel,

it seems like the driver isn't detecting that the node went down,
which is probably due to the way the node is being killed.
If I remember correctly, in some cases the Netty transport is still up in
the client, which still allows queries to be sent without them ever
being answered: https://datastax-oss.atlassian.net/browse/JAVA-1346

Eventually, the node gets discarded when the heartbeat system catches up.
It's also possible that the stuck queries then eat up all the
available slots in the driver, preventing any other query from being
sent in that JVM.
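
If that is what happens, the two driver-side knobs involved are the connection
heartbeat interval and the per-connection request slots. A hedged sketch with
the Java driver 3.x (the values are illustrative and the contact point is a
placeholder):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;

    public class PoolingSketch {
        public static void main(String[] args) {
            PoolingOptions pooling = new PoolingOptions()
                    // Probe idle connections more often than the 30s default so
                    // a half-dead node is noticed by the driver sooner.
                    .setHeartBeatIntervalSeconds(10)
                    // Cap on in-flight requests per connection; stuck requests
                    // count against this limit.
                    .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024);

            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")  // placeholder contact point
                    .withPoolingOptions(pooling)
                    .build();
            cluster.close();
        }
    }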


Which version of the Datastax Driver are you using for your tests?
How is it configured (load balancing policies, etc...) ?
Do you have some debug logs on the client side that could help?

Thanks,


On Fri, Nov 16, 2018 at 1:19 PM Daniel Seybold
<daniel.seyb...@uni-ulm.de> wrote:


Hi Sean,

thanks for your comments, find below some more details with
respect to the (1) VM sizing and (2) the replication factor:

(1) VM sizing:

We selected the small VMs as the initial setup to run our experiments.
We have also executed the same experiments (5 nodes) on larger VMs
with 6 cores and 12GB memory (where 6GB was allocated to Cassandra).

We use the default CMS garbage collector (with default settings),
and the debug.log and system.log do not show any suspicious GC
messages.

(2) Replication factor

We set the RF to 5 as we want to emulate a scenario which is able
to survive multiple node failures. We have also tried an RF of 3
(in the 5-node cluster), but the downtime in case of a node failure
persists.


I have also attached two plots which show the results, including the
downtimes, when using the larger VMs and when setting the RF to 3.

Any further comments much appreciated,

Cheers,
Daniel


On 09.11.2018 19:04, Durity, Sean R wrote:


The VMs’ memory (4 GB) seems pretty small for Cassandra. What
heap size are you using? Which garbage collector? Are you seeing
long GC times on the nodes? The basic rule of thumb is to give
the Cassandra heap 50% of the RAM on the host. 2 GB isn’t very much.
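
As a sketch of that suggestion (illustrative values only: roughly half the RAM
of the larger 12GB VMs mentioned above, set in conf/jvm.options):

    # jvm.options -- explicit heap sizing (illustrative only)
    -Xms6G
    -Xmx6G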

Also, I wouldn’t set the replication factor to 5 (the number of
nodes). If RF is always equal to the number of nodes, you can’t
really scale beyond the size of the disk on any one node (all
data is on each node). A replication factor of 3 would be more
like a typical production set-up.

Sean Durity

*From:* Daniel Seybold
*Sent:* Friday, November 09, 2018 5:49 AM
*To:* user@cassandra.apache.org 
*Subject:* [EXTERNAL] Availability issues for write/update/read
workloads (up to 100s downtime) in case of a Cassandra node failure

Hi Apache Cassandra experts,

we are running a set of availability evaluations under
write/read/update workloads with Apache Cassandra and are experiencing
some unexpected results, i.e. 0 ops/s over periods of up to 100s.

In order to provide a clear picture, find below the details of (1)
the setup and (2) the evaluation workflow.

*1. Setup:*

Cassandra version: 3.11.2
Cluster size: 5 nodes
Replication Factor: 5
Each node runs in the same private OpenStack-based cloud, within
the same availability zone, and uses the private network.
Each node runs Ubuntu 16.04 server and has 2 cores, 4GB
RAM and 50GB of disk.

Workload:
Yahoo Cloud Serving Benchmark 0.12
W1: 100% write
W2: 100% read