[jira] [Comment Edited] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-02-28 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821743#comment-17821743
 ] 

Alex Petrov edited comment on CASSANDRA-17401 at 2/28/24 4:17 PM:
--

[~paulo] I have run both the regular and mixed-mode fuzz tests and could not make 
them fail with this change. I've also looked at the code again, and it looks like 
in the cases where we would previously evict, the new version would just insert 
the missing entry. That said, I do not see this as a correctness issue (not that 
it should not still be improved upon), so I would probably focus on 17248.
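
For illustration, a rough sketch of the two behaviours being described; the class, 
field, and method names below are hypothetical stand-ins, not the actual 
QueryProcessor code. The old flow evicted before re-preparing, while the new flow 
only inserts the missing entry:
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the prepared-statement cache; illustration only.
class PreparedCacheSketch
{
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    private String hash(String query)    { return Integer.toHexString(query.hashCode()); }
    private String compile(String query) { return "prepared:" + query; }

    // Old flow (simplified): evict, then prepare and cache. A second thread running
    // the same eviction after this thread has cached and returned its id can remove
    // the entry the client is about to execute with.
    String prepareOld(String query)
    {
        String id = hash(query);
        cache.remove(id);                  // eviction step
        cache.put(id, compile(query));
        return id;
    }

    // New flow (simplified): no eviction on prepare, just insert the missing entry,
    // so a concurrent prepare of the same statement cannot remove an id that was
    // already handed back to a client.
    String prepareNew(String query)
    {
        String id = hash(query);
        cache.computeIfAbsent(id, k -> compile(query));
        return id;
    }
}
{code}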


was (Author: ifesdjeen):
[~paulo] I have run both tests and could not make them fail with this change. 
I've also looked at the code again, and it looks like in the cases where we would 
previously evict, the new version would just insert the missing entry. That said, 
I do not see this as a correctness issue (not that it should not still be 
improved upon), so I would probably focus on 17248.

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Ivan Senic
>Assignee: Jaydeepkumar Chovatia
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause the removal of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but the statement is not in the cache, as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> the client side is highly unlikely and I cannot simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when safeToReturnCached is resolved as 
> false.
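
The interleaving above can also be shown in a self-contained snippet; the cache, 
id, and latch-based ordering below are hypothetical illustrations (not the 
Cassandra test suite), but they force exactly the sequence described:
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;

// Illustration only: hypothetical stand-ins for the prepared-statement cache and id.
public class PrepareRaceSketch
{
    static final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    static final String ID = "md5-of-query";            // stands in for the MD5Digest

    public static void main(String[] args) throws Exception
    {
        CountDownLatch thread1Cached = new CountDownLatch(1);
        CountDownLatch thread2Evicted = new CountDownLatch(1);

        Thread thread1 = new Thread(() -> {
            cache.remove(ID);                            // Thread1 evicts
            cache.put(ID, "prepared statement");         // Thread1 prepares and caches
            thread1Cached.countDown();                   // Thread1 returns the id to its client
            await(thread2Evicted);
            // Thread1's client now executes with the returned id: prints "false"
            System.out.println("statement still cached? " + cache.containsKey(ID));
        });

        Thread thread2 = new Thread(() -> {
            await(thread1Cached);                        // Thread2 resolved safeToReturnCached=false earlier
            cache.remove(ID);                            // Thread2 evicts AFTER Thread1 cached and returned
            thread2Evicted.countDown();
        });

        thread1.start(); thread2.start();
        thread1.join();  thread2.join();
    }

    static void await(CountDownLatch latch)
    {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}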






[jira] [Comment Edited] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-22 Thread Long Pan (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809755#comment-17809755
 ] 

Long Pan edited comment on CASSANDRA-17401 at 1/23/24 6:39 AM:
---

[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Basically, a 
high QPS and a large number of client connections are necessary to reproduce it. 
Here are the steps:

*Server Setup (Cassandra 4.0.6):*
A 3-node Cassandra cluster. Each node with 64 GB memory (16 GB heap), 7 CPU 
cores, and {{native_transport_max_threads = 1024}}.

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = \{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text, PRIMARY KEY (p_id, c_id) );

*Client Setup:*
30 hosts (12 CPU cores per host). Each host runs the following pseudo-code, using 
the *GoCql* client:
{code:java}
cluster.CQLVersion = "3.4.0"
cluster.ProtoVersion = 4
cluster.Timeout = 5s
cluster.ConnectTimeout = 10s
cluster.NumConns = 3
cluster.Consistency = LocalQuorum
cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
cluster.SocketKeepalive = 20s
cluster.HostSelectionPolicy = RoundRobinHostPolicy

sessionCount = 30
qpsPerSession = 30
cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"

// Each session sends rate-limited reads from its own goroutine.
for (i = 0; i < sessionCount; i++) {
    session = cluster.createSession
    rateLimiter = NewRateLimiter(qpsPerSession)
    newGoRoutine.run( sendReads(session, rateLimiter) )
}

// Issues one read per rate-limiter permit, each in its own goroutine.
sendReads(session, rateLimiter) {
    for {
        newGoRoutine.run(
            if (rateLimiter.allow) {
                session.execute(cqlQuery, randomString, randomString)
            }
        )
    }
}
{code}
Traffic generated this way will result in ~10K coordinator QPS and ~3K client 
connections per Cassandra node.

*Trigger Point:*
Manually issue a CQL query to add a column to the table: "ALTER TABLE 
test_ks.table1 ADD new_col text;"

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes show the number of 
native_transport threads reaching {{native_transport_max_threads}}, and pending 
native transport tasks grow endlessly.


was (Author: JIRAUSER303782):
[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Basically, a 
high QPS and a large number of client connections are necessary to reproduce it. 
Here are the steps:

*Server Setup (Cassandra 4.0.6):*
A 3-node Cassandra cluster. Each node with 64 GB memory (16 GB heap), 7 CPU 
cores, and {{native_transport_max_threads = 1024}}.

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = \{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text, PRIMARY KEY (p_id, c_id) );

*Client Setup:*
30 hosts. Each host runs the following pseudo-code, using the *GoCql* client:
{code:java}
cluster.CQLVersion = "3.4.0"
cluster.ProtoVersion = 4
cluster.Timeout = 5s
cluster.ConnectTimeout = 10s
cluster.NumConns = 3
cluster.Consistency = LocalQuorum
cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
cluster.SocketKeepalive = 20s
cluster.HostSelectionPolicy = RoundRobinHostPolicy

sessionCount = 30
qpsPerSession = 30
cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"

// Each session sends rate-limited reads from its own goroutine.
for (i = 0; i < sessionCount; i++) {
    session = cluster.createSession
    rateLimiter = NewRateLimiter(qpsPerSession)
    newGoRoutine.run( sendReads(session, rateLimiter) )
}

// Issues one read per rate-limiter permit, each in its own goroutine.
sendReads(session, rateLimiter) {
    for {
        newGoRoutine.run(
            if (rateLimiter.allow) {
                session.execute(cqlQuery, randomString, randomString)
            }
        )
    }
}
{code}
Traffic generated this way will result in ~10K coordinator QPS and ~3K client 
connections per Cassandra node.

*Trigger Point:*
Manually issue a CQL query to add a column to the table: "ALTER TABLE 
test_ks.table1 ADD new_col text;"

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes show the number of 
native_transport threads reaching {{native_transport_max_threads}}, and pending 
native transport tasks grow endlessly.

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> 

[jira] [Comment Edited] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-22 Thread Long Pan (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809755#comment-17809755
 ] 

Long Pan edited comment on CASSANDRA-17401 at 1/23/24 6:36 AM:
---

[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Basically, a 
high QPS and a large number of client connections are necessary to reproduce it. 
Here are the steps:

*Server Setup (Cassandra 4.0.6):*
A 3-node Cassandra cluster. Each node with 64 GB memory (16 GB heap), 7 CPU 
cores, and {{native_transport_max_threads = 1024}}.

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = \{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text, PRIMARY KEY (p_id, c_id) );

*Client Setup:*
30 hosts. Each host runs the following pseudo-code, using the *GoCql* client:
{code:java}
cluster.CQLVersion = "3.4.0"
cluster.ProtoVersion = 4
cluster.Timeout = 5s
cluster.ConnectTimeout = 10s
cluster.NumConns = 3
cluster.Consistency = LocalQuorum
cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
cluster.SocketKeepalive = 20s
cluster.HostSelectionPolicy = RoundRobinHostPolicy

sessionCount = 30
qpsPerSession = 30
cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"

// Each session sends rate-limited reads from its own goroutine.
for (i = 0; i < sessionCount; i++) {
    session = cluster.createSession
    rateLimiter = NewRateLimiter(qpsPerSession)
    newGoRoutine.run( sendReads(session, rateLimiter) )
}

// Issues one read per rate-limiter permit, each in its own goroutine.
sendReads(session, rateLimiter) {
    for {
        newGoRoutine.run(
            if (rateLimiter.allow) {
                session.execute(cqlQuery, randomString, randomString)
            }
        )
    }
}
{code}
Traffic generated this way will result in ~10K coordinator QPS and ~3K client 
connections per Cassandra node.

*Trigger Point:*
Manually issue a CQL query to add a column to the table: "ALTER TABLE 
test_ks.table1 ADD new_col text;"

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes show the number of 
native_transport threads reaching {{native_transport_max_threads}}, and pending 
native transport tasks grow endlessly.


was (Author: JIRAUSER303782):
[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Basically, a 
high QPS and a large number of client connections are necessary to reproduce it. 
Here are the steps:

*Server Setup:*
A 3-node Cassandra cluster. Each node with 64 GB memory (16 GB heap), 7 CPU 
cores, and {{native_transport_max_threads = 1024}}.

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = \{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text, PRIMARY KEY (p_id, c_id) );

*Client Setup:*
30 hosts. Each host runs the following pseudo-code, using the *GoCql* client:
{code:java}
cluster.CQLVersion = "3.4.0"
cluster.ProtoVersion = 4
cluster.Timeout = 5s
cluster.ConnectTimeout = 10s
cluster.NumConns = 3
cluster.Consistency = LocalQuorum
cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
cluster.SocketKeepalive = 20s
cluster.HostSelectionPolicy = RoundRobinHostPolicy

sessionCount = 30
qpsPerSession = 30
cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"

// Each session sends rate-limited reads from its own goroutine.
for (i = 0; i < sessionCount; i++) {
    session = cluster.createSession
    rateLimiter = NewRateLimiter(qpsPerSession)
    newGoRoutine.run( sendReads(session, rateLimiter) )
}

// Issues one read per rate-limiter permit, each in its own goroutine.
sendReads(session, rateLimiter) {
    for {
        newGoRoutine.run(
            if (rateLimiter.allow) {
                session.execute(cqlQuery, randomString, randomString)
            }
        )
    }
}
{code}
Traffic generated this way will result in ~10K coordinator QPS and ~3K client 
connections per Cassandra node.

*Trigger Point:*
Manually issue a CQL query to add a column to the table: "ALTER TABLE 
test_ks.table1 ADD new_col text;"

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes show the number of 
native_transport threads reaching {{native_transport_max_threads}}, and pending 
native transport tasks grow endlessly.

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> 

[jira] [Comment Edited] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-22 Thread Long Pan (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809755#comment-17809755
 ] 

Long Pan edited comment on CASSANDRA-17401 at 1/23/24 6:35 AM:
---

[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Basically, a 
high QPS and a large number of client connections are necessary to reproduce it. 
Here are the steps:

*Server Setup:*
A 3-node Cassandra cluster. Each node with 64 GB memory (16 GB heap), 7 CPU 
cores, and {{native_transport_max_threads = 1024}}.

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = \{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text, PRIMARY KEY (p_id, c_id) );

*Client Setup:*
30 hosts. Each host runs the following pseudo-code, using the *GoCql* client:
{code:java}
cluster.CQLVersion = "3.4.0"
cluster.ProtoVersion = 4
cluster.Timeout = 5s
cluster.ConnectTimeout = 10s
cluster.NumConns = 3
cluster.Consistency = LocalQuorum
cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
cluster.SocketKeepalive = 20s
cluster.HostSelectionPolicy = RoundRobinHostPolicy

sessionCount = 30
qpsPerSession = 30
cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"

// Each session sends rate-limited reads from its own goroutine.
for (i = 0; i < sessionCount; i++) {
    session = cluster.createSession
    rateLimiter = NewRateLimiter(qpsPerSession)
    newGoRoutine.run( sendReads(session, rateLimiter) )
}

// Issues one read per rate-limiter permit, each in its own goroutine.
sendReads(session, rateLimiter) {
    for {
        newGoRoutine.run(
            if (rateLimiter.allow) {
                session.execute(cqlQuery, randomString, randomString)
            }
        )
    }
}
{code}
Traffic generated this way will result in ~10K coordinator QPS and ~3K client 
connections per Cassandra node.

*Trigger Point:*
Manually issue a CQL query to add a column to the table: "ALTER TABLE 
test_ks.table1 ADD new_col text;"

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes show the number of 
native_transport threads reaching {{native_transport_max_threads}}, and pending 
native transport tasks grow endlessly.


was (Author: JIRAUSER303782):
[~chovatia.jayd...@gmail.com] and I managed to reproduce the issue. Basically, a 
high QPS and a large number of client connections are necessary to reproduce it. 
Here are the steps:

*Server Setup:*
A 3-node Cassandra cluster. Each node with 64 GB memory (16 GB heap), 7 CPU 
cores, and {{native_transport_max_threads = 1024}}.

*Keyspace/Table:*
CREATE KEYSPACE test_ks WITH REPLICATION = \{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
CREATE TABLE test_ks.table1 ( p_id text, c_id text, v text, PRIMARY KEY (p_id, c_id) );

*Client Setup:*
30 hosts. Each host runs the following pseudo-code, using the *GoCql* client:
{code:java}
cluster.CQLVersion = "3.4.0"
cluster.ProtoVersion = 4
cluster.Timeout = 5s
cluster.ConnectTimeout = 10s
cluster.NumConns = 3
cluster.Consistency = LocalQuorum
cluster.RetryPolicy = SimpleRetryPolicy{NumRetries: 1}
cluster.SocketKeepalive = 20s
cluster.HostSelectionPolicy = RoundRobinHostPolicy

sessionCount = 30
qpsPerSession = 30
cqlQuery = "SELECT p_id,c_id,v FROM test_ks.table1 WHERE p_id = ? AND c_id = ?"

// Each session sends rate-limited reads from its own goroutine.
for (i = 0; i < sessionCount; i++) {
    session = cluster.createSession
    rateLimiter = NewRateLimiter(qpsPerSession)
    newGoRoutine.run( sendReads(session, rateLimiter) )
}

// Issues one read per rate-limiter permit, each in its own goroutine.
sendReads(session, rateLimiter) {
    for {
        newGoRoutine.run(
            if (rateLimiter.allow) {
                session.execute(cqlQuery, randomString, randomString)
            }
        )
    }
}
{code}
Traffic generated this way will result in ~10K coordinator QPS and ~3K client 
connections per Cassandra node.

*Trigger Point:*
Manually issue a CQL query to add a column to the table: "ALTER TABLE 
test_ks.table1 ADD new_col text;"

*Symptom:*
Seconds after the trigger point, one or more Cassandra nodes show the number of 
native_transport threads reaching {{native_transport_max_threads}}, and pending 
native transport tasks grow endlessly.

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> 

[jira] [Comment Edited] (CASSANDRA-17401) Race condition in QueryProcessor causes just prepared statement not to be in the prepared statements cache

2024-01-21 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809211#comment-17809211
 ] 

Paulo Motta edited comment on CASSANDRA-17401 at 1/22/24 1:54 AM:
--

Ok thanks [~chovatia.jayd...@gmail.com]! I'm not familiar with this area but 
will try to look at it if I find cycles in the next few days and nobody beats 
me to it. :)

Btw did you observe a single occurrence of this issue, or is it recurrent?


was (Author: paulo):
Ok thanks [~chovatia.jayd...@gmail.com]! I'm not familiar with this area but 
will try to look at it if I find cycles in the next few days and nobody beats 
me to it. :)

Btw did you just observe a single occurrence of this issue, or is it recurrent?

> Race condition in QueryProcessor causes just prepared statement not to be in 
> the prepared statements cache
> --
>
> Key: CASSANDRA-17401
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17401
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Ivan Senic
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The changes in the 
> [QueryProcessor#prepare|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575-L638]
>  method that were introduced in versions *4.0.2* and *3.11.12* can cause a 
> race condition between two threads trying to concurrently prepare the same 
> statement. This race condition can cause the removal of a prepared statement 
> from the cache, after one of the threads has received the result of the 
> prepare and eventually uses MD5Digest to call 
> [QueryProcessor#getPrepared|https://github.com/apache/cassandra/blame/cassandra-4.0.2/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L212-L215].
> The race condition looks like this:
>  * Thread1 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 executes eviction of hashes
>  * Thread2 enters _prepare_ method and resolves _safeToReturnCached_ as false
>  * Thread1 prepares the statement and caches it
>  * Thread1 returns the result of the prepare
>  * Thread2 executes eviction of hashes
>  * Thread1 tries to execute the prepared statement with the received 
> MD5Digest, but the statement is not in the cache, as it was evicted by Thread2
> I tried to reproduce this by using a Java driver, but hitting this case from 
> the client side is highly unlikely and I cannot simulate the needed race 
> condition. However, we can easily reproduce this in Stargate (details 
> [here|https://github.com/stargate/stargate/pull/1647]), as it's closer to 
> QueryProcessor.
> Reproducing this in a unit test is fairly easy. I am happy to showcase this 
> if needed.
> Note that the issue can occur only when safeToReturnCached is resolved as 
> false.


