[jira] [Updated] (CASSANDRA-19292) Update target Cassandra versions for integration tests, support new 4.0.x and 4.1.x

2024-04-23 Thread Bret McGuire (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bret McGuire updated CASSANDRA-19292:
-
Resolution: Fixed
Status: Resolved  (was: Open)

> Update target Cassandra versions for integration tests, support new 4.0.x and 
> 4.1.x
> ---
>
> Key: CASSANDRA-19292
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19292
> Project: Cassandra
>  Issue Type: Task
>  Components: Client/java-driver
>Reporter: Abe Ratnofsky
>Assignee: Abe Ratnofsky
>Priority: Normal
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, apache/cassandra-java-driver runs against 4.0.0 but not newer 
> 4.0.x or 4.1.x releases: 
> https://github.com/apache/cassandra-java-driver/blob/4.x/core/src/main/java/com/datastax/oss/driver/api/core/Version.java#L54C1-L55C1
> 4.1 introduces changes to config as well, so there are failures to start CCM 
> clusters if we do a naive version bump, like: 
> "org.apache.cassandra.exceptions.ConfigurationException: Config contains both 
> old and new keys for the same configuration parameters, migrate old -> new: 
> [enable_user_defined_functions -> user_defined_functions_enabled]"
> I have a patch ready for this and am working on preparing it.
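For context, a minimal sketch of the kind of version-gated key mapping the CCM bridge needs to avoid the error above (hypothetical helper, not the committed patch; only the enable_user_defined_functions rename is taken from the error message, the second entry is an assumed example):

{code:java}
import java.util.HashMap;
import java.util.Map;
import com.datastax.oss.driver.api.core.Version;

final class ConfigKeyMapper
{
    // Old 4.0-era keys and their 4.1+ replacements.
    private static final Map<String, String> RENAMED = new HashMap<>();
    static
    {
        RENAMED.put("enable_user_defined_functions", "user_defined_functions_enabled");
        RENAMED.put("enable_materialized_views", "materialized_views_enabled"); // assumed example
    }

    // Choose the spelling appropriate to the target version so the generated
    // cassandra.yaml never mixes old and new keys.
    static String configKey(String legacyKey, Version cassandraVersion)
    {
        return cassandraVersion.compareTo(Version.parse("4.1.0")) >= 0
               ? RENAMED.getOrDefault(legacyKey, legacyKey)
               : legacyKey;
    }
}
{code}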



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] CASSANDRA-19292 Enable Jenkins to test against Cassandra 4.1.x [cassandra-java-driver]

2024-04-23 Thread via GitHub


absurdfarce merged PR #1924:
URL: https://github.com/apache/cassandra-java-driver/pull/1924


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





(cassandra-java-driver) branch 4.x updated: CASSANDRA-19292: Enable Jenkins to test against Cassandra 4.1.x

2024-04-23 Thread absurdfarce
This is an automated email from the ASF dual-hosted git repository.

absurdfarce pushed a commit to branch 4.x
in repository https://gitbox.apache.org/repos/asf/cassandra-java-driver.git


The following commit(s) were added to refs/heads/4.x by this push:
 new 1492d6ced CASSANDRA-19292: Enable Jenkins to test against Cassandra 
4.1.x
1492d6ced is described below

commit 1492d6ced9d54bdd68deb043a0bfe232eaa2a8fc
Author: absurdfarce 
AuthorDate: Fri Mar 29 00:46:46 2024 -0500

CASSANDRA-19292: Enable Jenkins to test against Cassandra 4.1.x

patch by Bret McGuire; reviewed by Bret McGuire, Alexandre Dutra for 
CASSANDRA-19292
---
 Jenkinsfile| 20 ---
 .../com/datastax/oss/driver/api/core/Version.java  |  1 +
 .../oss/driver/core/metadata/SchemaIT.java | 13 +
 .../oss/driver/api/testinfra/ccm/CcmBridge.java| 61 --
 4 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/Jenkinsfile b/Jenkinsfile
index 8d2b74c5b..0bfa4ca7f 100644
--- a/Jenkinsfile
+++ b/Jenkinsfile
@@ -256,8 +256,10 @@ pipeline {
   choices: ['2.1',   // Legacy Apache CassandraⓇ
 '2.2',   // Legacy Apache CassandraⓇ
 '3.0',   // Previous Apache CassandraⓇ
-'3.11',  // Current Apache CassandraⓇ
-'4.0',   // Development Apache CassandraⓇ
+'3.11',  // Previous Apache CassandraⓇ
+'4.0',   // Previous Apache CassandraⓇ
+'4.1',   // Current Apache CassandraⓇ
+'5.0',   // Development Apache CassandraⓇ
 'dse-4.8.16',   // Previous EOSL DataStax Enterprise
 'dse-5.0.15',   // Long Term Support DataStax Enterprise
 'dse-5.1.35',   // Legacy DataStax Enterprise
@@ -291,7 +293,11 @@ pipeline {
 
 
   4.0
-  Apache Cassandra v4.x (CURRENTLY UNDER DEVELOPMENT)
+  Apache Cassandra v4.0.x
+
+
+  4.1
+  Apache Cassandra v4.1.x
 
 
   dse-4.8.16
@@ -445,7 +451,7 @@ pipeline {
   axis {
 name 'SERVER_VERSION'
 values '3.11', // Latest stable Apache CassandraⓇ
-   '4.0',  // Development Apache CassandraⓇ
+   '4.1',  // Development Apache CassandraⓇ
'dse-6.8.30' // Current DataStax Enterprise
   }
   axis {
@@ -554,8 +560,10 @@ pipeline {
 name 'SERVER_VERSION'
 values '2.1',   // Legacy Apache CassandraⓇ
'3.0',   // Previous Apache CassandraⓇ
-   '3.11',  // Current Apache CassandraⓇ
-   '4.0',   // Development Apache CassandraⓇ
+   '3.11',  // Previous Apache CassandraⓇ
+   '4.0',   // Previous Apache CassandraⓇ
+   '4.1',   // Current Apache CassandraⓇ
+   '5.0',   // Development Apache CassandraⓇ
'dse-4.8.16',   // Previous EOSL DataStax Enterprise
'dse-5.0.15',   // Last EOSL DataStax Enterprise
'dse-5.1.35',   // Legacy DataStax Enterprise
diff --git a/core/src/main/java/com/datastax/oss/driver/api/core/Version.java b/core/src/main/java/com/datastax/oss/driver/api/core/Version.java
index cc4931fe2..3f12c54fa 100644
--- a/core/src/main/java/com/datastax/oss/driver/api/core/Version.java
+++ b/core/src/main/java/com/datastax/oss/driver/api/core/Version.java
@@ -52,6 +52,7 @@ public class Version implements Comparable<Version>, Serializable {
   @NonNull public static final Version V2_2_0 = Objects.requireNonNull(parse("2.2.0"));
   @NonNull public static final Version V3_0_0 = Objects.requireNonNull(parse("3.0.0"));
   @NonNull public static final Version V4_0_0 = Objects.requireNonNull(parse("4.0.0"));
+  @NonNull public static final Version V4_1_0 = Objects.requireNonNull(parse("4.1.0"));
   @NonNull public static final Version V5_0_0 = Objects.requireNonNull(parse("5.0.0"));
   @NonNull public static final Version V6_7_0 = Objects.requireNonNull(parse("6.7.0"));
   @NonNull public static final Version V6_8_0 = Objects.requireNonNull(parse("6.8.0"));
diff --git a/integration-tests/src/test/java/com/datastax/oss/driver/core/metadata/SchemaIT.java b/integration-tests/src/test/java/com/datastax/oss/driver/core/metadata/SchemaIT.java
index caa96a647..6495b451d 100644
--- a/integration-tests/src/test/java/com/datastax/oss/driver/core/metadata/SchemaIT.java
+++ b/integration-tests/src/test/java/com/datastax/oss/driver/core/metadata/SchemaIT.java
@@ -265,6 +265,19 @@ public class SchemaIT {
   + "total 

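The SchemaIT hunk is cut off above; for reference, a hypothetical fragment showing how the V4_1_0 constant added in the Version.java hunk typically gates version-specific expectations in these tests (an illustration only, not a reconstruction of the truncated hunk):

{code:java}
import com.datastax.oss.driver.api.core.Version;

// Hypothetical helper: gate schema expectations on the backing Cassandra
// version, using the V4_1_0 constant introduced by this commit.
static boolean isAtLeast41(Version cassandraVersion)
{
    return cassandraVersion != null && cassandraVersion.compareTo(Version.V4_1_0) >= 0;
}
{code}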
[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status

2024-04-23 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840266#comment-17840266
 ] 

Cameron Zemek commented on CASSANDRA-19580:
---

> If you have internode_compression=dc then replacement with the same IP will 
> not work; you need to use a different IP, because the compression has already 
> been negotiated on the other nodes.

Not to get too far off topic from the issue at hand, but I am able to do a 
replacement with the same IP with internode compression enabled. So what 
doesn't work about this?

> Unable to contact any seeds with node in hibernate status
> -
>
> Key: CASSANDRA-19580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Cameron Zemek
>Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I 
> have been able to reproduce this issue if I kill Cassandra as it's joining, 
> which puts the node into hibernate status. Once a node is in hibernate it 
> will no longer receive any SYN messages from other nodes during startup, and 
> as it sends only itself as a digest in outbound SYN messages, it never 
> receives any states in any of the ACK replies. So once it gets to the 
> `seenAnySeed` check, it fails because the endpointStateMap is empty.
>  
> A workaround is copying the system.peers table from another node, but this 
> is less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
>     /* Possibly gossip to a seed for facilitating partition healing */
>     private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
>     {
>         int size = seeds.size();
>         if (size > 0)
>         {
>             if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>             {
>                 return;
>             }
>             if (liveEndpoints.size() == 0)
>             {
>                 List<GossipDigest> gDigests = prod.payload.gDigests;
>                 if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>                 {
>                     gDigests = new ArrayList<GossipDigest>();
>                     GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                            DatabaseDescriptor.getPartitionerName(),
>                                                                            gDigests);
>                     MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                           digestSynMessage,
>                                                                                           GossipDigestSyn.serializer);
>                     sendGossip(message, seeds);
>                 }
>                 else
>                 {
>                     sendGossip(prod, seeds);
>                 }
>             }
>             else
>             {
>                 /* Gossip with the seed with some probability. */
>                 double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
>             }
>         }
>     }
> {code}
> The only problem is that this is the same as the SYN from a shadow round. It 
> does resolve the issue, however, as the node then receives an ACK with all 
> the states.






[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status

2024-04-23 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840267#comment-17840267
 ] 

Brandon Williams commented on CASSANDRA-19580:
--

Set compression to all so there are no special cases and test again.

> Unable to contact any seeds with node in hibernate status
> -
>
> Key: CASSANDRA-19580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Cameron Zemek
>Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I 
> have been able to reproduce this issue if I kill Cassandra as it's joining, 
> which puts the node into hibernate status. Once a node is in hibernate it 
> will no longer receive any SYN messages from other nodes during startup, and 
> as it sends only itself as a digest in outbound SYN messages, it never 
> receives any states in any of the ACK replies. So once it gets to the 
> `seenAnySeed` check, it fails because the endpointStateMap is empty.
>  
> A workaround is copying the system.peers table from another node, but this 
> is less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
>     /* Possibly gossip to a seed for facilitating partition healing */
>     private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
>     {
>         int size = seeds.size();
>         if (size > 0)
>         {
>             if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>             {
>                 return;
>             }
>             if (liveEndpoints.size() == 0)
>             {
>                 List<GossipDigest> gDigests = prod.payload.gDigests;
>                 if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>                 {
>                     gDigests = new ArrayList<GossipDigest>();
>                     GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                            DatabaseDescriptor.getPartitionerName(),
>                                                                            gDigests);
>                     MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                           digestSynMessage,
>                                                                                           GossipDigestSyn.serializer);
>                     sendGossip(message, seeds);
>                 }
>                 else
>                 {
>                     sendGossip(prod, seeds);
>                 }
>             }
>             else
>             {
>                 /* Gossip with the seed with some probability. */
>                 double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
>             }
>         }
>     }
> {code}
> The only problem is that this is the same as the SYN from a shadow round. It 
> does resolve the issue, however, as the node then receives an ACK with all 
> the states.






[jira] [Updated] (CASSANDRA-19585) syntax formatting on CQL doc is garbled

2024-04-23 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-19585:
-
 Bug Category: Parent values: Correctness(12982)
   Complexity: Normal
Discovered By: User Report
 Severity: Normal
   Status: Open  (was: Triage Needed)

> syntax formatting on CQL doc is garbled
> ---
>
> Key: CASSANDRA-19585
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19585
> Project: Cassandra
>  Issue Type: Bug
>  Components: Documentation/Website
>Reporter: Jon Haddad
>Priority: Normal
> Attachments: image-2024-04-23-17-37-54-438.png
>
>
> It looks like the 4.1 docs aren't being processed correctly by the build. 
> Screenshot attached.
> https://cassandra.apache.org/doc/4.1/cassandra/cql/cql_singlefile.html#alterTableStmt
>  !image-2024-04-23-17-37-54-438.png! 






[jira] [Created] (CASSANDRA-19585) syntax formatting on CQL doc is garbled

2024-04-23 Thread Jon Haddad (Jira)
Jon Haddad created CASSANDRA-19585:
--

 Summary: syntax formatting on CQL doc is garbled
 Key: CASSANDRA-19585
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19585
 Project: Cassandra
  Issue Type: Bug
  Components: Documentation/Website
Reporter: Jon Haddad
 Attachments: image-2024-04-23-17-37-54-438.png

It looks like the 4.1 docs aren't being processed correctly by the build. 
Screenshot attached.

https://cassandra.apache.org/doc/4.1/cassandra/cql/cql_singlefile.html#alterTableStmt

 !image-2024-04-23-17-37-54-438.png! 






[jira] [Created] (CASSANDRA-19584) Glossary labeled as DataStax glossary

2024-04-23 Thread Jon Haddad (Jira)
Jon Haddad created CASSANDRA-19584:
--

 Summary: Glossary labeled as DataStax glossary
 Key: CASSANDRA-19584
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19584
 Project: Cassandra
  Issue Type: Bug
  Components: Documentation/Website
Reporter: Jon Haddad


This should be the Cassandra glossary.

https://cassandra.apache.org/_/glossary.html






[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-23 Thread Jon Haddad (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840253#comment-17840253
 ] 

Jon Haddad commented on CASSANDRA-19583:


I found it testing 5.0, but I'm guessing it's been present since whenever we 
updated the config. This is with the new compaction_throughput setting, not 
compaction_throughput_in_mb.
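A reproduction sketch consistent with the stack trace quoted below (an inference from the "Invalid data rate: 0" message, which suggests a bare, unit-less 0 reaching DataRateSpec's parser; not confirmed in the ticket):

{code:java}
import org.apache.cassandra.config.DataRateSpec;

// Inferred reproduction, not confirmed in the ticket: "compaction_throughput: 0"
// in cassandra.yaml arrives as the bare string "0", and the parser requires a
// unit suffix, so startup fails despite the documented "0 disables throttling".
public class CompactionThroughputZeroRepro
{
    public static void main(String[] args)
    {
        new DataRateSpec.LongBytesPerSecondBound("0MiB/s"); // unit-suffixed zero parses
        new DataRateSpec.LongBytesPerSecondBound("0");      // throws: Invalid data rate: 0 ...
    }
}
{code}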

> setting compaction throughput to 0 throws a startup error
> -
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames 
> omitted
> {noformat}
> We should allow 0 as per the inline doc.






[jira] [Commented] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-23 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840251#comment-17840251
 ] 

Brandon Williams commented on CASSANDRA-19583:
--

Which version was this?

> setting compaction throughput to 0 throws a startup error
> -
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames 
> omitted
> {noformat}
> We should allow 0 as per the inline doc.






[jira] [Updated] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-23 Thread Jon Haddad (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jon Haddad updated CASSANDRA-19583:
---
Description: 
The inline docs say:


{noformat}
Setting this to 0 disables throttling.
{noformat}

However, on startup, we throw this error:


{noformat}
Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
{noformat}

We should allow 0 as per the inline doc.

  was:
The inline docs say:


{noformat}
Setting this to 0 disables throttling.
{noformat}

However, on startup, we throw this error:


{noformat}
Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
{noformat}




> setting compaction throughput to 0 throws a startup error
> -
>
> Key: CASSANDRA-19583
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Jon Haddad
>Priority: Normal
>
> The inline docs say:
> {noformat}
> Setting this to 0 disables throttling.
> {noformat}
> However, on startup, we throw this error:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
> units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
> org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
> Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames 
> omitted
> {noformat}
> We should allow 0 as per the inline doc.






[jira] [Created] (CASSANDRA-19583) setting compaction throughput to 0 throws a startup error

2024-04-23 Thread Jon Haddad (Jira)
Jon Haddad created CASSANDRA-19583:
--

 Summary: setting compaction throughput to 0 throws a startup error
 Key: CASSANDRA-19583
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19583
 Project: Cassandra
  Issue Type: Bug
  Components: Local/Config
Reporter: Jon Haddad


The inline docs say:


{noformat}
Setting this to 0 disables throttling.
{noformat}

However, on startup, we throw this error:


{noformat}
Caused by: java.lang.IllegalArgumentException: Invalid data rate: 0 Accepted 
units: MiB/s, KiB/s, B/s where case matters and only non-negative values a>
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:52)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec.<init>(DataRateSpec.java:61)
Apr 23 23:12:01 cassandra0 cassandra[3424]: at 
org.apache.cassandra.config.DataRateSpec$LongBytesPerSecondBound.<init>(DataRateSpec.java:232)
Apr 23 23:12:01 cassandra0 cassandra[3424]: ... 27 common frames omitted
{noformat}








[jira] [Commented] (CASSANDRA-19580) Unable to contact any seeds with node in hibernate status

2024-04-23 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840241#comment-17840241
 ] 

Cameron Zemek commented on CASSANDRA-19580:
---

Yeah, so what breaks if we use the same state as when replacing with a different 
address? I looked through CASSANDRA-8523 and didn't understand what is different 
about replacing when reusing the same IP address. Why isn't the node in UJ state 
when doing replacements, that is, receiving writes but not reads?

What do you think would be the correct fix here? Is sending an empty SYN, like 
the shadow round does, okay? Why does examineGossiper not send back states for 
missing digests (it only compares the digests present in the SYN)?

Considering that SYN messages are sent randomly, it seems like we could also end 
up on this 'Unable to contact any seeds!' path if none of the nodes randomly 
pick the replacement node to send a SYN to.
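On the examineGossiper question, a paraphrased sketch of the method's shape (from memory of the 3.x-era Gossiper, with the surrounding fields and helpers assumed; not the literal source) shows why only digests named in the SYN are ever answered, and why an empty SYN gets everything back:

{code:java}
// Paraphrased sketch of Gossiper.examineGossiper (assumed shape, 3.x era):
void examineGossiper(List<GossipDigest> gDigestList,
                     List<GossipDigest> deltaGossipDigestList,
                     Map<InetAddress, EndpointState> deltaEpStateMap)
{
    if (gDigestList.size() == 0)
    {
        // An empty SYN carries shadow-round semantics: reply with everything
        // we know, which is why the modified maybeGossipToSeed above gets a
        // full ACK back.
        deltaEpStateMap.putAll(endpointStateMap);
        return;
    }
    for (GossipDigest gDigest : gDigestList)
    {
        // Only endpoints named in the SYN are compared; states for endpoints
        // absent from the SYN are never volunteered in the ACK.
        EndpointState localState = endpointStateMap.get(gDigest.getEndpoint());
        if (localState != null)
            sendAll(gDigest, deltaEpStateMap, gDigest.getMaxVersion()); // simplified: real code compares generations/versions
        else
            requestAll(gDigest, deltaGossipDigestList, gDigest.getGeneration());
    }
}
{code}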

> Unable to contact any seeds with node in hibernate status
> -
>
> Key: CASSANDRA-19580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Cameron Zemek
>Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I 
> have been able to reproduce this issue if I kill Cassandra as it's joining, 
> which puts the node into hibernate status. Once a node is in hibernate it 
> will no longer receive any SYN messages from other nodes during startup, and 
> as it sends only itself as a digest in outbound SYN messages, it never 
> receives any states in any of the ACK replies. So once it gets to the 
> `seenAnySeed` check, it fails because the endpointStateMap is empty.
>  
> A workaround is copying the system.peers table from another node, but this 
> is less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
>     /* Possibly gossip to a seed for facilitating partition healing */
>     private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
>     {
>         int size = seeds.size();
>         if (size > 0)
>         {
>             if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>             {
>                 return;
>             }
>             if (liveEndpoints.size() == 0)
>             {
>                 List<GossipDigest> gDigests = prod.payload.gDigests;
>                 if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>                 {
>                     gDigests = new ArrayList<GossipDigest>();
>                     GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                            DatabaseDescriptor.getPartitionerName(),
>                                                                            gDigests);
>                     MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                           digestSynMessage,
>                                                                                           GossipDigestSyn.serializer);
>                     sendGossip(message, seeds);
>                 }
>                 else
>                 {
>                     sendGossip(prod, seeds);
>                 }
>             }
>             else
>             {
>                 /* Gossip with the seed with some probability. */
>                 double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
>             }
>         }
>     }
> {code}
> The only problem is that this is the same as the SYN from a shadow round. It 
> does resolve the issue, however, as the node then receives an ACK with all 
> the states.






[jira] [Commented] (CASSANDRA-15439) Token metadata for bootstrapping nodes is lost under temporary failures

2024-04-23 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840228#comment-17840228
 ] 

Brandon Williams commented on CASSANDRA-15439:
--

With the failed bootstrap timeout separated out, we could take this opportunity 
to also increase it by default, giving users some protection from the scenario 
you ran into and also aiding resumable bootstrap.  WDYT? /cc [~paulo]

> Token metadata for bootstrapping nodes is lost under temporary failures
> ---
>
> Key: CASSANDRA-15439
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15439
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Membership
>Reporter: Josh Snyder
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In CASSANDRA-8838, [~pauloricardomg] asked "hints will not be stored to the 
> bootstrapping node after RING_DELAY, since it will evicted from the TMD 
> pending ranges. Should we create a ticket to address this?"
> CASSANDRA-15264 relates to the most likely cause of such situations, where 
> the Cassandra daemon on the bootstrapping node completely crashes. Based on 
> testing with {{kill -STOP}} on a bootstrapping Cassandra JVM, I believe it 
> also is possible to remove token metadata (and thus pending ranges, and thus 
> hints) for a bootstrapping node, simply by affecting its status in the 
> failure detector. 
> A node in the cluster sees the bootstrapping node this way:
> {noformat}
> INFO  [GossipStage:1] 2019-11-27 20:41:41,101 Gossiper.java: - Node 
> /PUBLIC-IP is now part of the cluster
> INFO  [GossipStage:1] 2019-11-27 20:41:41,199 Gossiper.java:1073 - 
> InetAddress /PUBLIC-IP is now UP
> INFO  [HANDSHAKE-/PRIVATE-IP] 2019-11-27 20:41:41,412 
> OutboundTcpConnection.java:565 - Handshaking version with /PRIVATE-IP
> INFO  [STREAM-INIT-/PRIVATE-IP:21233] 2019-11-27 20:42:10,019 
> StreamResultFuture.java:112 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4 
> ID#0] Creating new streaming plan for Bootstrap
> INFO  [STREAM-INIT-/PRIVATE-IP:21233] 2019-11-27 20:42:10,020 
> StreamResultFuture.java:119 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4, 
> ID#0] Received streaming plan for Bootstrap
> INFO  [STREAM-INIT-/PRIVATE-IP:56003] 2019-11-27 20:42:10,112 
> StreamResultFuture.java:119 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4, 
> ID#0] Received streaming plan for Bootstrap
> INFO  [STREAM-IN-/PUBLIC-IP] 2019-11-27 20:42:10,179 
> StreamResultFuture.java:169 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4 
> ID#0] Prepare completed. Receiving 0 files(0 bytes), sending 833 
> files(139744616815 bytes)
> INFO  [GossipStage:1] 2019-11-27 20:54:47,547 Gossiper.java:1089 - 
> InetAddress /PUBLIC-IP is now DOWN
> INFO  [GossipTasks:1] 2019-11-27 20:54:57,551 Gossiper.java:849 - FatClient 
> /PUBLIC-IP has been silent for 3ms, removing from gossip
> {noformat}
> Since the bootstrapping node has no tokens, it is treated like a fat client, 
> and it is removed from the ring. For correctness purposes, I believe we must 
> keep storing hints for the downed bootstrapping node until it is either 
> assassinated or a replacement attempts to bootstrap for the same token.
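For context on the eviction described above, a paraphrased sketch of the gossiper's status check (member names assumed from memory of Gossiper.doStatusCheck; not the literal source):

{code:java}
// Paraphrased sketch of the fat-client eviction: a DOWN endpoint that owns no
// tokens is dropped entirely after fatClientTimeout, taking its token
// metadata -- and hence its pending ranges and hints -- with it.
if (!epState.isAlive()
    && !StorageService.instance.getTokenMetadata().isMember(endpoint)
    && (now - epState.getUpdateTimestamp() > fatClientTimeout))
{
    logger.info("FatClient {} has been silent for {}ms, removing from gossip", endpoint, fatClientTimeout);
    removeEndpoint(endpoint);      // quarantines and unregisters the endpoint
    evictFromMembership(endpoint); // clears it from the endpointStateMap
}
{code}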






[jira] [Updated] (CASSANDRA-15439) Token metadata for bootstrapping nodes is lost under temporary failures

2024-04-23 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-15439:
-
Fix Version/s: (was: 3.0.x)
   (was: 3.11.x)
   (was: 5.x)

> Token metadata for bootstrapping nodes is lost under temporary failures
> ---
>
> Key: CASSANDRA-15439
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15439
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Membership
>Reporter: Josh Snyder
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In CASSANDRA-8838, [~pauloricardomg] asked "hints will not be stored to the 
> bootstrapping node after RING_DELAY, since it will evicted from the TMD 
> pending ranges. Should we create a ticket to address this?"
> CASSANDRA-15264 relates to the most likely cause of such situations, where 
> the Cassandra daemon on the bootstrapping node completely crashes. Based on 
> testing with {{kill -STOP}} on a bootstrapping Cassandra JVM, I believe it 
> also is possible to remove token metadata (and thus pending ranges, and thus 
> hints) for a bootstrapping node, simply by affecting its status in the 
> failure detector. 
> A node in the cluster sees the bootstrapping node this way:
> {noformat}
> INFO  [GossipStage:1] 2019-11-27 20:41:41,101 Gossiper.java: - Node 
> /PUBLIC-IP is now part of the cluster
> INFO  [GossipStage:1] 2019-11-27 20:41:41,199 Gossiper.java:1073 - 
> InetAddress /PUBLIC-IP is now UP
> INFO  [HANDSHAKE-/PRIVATE-IP] 2019-11-27 20:41:41,412 
> OutboundTcpConnection.java:565 - Handshaking version with /PRIVATE-IP
> INFO  [STREAM-INIT-/PRIVATE-IP:21233] 2019-11-27 20:42:10,019 
> StreamResultFuture.java:112 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4 
> ID#0] Creating new streaming plan for Bootstrap
> INFO  [STREAM-INIT-/PRIVATE-IP:21233] 2019-11-27 20:42:10,020 
> StreamResultFuture.java:119 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4, 
> ID#0] Received streaming plan for Bootstrap
> INFO  [STREAM-INIT-/PRIVATE-IP:56003] 2019-11-27 20:42:10,112 
> StreamResultFuture.java:119 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4, 
> ID#0] Received streaming plan for Bootstrap
> INFO  [STREAM-IN-/PUBLIC-IP] 2019-11-27 20:42:10,179 
> StreamResultFuture.java:169 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4 
> ID#0] Prepare completed. Receiving 0 files(0 bytes), sending 833 
> files(139744616815 bytes)
> INFO  [GossipStage:1] 2019-11-27 20:54:47,547 Gossiper.java:1089 - 
> InetAddress /PUBLIC-IP is now DOWN
> INFO  [GossipTasks:1] 2019-11-27 20:54:57,551 Gossiper.java:849 - FatClient 
> /PUBLIC-IP has been silent for 3ms, removing from gossip
> {noformat}
> Since the bootstrapping node has no tokens, it is treated like a fat client, 
> and it is removed from the ring. For correctness purposes, I believe we must 
> keep storing hints for the downed bootstrapping node until it is either 
> assassinated or a replacement attempts to bootstrap for the same token.






[jira] [Comment Edited] (CASSANDRA-19572) Test failure: org.apache.cassandra.db.ImportTest flakiness

2024-04-23 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840220#comment-17840220
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19572 at 4/23/24 8:28 PM:


OK, so, more digging ... I tried putting "SSTableReader.resetTidying();" into 
each afterTest and it did help; below is each job with 5k repetitions.

4.0 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4223/workflows/a82d0483-a0df-44ed-8127-088b303c78ba/jobs/225432/steps
4.1 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4224/workflows/eae7a5e2-89dd-46cd-aaca-1e4250d0fa8b/jobs/225531/steps
5.0 j11 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4226/workflows/9805ec75-fd02-4c5a-8996-fa5bce71e8c2/jobs/225728/steps
5.0 j17 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4226/workflows/9805ec75-fd02-4c5a-8996-fa5bce71e8c2/jobs/225727/steps

However, I just noticed that there is already an afterTest in CQLTester, which 
ImportTest extends, and I was _not_ calling it (super.afterTest()) in my 
afterTest. What CQLTester's afterTest does is this (1): it removes the tables 
and deletes all SSTables on disk, so I guess it also triggers the tidying, just 
by other means, but that whole operation runs in 
ScheduledExecutors.optionalTasks, which is asynchronous.

So what happens when we run a test method, afterTest is invoked, and the 
removal is done asynchronously? JUnit does not wait until it is finished, 
right? I think this work might then leak beyond the scope of afterTest while a 
new test is already running, etc. ... I feel uneasy about this, and that is 
probably the real cause of the issues we see with these refs.

What I am doing right now is tidying up before calling super.afterTest, and I 
am running multiplex on 4.0 again. If that fails, I guess the next step will be 
to run the logic in afterTest synchronously.

(1) 
https://github.com/apache/cassandra/blob/cassandra-4.1/test/unit/org/apache/cassandra/cql3/CQLTester.java#L417-L433
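If it comes to that last step, a minimal sketch of a synchronous afterTest (hypothetical test code, assuming the asynchronous cleanup is the ScheduledExecutors.optionalTasks work described above; not a committed fix):

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.cassandra.concurrent.ScheduledExecutors;
import org.apache.cassandra.io.sstable.format.SSTableReader;
import org.junit.After;

// Hypothetical override in ImportTest: let CQLTester schedule its cleanup,
// then block until optionalTasks has drained so nothing leaks into the next
// test method.
@After
@Override
public void afterTest() throws Throwable
{
    super.afterTest();
    // A no-op submitted now only runs after the previously scheduled cleanup.
    ScheduledExecutors.optionalTasks.submit(() -> {}).get(1, TimeUnit.MINUTES);
    SSTableReader.resetTidying();
}
{code}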


was (Author: smiklosovic):
OK so more digging ... I was trying to put into each afterTest 
"SSTableReader.resetTidying();" and it did help, below is each job with 5k 
repetitions.

4.0 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4223/workflows/a82d0483-a0df-44ed-8127-088b303c78ba/jobs/225432/steps
4.1 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4224/workflows/eae7a5e2-89dd-46cd-aaca-1e4250d0fa8b/jobs/225531/steps
5.0 j11 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4226/workflows/9805ec75-fd02-4c5a-8996-fa5bce71e8c2/jobs/225728/steps
5.0 j17 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4226/workflows/9805ec75-fd02-4c5a-8996-fa5bce71e8c2/jobs/225727/steps

However, I just noticed that there is already afterTest in CQLTester which 
ImportTest extends and I was _not_ calling it (super.afterTest()) in my 
afterTest. What CQLTester's afterTest does is this (1). It removes the tables 
and it deletes all SSTables on the disk, so I guess it also calls tidying, just 
by other means, but that whole operation runs in 
ScheduledExecutors.optionalTasks which is asynchronous.

So, what happens, when we run a test method, then afterTest is invoked and 
removal is done asynchronously? Then JUnit does not wait until is is finished, 
right? I think that this work then might leak beyond the scope of afterTest and 
a new test is run etc ... I feel uneasy about this and that is probably the 
real cause of the issues we see when it comes to these refs. 

What I am doing right now is that I am tidying it up before calling 
super.afterTest and I run multiplex on 4.0 again. If it fails, I guess the next 
step will be to run the logic in afterTest synchronously.

(1) 
https://github.com/apache/cassandra/blob/cassandra-4.1/test/unit/org/apache/cassandra/cql3/CQLTester.java#L417-L433

> Test failure: org.apache.cassandra.db.ImportTest flakiness
> --
>
> Key: CASSANDRA-19572
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19572
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Brandon Williams
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
> the following:
>  * testImportCorruptWithoutValidationWithCopying
>  * testImportInvalidateCache
>  * testImportCorruptWithCopying
>  * testImportCacheEnabledWithoutSrcDir
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]












[jira] [Updated] (CASSANDRA-18130) Log hardware and container params during test runs to help troubleshoot intermittent failures

2024-04-23 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-18130:
---
Fix Version/s: 2.2.20
   3.0.31
   3.11.18
   4.0.13
   4.1.5
   5.0-beta2
   5.0
   5.1
   (was: 5.x)
   (was: 4.0.x)
   (was: 4.1.x)

> Log hardware and container params during test runs to help troubleshoot 
> intermittent failures
> -
>
> Key: CASSANDRA-18130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18130
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java, Test/dtest/python, Test/unit
>Reporter: Josh McKenzie
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 2.2.20, 3.0.31, 3.11.18, 4.0.13, 4.1.5, 5.0-beta2, 5.0, 
> 5.1
>
>
> We’ve long had flakiness in our containerized ASF CI environment that we 
> don’t see in circleci. The environment itself is both containerized and 
> heterogeneous, so there are differences in both the hardware environment and 
> the software environment in which it executes. For reference, see: 
> [https://github.com/apache/cassandra-builds/blob/trunk/ASF-jenkins-agents.md#current-agents]
>  
> We should log a variety of hardware, container, and software environment 
> details to help get to the bottom of where some test failures may be 
> occurring. As we don’t have shell access to the machines, it’ll be easier to 
> have this information logged / retrieved during test runs than to try and 
> profile each host independently.
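A minimal sketch of the kind of fingerprint being asked for, logged at test startup (hypothetical in-tree Java; per the resolution email that follows, the actual change landed in cassandra-builds):

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical startup logging of hardware/container parameters (Linux paths);
// the committed change lives in cassandra-builds scripts rather than in-tree.
public class EnvFingerprint
{
    public static void main(String[] args) throws Exception
    {
        System.out.printf("processors=%d maxHeapMiB=%d%n",
                          Runtime.getRuntime().availableProcessors(),
                          Runtime.getRuntime().maxMemory() >> 20);
        // First line of /proc/meminfo is MemTotal for the host/container.
        System.out.println(Files.readAllLines(Paths.get("/proc/meminfo")).get(0));
        // cgroup v2 memory limit, when running inside a container.
        System.out.println("cgroup memory.max: "
                           + new String(Files.readAllBytes(Paths.get("/sys/fs/cgroup/memory.max"))).trim());
    }
}
{code}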






[jira] [Updated] (CASSANDRA-18130) Log hardware and container params during test runs to help troubleshoot intermittent failures

2024-04-23 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-18130:
---
Resolution: Fixed
Status: Resolved  (was: Open)

Committed as 
https://github.com/apache/cassandra-builds/commit/eab310bd76329be5d47c7a8c4e8837bbb3e2fff0
as part of CASSANDRA-19558

> Log hardware and container params during test runs to help troubleshoot 
> intermittent failures
> -
>
> Key: CASSANDRA-18130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18130
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java, Test/dtest/python, Test/unit
>Reporter: Josh McKenzie
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.x
>
>
> We’ve long had flakiness in our containerized ASF CI environment that we 
> don’t see in circleci. The environment itself is both containerized and 
> heterogeneous, so there are differences in both the hardware environment and 
> the software environment in which it executes. For reference, see: 
> [https://github.com/apache/cassandra-builds/blob/trunk/ASF-jenkins-agents.md#current-agents]
>  
> We should log a variety of hardware, container, and software environment 
> details to help get to the bottom of where some test failures may be 
> occurring. As we don’t have shell access to the machines, it’ll be easier to 
> have this information logged / retrieved during test runs than to try and 
> profile each host independently.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19582) [Analytics] Consume new Sidecar client API to stream SSTables

2024-04-23 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRA-19582:
---
  Fix Version/s: NA
Source Control Link: 
https://github.com/apache/cassandra-analytics/commit/86420f9d52991fb148b322031df55494669532d3
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> [Analytics] Consume new Sidecar client API to stream SSTables
> -
>
> Key: CASSANDRA-19582
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19582
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Analytics Library
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
> Fix For: NA
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A new client API was recently introduced in Sidecar to stream SSTables. 
> Cassandra Analytics needs to start consuming the new API in order to take 
> advantage of the fixes when streaming SSTables from a Cassandra installation 
> with more than one data directory.
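In practice the consuming change (visible in the commit below) replaces a name-plus-size stream API with one that takes the Sidecar's per-file metadata object, so the server side can resolve which data directory holds the component. A stripped-down sketch of that shape, using hypothetical stand-in types rather than the real Sidecar client classes:

{code:java}
import java.io.InputStream;

// Hypothetical stand-in for the Sidecar's per-file metadata; illustrative only.
final class FileInfo
{
    final String fileName;
    final long size;

    FileInfo(String fileName, long size)
    {
        this.fileName = fileName;
        this.size = size;
    }
}

interface SnapshotStreamer
{
    // Old shape: the caller pre-resolves name and size, losing which data
    // directory the component actually lives in.
    InputStream open(String fileName, long size);

    // New shape: pass the whole metadata object and let the client/Sidecar
    // resolve the correct data directory.
    InputStream open(FileInfo file);
}
{code}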



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-analytics) branch trunk updated: CASSANDRA-19582: Consume new Sidecar client API to stream SSTables (#54)

2024-04-23 Thread frankgh
This is an automated email from the ASF dual-hosted git repository.

frankgh pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-analytics.git


The following commit(s) were added to refs/heads/trunk by this push:
 new 86420f9  CASSANDRA-19582: Consume new Sidecar client API to stream 
SSTables (#54)
86420f9 is described below

commit 86420f9d52991fb148b322031df55494669532d3
Author: Francisco Guerrero 
AuthorDate: Tue Apr 23 12:51:53 2024 -0700

CASSANDRA-19582: Consume new Sidecar client API to stream SSTables (#54)


Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRA-19582
---
 .../cassandra/spark/bulkwriter/BulkSparkConf.java  |  7 +++--
 .../spark/data/SidecarProvisionedSSTable.java  | 32 +-
 .../spark/data/SidecarProvisionedSSTableTest.java  |  1 +
 scripts/build-sidecar.sh   |  2 +-
 4 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/BulkSparkConf.java b/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/BulkSparkConf.java
index 2db19d5..8bc9a7f 100644
--- a/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/BulkSparkConf.java
+++ b/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/BulkSparkConf.java
@@ -29,6 +29,7 @@ import java.util.Base64;
 import java.util.Collections;
 import java.util.List;
 import java.util.Map;
+import java.util.Objects;
 import java.util.Optional;
 import java.util.Set;
 import java.util.concurrent.TimeUnit;
@@ -155,7 +156,7 @@ public class BulkSparkConf implements Serializable
         Optional sidecarPortFromOptions = MapUtils.getOptionalInt(options, WriterOptions.SIDECAR_PORT.name(), "sidecar port");
         this.userProvidedSidecarPort = sidecarPortFromOptions.isPresent() ? sidecarPortFromOptions.get() : getOptionalInt(SIDECAR_PORT).orElse(-1);
         this.effectiveSidecarPort = this.userProvidedSidecarPort == -1 ? DEFAULT_SIDECAR_PORT : this.userProvidedSidecarPort;
-        this.sidecarInstancesValue = MapUtils.getOrThrow(options, WriterOptions.SIDECAR_INSTANCES.name(), "sidecar_instances");
+        this.sidecarInstancesValue = MapUtils.getOrDefault(options, WriterOptions.SIDECAR_INSTANCES.name(), null);
         this.sidecarInstances = sidecarInstances();
         this.keyspace = MapUtils.getOrThrow(options, WriterOptions.KEYSPACE.name());
         this.table = MapUtils.getOrThrow(options, WriterOptions.TABLE.name());
@@ -264,7 +265,9 @@ public class BulkSparkConf implements Serializable
 
     protected Set buildSidecarInstances()
     {
-        return Arrays.stream(sidecarInstancesValue.split(","))
+        String[] split = Objects.requireNonNull(sidecarInstancesValue, "Unable to build sidecar instances from null value")
+                                .split(",");
+        return Arrays.stream(split)
                  .map(hostname -> new SidecarInstanceImpl(hostname, effectiveSidecarPort))
                  .collect(Collectors.toSet());
     }
diff --git a/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/data/SidecarProvisionedSSTable.java b/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/data/SidecarProvisionedSSTable.java
index 6e4ff0f..db9e2fd 100644
--- a/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/data/SidecarProvisionedSSTable.java
+++ b/cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/data/SidecarProvisionedSSTable.java
@@ -124,7 +124,7 @@ public class SidecarProvisionedSSTable extends SSTable
     {
         return null;
     }
-        return openStream(snapshotFile.fileName, snapshotFile.size, fileType);
+        return openStream(snapshotFile, fileType);
 }
 
 public long length(FileType fileType)
@@ -144,20 +144,20 @@ public class SidecarProvisionedSSTable extends SSTable
 }
 
     @Nullable
-    private InputStream openStream(String component, long size, FileType fileType)
+    private InputStream openStream(ListSnapshotFilesResponse.FileInfo snapshotFile, FileType fileType)
     {
-        if (component == null)
+        if (snapshotFile == null)
        {
            return null;
        }
 
        if (fileType == FileType.COMPRESSION_INFO)
        {
-            String key = String.format("%s/%s/%s/%s/%s", instance.hostname(), keyspace, table, snapshotName, component);
+            String key = String.format("%s/%s/%s/%s/%s", instance.hostname(), keyspace, table, snapshotName, snapshotFile.fileName);
            byte[] bytes;
            try
            {
-                bytes = COMPRESSION_CACHE.get(key, () -> IOUtils.toByteArray(open(component, fileType, size)));
+                bytes = COMPRESSION_CACHE.get(key, () -> IOUtils.toByteArray(open(snapshotFile, fileType)));
            }
            catch 

Re: [PR] CASSANDRA-19582: Consume new Sidecar client API to stream SSTables [cassandra-analytics]

2024-04-23 Thread via GitHub


frankgh merged PR #54:
URL: https://github.com/apache/cassandra-analytics/pull/54


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-19558:
---
Source Control Link: 
https://github.com/apache/cassandra-builds/commit/5a9ba1a1962794a338cecaa7d8e7f23cd0ea09fd
 
https://github.com/apache/cassandra-builds/commit/eab310bd76329be5d47c7a8c4e8837bbb3e2fff0

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-builds) branch trunk updated: Add agent_scripts/ for reporting and cleaning agents in a jenkins installation with permanent agents

2024-04-23 Thread mck
This is an automated email from the ASF dual-hosted git repository.

mck pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-builds.git


The following commit(s) were added to refs/heads/trunk by this push:
 new eab310b  Add agent_scripts/ for reporting and cleaning agents in a 
jenkins installation with permanent agents
eab310b is described below

commit eab310bd76329be5d47c7a8c4e8837bbb3e2fff0
Author: Mick Semb Wever 
AuthorDate: Sat Apr 20 22:10:49 2024 +0200

Add agent_scripts/ for reporting and cleaning agents in a jenkins 
installation with permanent agents

These scripts are not embedded into the in-tree Jenkinsfiles so that they 
can be more easily edited.
(They are infrastructure related, rather than release branch related.)

Also solves CASSANDRA-18130

 patch by Mick Semb Wever; reviewed by Brandon Williams for CASSANDRA-19558
---
 jenkins-dsl/agent_scripts/agent_report.sh |  30 +++
 jenkins-dsl/agent_scripts/docker_agent_cleaner.sh |  52 +
 jenkins-dsl/agent_scripts/docker_image_pruner.py  |  62 +
 jenkins-dsl/cassandra_job_dsl_seed.groovy | 264 +-
 jenkins-dsl/cassandra_pipeline.groovy |   2 +-
 5 files changed, 257 insertions(+), 153 deletions(-)

diff --git a/jenkins-dsl/agent_scripts/agent_report.sh b/jenkins-dsl/agent_scripts/agent_report.sh
new file mode 100644
index 000..d190b4f
--- /dev/null
+++ b/jenkins-dsl/agent_scripts/agent_report.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Report agent hardware and storage
+#
+# CASSANDRA-18130
+
+echo ""
+echo $(date)
+echo "${JOB_NAME} ${BUILD_NUMBER} ${STAGE_NAME}"
+echo
+du -xmh / 2>/dev/null | sort -rh | head -n 30
+echo
+df -h
+echo
+docker system df -v
\ No newline at end of file
diff --git a/jenkins-dsl/agent_scripts/docker_agent_cleaner.sh b/jenkins-dsl/agent_scripts/docker_agent_cleaner.sh
new file mode 100644
index 000..1da2259
--- /dev/null
+++ b/jenkins-dsl/agent_scripts/docker_agent_cleaner.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Cleans jenkins agents. Primarily used by ci-cassandra.a.o
+#
+# First argument is `maxJobHours`; all docker objects older than this are pruned
+#
+# Assumes a CI running multiple C* branches and other jobs
+
+
+# pre-conditions
+command -v docker >/dev/null 2>&1 || { error 1 "docker needs to be installed"; }
+command -v virtualenv >/dev/null 2>&1 || { error 1 "virtualenv needs to be installed"; }
+(docker info >/dev/null 2>&1) || { error 1 "docker needs to be running"; }
+[ -f "./docker_image_pruner.py" ] || { error 1 "./docker_image_pruner.py must exist"; }
+
+# arguments
+maxJobHours=12
+[ "$#" -gt 0 ] && maxJobHours=$1
+
+error() {
+echo >&2 $2;
+set -x
+exit $1
+}
+
+echo -n "docker system prune --all --force --filter \"until=${maxJobHours}h\" 
: "
+docker system prune --all --force --filter "until=${maxJobHours}h"
+if !( pgrep -xa docker &> /dev/null || pgrep -af "build/docker" &> /dev/null 
|| pgrep -af "cassandra-builds/build-scripts" &> /dev/null ) ; then
+echo -n "docker system prune --force : "
+docker system prune --force || true ;
+fi;
+
+virtualenv -p python3 -q .venv
+source .venv/bin/activate
+pip -q install requests
+python docker_image_pruner.py
+deactivate
\ No newline at end of file
diff --git a/jenkins-dsl/agent_scripts/docker_image_pruner.py 

[jira] [Comment Edited] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839287#comment-17839287
 ] 

Michael Semb Wever edited comment on CASSANDRA-19558 at 4/23/24 7:31 PM:
-

A number of issues have been identified.
– need to capture the generateTestReports logs, 
– fix ant version in centos7-build.docker
– remove docker login (if credentials exist then docker is logged in)
– prefetch docker images from jfrog (to reduce dockerhub pull rate limits)
– use scripts from cassandra-builds to clean and report on agents
– stream xz where possible (not an issue, just perf improvement)

The docker pull rate limit was the most serious, blocking.

patches…

1.
this is part of CASSANDRA-18594 (most of it is already running (manually 
deployed) to ci-cassandra)
[https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/18594]
( [https://github.com/apache/cassandra-builds/commit/eb3eb5e] )

2.
and for CASSANDRA-19558
[https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/19558]
( 
[https://github.com/thelastpickle/cassandra-builds/commit/9f8ae9dcacd0d744992cc4eaf50f29a8836ffdbd]
 )

3.

and then (also for CASSANDRA-19558 ) but comes last (i can test it manually 
once (2) is committed)
[https://github.com/apache/cassandra/commit/92c0cb7]
(this is on top of the previous commit (that already passed review) in 
[https://github.com/apache/cassandra/compare/apache:cassandra:cassandra-5.0...thelastpickle:cassandra:mck/jenkinsfile-persist-parameters-5.0|https://github.com/apache/cassandra/compare/apache:cassandra:cassandra-5.0...thelastpickle:cassandra:mck/jenkinsfile-persist-parameters-5.0]
 )


was (Author: michaelsembwever):
A number of issues have been identified.
– need to capture the generateTestReports logs, 
– fix ant version in centos7-build.docker
– remove docker login (if credentials exist then docker is logged in)
– prefetch docker images from jfrog (to reduce dockerhub pull rate limits)
– use scripts from cassandra-builds to clean and report on agents
– stream xz where possible (not an issue, just perf improvement)

The docker pull rate limit was the most serious, blocking.

patches…

1.
this is part of CASSANDRA-18594 (most of it is already running (manually 
deployed) to ci-cassandra)
[https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/18594]
( [https://github.com/apache/cassandra-builds/commit/eb3eb5e] )

2.
and for CASSANDRA-19558
[https://github.com/thelastpickle/cassandra-builds/compare/mck/18594...thelastpickle:cassandra-builds:mck/19558]
( 
[https://github.com/thelastpickle/cassandra-builds/commit/9f8ae9dcacd0d744992cc4eaf50f29a8836ffdbd]
 )

3.

and then (also for CASSANDRA-19558 ) but comes last (i can test it manually 
once (2) is committed)
[https://github.com/apache/cassandra/commit/92c0cb7]
(this is on top of the previous commit (that already passed review) in 
[https://github.com/apache/cassandra/compare/apache:cassandra:cassandra-5.0...thelastpickle:cassandra:mck/jenkinsfile-persist-parameters-5.0|https://github.com/apache/cassandra/compare/apache:cassandra:cassandra-5.0...thelastpickle:cassandra:mck/jenkinsfile-persist-parameters-5.0]
 )

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies

[jira] [Comment Edited] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839287#comment-17839287
 ] 

Michael Semb Wever edited comment on CASSANDRA-19558 at 4/23/24 7:26 PM:
-

A number of issues have been identified.
– need to capture the generateTestReports logs, 
– fix ant version in centos7-build.docker
– remove docker login (if credentials exist then docker is logged in)
– prefetch docker images from jfrog (to reduce dockerhub pull rate limits)
– use scripts from cassandra-builds to clean and report on agents
– stream xz where possible (not an issue, just perf improvement)

The docker pull rate limit was the most serious, blocking.

patches…

1.
this is part of CASSANDRA-18594 (most of it is already running (manually 
deployed) to ci-cassandra)
[https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/18594]
( [https://github.com/apache/cassandra-builds/commit/eb3eb5e] )

2.
and for CASSANDRA-19558
[https://github.com/thelastpickle/cassandra-builds/compare/mck/18594...thelastpickle:cassandra-builds:mck/19558]
( 
[https://github.com/thelastpickle/cassandra-builds/commit/9f8ae9dcacd0d744992cc4eaf50f29a8836ffdbd]
 )

3.

and then (also for CASSANDRA-19558 ) but comes last (i can test it manually 
once (2) is committed)
[https://github.com/apache/cassandra/commit/92c0cb7]
(this is on top of the previous commit (that already passed review) in 
[https://github.com/apache/cassandra/compare/apache:cassandra:cassandra-5.0...thelastpickle:cassandra:mck/jenkinsfile-persist-parameters-5.0|https://github.com/apache/cassandra/compare/apache:cassandra:cassandra-5.0...thelastpickle:cassandra:mck/jenkinsfile-persist-parameters-5.0]
 )


was (Author: michaelsembwever):
A number of issues have been identified.
– need to capture the generateTestReports logs, 
– fix ant version in centos7-build.docker
– remove docker login (if credentials exist then docker is logged in)
– prefetch docker images from jfrog (to reduce dockerhub pull rate limits)
– use scripts from cassandra-builds to clean and report on agents
– stream xz where possible (not an issue, just perf improvement)

The docker pull rate limit was the most serious, blocking.

patches…

1.
this is part of CASSANDRA-18594 (most of it is already running (manually 
deployed) to ci-cassandra)
[https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/18594]
( [https://github.com/apache/cassandra-builds/commit/eb3eb5e] )

2.
and for CASSANDRA-19558
[https://github.com/thelastpickle/cassandra-builds/compare/mck/18594...thelastpickle:cassandra-builds:mck/19558]
( 
[https://github.com/thelastpickle/cassandra-builds/commit/9f8ae9dcacd0d744992cc4eaf50f29a8836ffdbd]
 )

3.

and then (also for CASSANDRA-19558 ) but comes last (i can test it manually 
once (2) is committed)
[https://github.com/apache/cassandra/commit/b3e1e40658e99c37f2a142e247ca4305fcc52eb0]
(this is on top of the previous commit (that already passed review) in 
[https://github.com/apache/cassandra/compare/apache:cassandra:cassandra-5.0...thelastpickle:cassandra:mck/mck/jenkinsfile-persist-parameters-5.0-test]
 )

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies

[jira] [Updated] (CASSANDRA-19567) Minimize the heap consumption when registering metrics

2024-04-23 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19567:

Test and Documentation Plan: n/a
 Status: Patch Available  (was: In Progress)

> Minimize the heap consumption when registering metrics
> --
>
> Key: CASSANDRA-19567
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19567
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Normal
> Fix For: 5.x
>
> Attachments: summary.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The problem is only reproducible on x86 machines; it is not reproducible on 
> arm64. A quick analysis showed a lot of MetricName objects stored in the 
> heap; although the real cause could be related to something else, the 
> MetricName object requires extra attention.
> To reproduce, run the following command locally:
> {code}
> ant test-jvm-dtest-some 
> -Dtest.name=org.apache.cassandra.distributed.test.ReadRepairTest
> {code}
> The error:
> {code:java}
> [junit-timeout] Exception in thread "main" java.lang.OutOfMemoryError: Java 
> heap space
> [junit-timeout]     at 
> java.base/java.lang.StringLatin1.newString(StringLatin1.java:769)
> [junit-timeout]     at 
> java.base/java.lang.StringBuffer.toString(StringBuffer.java:716)
> [junit-timeout]     at 
> org.apache.cassandra.CassandraBriefJUnitResultFormatter.endTestSuite(CassandraBriefJUnitResultFormatter.java:191)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.fireEndTestSuite(JUnitTestRunner.java:854)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:578)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1197)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:1042)
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
>  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec
> [junit-timeout] 
> [junit-timeout] Testcase: 
> org.apache.cassandra.distributed.test.ReadRepairTest:readRepairRTRangeMovementTest-cassandra.testtag_IS_UNDEFINED:
>     Caused an ERROR
> [junit-timeout] Forked Java VM exited abnormally. Please note the time in the 
> report does not reflect the time until the VM exit.
> [junit-timeout] junit.framework.AssertionFailedError: Forked Java VM exited 
> abnormally. Please note the time in the report does not reflect the time 
> until the VM exit.
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Test org.apache.cassandra.distributed.test.ReadRepairTest 
> FAILED (crashed)BUILD FAILED
>  {code}
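If duplicated metric-name objects are indeed the culprit, one common mitigation is to intern each distinct name so it is built and stored once. A minimal sketch of the idea, assuming a simple string-keyed pool; the types here are illustrative, not Cassandra's actual MetricName plumbing:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: repeated registrations of the same (group, type, scope, name) tuple
// reuse one canonical string instead of allocating a fresh formatted name each time.
final class MetricNameInterner
{
    private static final Map<String, String> POOL = new ConcurrentHashMap<>();

    static String canonical(String group, String type, String scope, String name)
    {
        String key = group + '.' + type + '.' + scope + '.' + name;
        // computeIfAbsent hands back the single shared instance for this key.
        return POOL.computeIfAbsent(key, k -> k);
    }
}
{code}

The potential saving would come from per-table and per-keyspace metrics, which otherwise rebuild near-identical name strings on every registration.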



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-23 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840204#comment-17840204
 ] 

Alex Petrov commented on CASSANDRA-19221:
-

Addressed your comments [~samt], both failures are timeouts that are unrelated 
to the patch. I believe we should split the {{MetadataChangeSimulationTest}} 
since after adding transient tests it seems to sometimes cross the timeout 
deadline.

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and IP addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> In previous tests of this I created a table with a replication factor of 
> 1 and inserted some data before the swap. After the swap, the data on nodes 2 
> and 3 is missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19567) Minimize the heap consumption when registering metrics

2024-04-23 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840203#comment-17840203
 ] 

Caleb Rackliffe commented on CASSANDRA-19567:
-

That looks promising. Reviewing shortly...

> Minimize the heap consumption when registering metrics
> --
>
> Key: CASSANDRA-19567
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19567
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Normal
> Fix For: 5.x
>
> Attachments: summary.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The problem is only reproducible on x86 machines; it is not reproducible on 
> arm64. A quick analysis showed a lot of MetricName objects stored in the 
> heap; although the real cause could be related to something else, the 
> MetricName object requires extra attention.
> To reproduce, run the following command locally:
> {code}
> ant test-jvm-dtest-some 
> -Dtest.name=org.apache.cassandra.distributed.test.ReadRepairTest
> {code}
> The error:
> {code:java}
> [junit-timeout] Exception in thread "main" java.lang.OutOfMemoryError: Java 
> heap space
> [junit-timeout]     at 
> java.base/java.lang.StringLatin1.newString(StringLatin1.java:769)
> [junit-timeout]     at 
> java.base/java.lang.StringBuffer.toString(StringBuffer.java:716)
> [junit-timeout]     at 
> org.apache.cassandra.CassandraBriefJUnitResultFormatter.endTestSuite(CassandraBriefJUnitResultFormatter.java:191)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.fireEndTestSuite(JUnitTestRunner.java:854)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:578)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1197)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:1042)
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
>  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec
> [junit-timeout] 
> [junit-timeout] Testcase: 
> org.apache.cassandra.distributed.test.ReadRepairTest:readRepairRTRangeMovementTest-cassandra.testtag_IS_UNDEFINED:
>     Caused an ERROR
> [junit-timeout] Forked Java VM exited abnormally. Please note the time in the 
> report does not reflect the time until the VM exit.
> [junit-timeout] junit.framework.AssertionFailedError: Forked Java VM exited 
> abnormally. Please note the time in the report does not reflect the time 
> until the VM exit.
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Test org.apache.cassandra.distributed.test.ReadRepairTest 
> FAILED (crashed)BUILD FAILED
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19567) Minimize the heap consumption when registering metrics

2024-04-23 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19567:

Reviewers: Caleb Rackliffe, Caleb Rackliffe  (was: Caleb Rackliffe)
   Caleb Rackliffe, Caleb Rackliffe  (was: Caleb Rackliffe)
   Status: Review In Progress  (was: Patch Available)

> Minimize the heap consumption when registering metrics
> --
>
> Key: CASSANDRA-19567
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19567
> Project: Cassandra
>  Issue Type: Bug
>  Components: Observability/Metrics
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Normal
> Fix For: 5.x
>
> Attachments: summary.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The problem is only reproducible on x86 machines; it is not reproducible on 
> arm64. A quick analysis showed a lot of MetricName objects stored in the 
> heap; although the real cause could be related to something else, the 
> MetricName object requires extra attention.
> To reproduce, run the following command locally:
> {code}
> ant test-jvm-dtest-some 
> -Dtest.name=org.apache.cassandra.distributed.test.ReadRepairTest
> {code}
> The error:
> {code:java}
> [junit-timeout] Exception in thread "main" java.lang.OutOfMemoryError: Java 
> heap space
> [junit-timeout]     at 
> java.base/java.lang.StringLatin1.newString(StringLatin1.java:769)
> [junit-timeout]     at 
> java.base/java.lang.StringBuffer.toString(StringBuffer.java:716)
> [junit-timeout]     at 
> org.apache.cassandra.CassandraBriefJUnitResultFormatter.endTestSuite(CassandraBriefJUnitResultFormatter.java:191)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.fireEndTestSuite(JUnitTestRunner.java:854)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:578)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:1197)
> [junit-timeout]     at 
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:1042)
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
> [junit-timeout] Testsuite: 
> org.apache.cassandra.distributed.test.ReadRepairTest-cassandra.testtag_IS_UNDEFINED
>  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0 sec
> [junit-timeout] 
> [junit-timeout] Testcase: 
> org.apache.cassandra.distributed.test.ReadRepairTest:readRepairRTRangeMovementTest-cassandra.testtag_IS_UNDEFINED:
>     Caused an ERROR
> [junit-timeout] Forked Java VM exited abnormally. Please note the time in the 
> report does not reflect the time until the VM exit.
> [junit-timeout] junit.framework.AssertionFailedError: Forked Java VM exited 
> abnormally. Please note the time in the report does not reflect the time 
> until the VM exit.
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at java.base/java.util.Vector.forEach(Vector.java:1365)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout]     at 
> jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> [junit-timeout]     at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Test org.apache.cassandra.distributed.test.ReadRepairTest 
> FAILED (crashed)BUILD FAILED
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-23 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19221:

Attachment: ci_summary-1.html

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and IP addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> In previous tests of this I created a table with a replication factor of 
> 1 and inserted some data before the swap. After the swap, the data on nodes 2 
> and 3 is missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15439) Token metadata for bootstrapping nodes is lost under temporary failures

2024-04-23 Thread Raymond Huffman (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840198#comment-17840198
 ] 

Raymond Huffman commented on CASSANDRA-15439:
-

Here's a PR that implements the additional config option.

https://github.com/apache/cassandra/pull/3270

> Token metadata for bootstrapping nodes is lost under temporary failures
> ---
>
> Key: CASSANDRA-15439
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15439
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Membership
>Reporter: Josh Snyder
>Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.0.x, 5.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In CASSANDRA-8838, [~pauloricardomg] asked "hints will not be stored to the 
> bootstrapping node after RING_DELAY, since it will evicted from the TMD 
> pending ranges. Should we create a ticket to address this?"
> CASSANDRA-15264 relates to the most likely cause of such situations, where 
> the Cassandra daemon on the bootstrapping node completely crashes. Based on 
> testing with {{kill -STOP}} on a bootstrapping Cassandra JVM, I believe it 
> also is possible to remove token metadata (and thus pending ranges, and thus 
> hints) for a bootstrapping node, simply by affecting its status in the 
> failure detector. 
> A node in the cluster sees the bootstrapping node this way:
> {noformat}
> INFO  [GossipStage:1] 2019-11-27 20:41:41,101 Gossiper.java: - Node 
> /PUBLIC-IP is now part of the cluster
> INFO  [GossipStage:1] 2019-11-27 20:41:41,199 Gossiper.java:1073 - 
> InetAddress /PUBLIC-IP is now UP
> INFO  [HANDSHAKE-/PRIVATE-IP] 2019-11-27 20:41:41,412 
> OutboundTcpConnection.java:565 - Handshaking version with /PRIVATE-IP
> INFO  [STREAM-INIT-/PRIVATE-IP:21233] 2019-11-27 20:42:10,019 
> StreamResultFuture.java:112 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4 
> ID#0] Creating new streaming plan for Bootstrap
> INFO  [STREAM-INIT-/PRIVATE-IP:21233] 2019-11-27 20:42:10,020 
> StreamResultFuture.java:119 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4, 
> ID#0] Received streaming plan for Bootstrap
> INFO  [STREAM-INIT-/PRIVATE-IP:56003] 2019-11-27 20:42:10,112 
> StreamResultFuture.java:119 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4, 
> ID#0] Received streaming plan for Bootstrap
> INFO  [STREAM-IN-/PUBLIC-IP] 2019-11-27 20:42:10,179 
> StreamResultFuture.java:169 - [Stream #6219a950-1156-11ea-b45d-4d30364576c4 
> ID#0] Prepare completed. Receiving 0 files(0 bytes), sending 833 
> files(139744616815 bytes)
> INFO  [GossipStage:1] 2019-11-27 20:54:47,547 Gossiper.java:1089 - 
> InetAddress /PUBLIC-IP is now DOWN
> INFO  [GossipTasks:1] 2019-11-27 20:54:57,551 Gossiper.java:849 - FatClient 
> /PUBLIC-IP has been silent for 3ms, removing from gossip
> {noformat}
> Since the bootstrapping node has no tokens, it is treated like a fat client, 
> and it is removed from the ring. For correctness purposes, I believe we must 
> keep storing hints for the downed bootstrapping node until it is either 
> assassinated or until a replacement attempts to bootstrap for the same token.
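The direction suggested above (and in the PR linked earlier in this thread) amounts to not treating a silent, token-less bootstrapping node as an ordinary fat client when deciding whether to evict it from gossip. A hedged sketch of such a guard; every name here is hypothetical and does not match Cassandra's Gossiper internals:

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard: keep token metadata (and therefore pending ranges and
// hints) for nodes known to be mid-bootstrap, even when they go silent.
final class FatClientEviction
{
    private final Set<String> bootstrapping = ConcurrentHashMap.newKeySet();

    void markBootstrapping(String endpoint)
    {
        bootstrapping.add(endpoint);
    }

    void bootstrapFinished(String endpoint)
    {
        bootstrapping.remove(endpoint);
    }

    // Called from the periodic gossip status check for silent, token-less nodes.
    boolean mayEvict(String endpoint, long silentMillis, long quarantineMillis)
    {
        if (bootstrapping.contains(endpoint))
            return false; // still bootstrapping: keep storing hints for it
        return silentMillis > quarantineMillis;
    }
}
{code}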



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19564) MemtablePostFlush deadlock leads to stuck nodes and crashes

2024-04-23 Thread Jon Haddad (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840197#comment-17840197
 ] 

Jon Haddad edited comment on CASSANDRA-19564 at 4/23/24 6:53 PM:
-

I've rolled out a small patch 
(https://github.com/rustyrazorblade/cassandra/tree/jhaddad/4.1.4-extra-logging) 
that adds a monitoring thread to each flush and have found that in every case I 
see stacktraces related to the filesystem, which is ZFS:

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0(Native 
Method)
at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:156)
at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:236)
at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:306)
...

Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

and 

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0(Native 
Method)
at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:332)
at 
java.base@11.0.22/sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:72)
at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:232)
at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
...

Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

I have several dozen of these and am finding in every case we're either at 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0}} or 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0}}

I'm moving this cluster off ZFS and onto XFS; if I find the issue goes away, 
I'll close this out. I don't think there's anything we can do about unreliable 
filesystems other than improving our error reporting around it.



was (Author: rustyrazorblade):
I've rolled out a small patch 
(https://github.com/rustyrazorblade/cassandra/tree/jhaddad/4.1.4-extra-logging) 
that adds a monitoring thread to each flush and have found that in every case I 
see stacktraces related to the filesystem (ZFS), such as this:

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0(Native 
Method)
at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:156)
at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:236)
at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:306)
...

Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

and 

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0(Native 
Method)
at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:332)
at 
java.base@11.0.22/sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:72)
at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:232)
at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
...

Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

I have several dozen of these and am finding in every case we're either at 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0}} or 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0}}

[jira] [Commented] (CASSANDRA-19564) MemtablePostFlush deadlock leads to stuck nodes and crashes

2024-04-23 Thread Jon Haddad (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840197#comment-17840197
 ] 

Jon Haddad commented on CASSANDRA-19564:


I've rolled out a small patch 
(https://github.com/rustyrazorblade/cassandra/tree/jhaddad/4.1.4-extra-logging) 
that adds a monitoring thread to each flush and have found that in every case I 
see stacktraces related to the filesystem (ZFS), such as this:

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0(Native 
Method)
at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:156)
at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:236)
at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:306)
...

Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

and 

{noformat}
"MemtablePostFlush:1" daemon prio=5 Id=429 RUNNABLE
at java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0(Native 
Method)
at 
java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:332)
at 
java.base@11.0.22/sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:72)
at 
java.base@11.0.22/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:232)
at 
java.base@11.0.22/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
at java.base@11.0.22/java.nio.file.Files.delete(Files.java:1142)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:252)
at 
app//org.apache.cassandra.io.util.PathUtils.delete(PathUtils.java:299)
...

Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@1fc5251a
{noformat}

I have several dozen of these and am finding in every case we're either at 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.unlink0}} or 
{{java.base@11.0.22/sun.nio.fs.UnixNativeDispatcher.lstat0}}

I'm moving this cluster off ZFS and onto XFS; if I find the issue goes away, 
I'll close this out. I don't think there's anything we can do about unreliable 
filesystems other than improving our error reporting around it.
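
For context, the monitoring patch described above boils down to a per-flush watchdog that dumps the flushing thread's stack once a deadline passes. A minimal sketch of that shape, assuming a daemon timer thread; this is illustrative, not the code on the linked branch:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch: if a flush outlives the deadline, print the flushing thread's stack
// so a hung filesystem call (e.g. unlink/lstat) shows up in the logs.
final class FlushWatchdog
{
    private static final ScheduledExecutorService TIMER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "flush-watchdog");
            t.setDaemon(true);
            return t;
        });

    static ScheduledFuture<?> watch(Thread flushThread, long deadlineSeconds)
    {
        return TIMER.schedule(() -> {
            System.err.println("Flush still running after " + deadlineSeconds + "s on " + flushThread.getName());
            for (StackTraceElement frame : flushThread.getStackTrace())
                System.err.println("\tat " + frame);
        }, deadlineSeconds, TimeUnit.SECONDS);
    }
}

// Usage sketch: ScheduledFuture<?> w = FlushWatchdog.watch(Thread.currentThread(), 60);
// run the flush, then w.cancel(false) when it completes normally.
{code}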


> MemtablePostFlush deadlock leads to stuck nodes and crashes
> ---
>
> Key: CASSANDRA-19564
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19564
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Compaction, Local/Memtable
>Reporter: Jon Haddad
>Priority: Urgent
> Fix For: 4.1.x
>
> Attachments: image-2024-04-16-11-55-54-750.png, 
> image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png, 
> image-2024-04-16-13-53-24-455.png, image-2024-04-17-18-46-29-474.png, 
> image-2024-04-17-19-13-06-769.png, image-2024-04-17-19-14-34-344.png, 
> screenshot-1.png
>
>
> I've run into an issue on a 4.1.4 cluster where an entire node has locked up 
> due to what I believe is a deadlock in memtable flushing. Here's what I know 
> so far.  I've stitched together what happened based on conversations, logs, 
> and some flame graphs.
> *Log reports memtable flushing*
> The last successful flush happens at 12:19. 
> {noformat}
> INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 
> AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', 
> ColumnFamily='version') to free up room. Used total: 0.24/0.33, live: 
> 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15
> INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 
> - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB 
> (13%) on-heap, 790.606MiB (15%) off-heap
> {noformat}
> *MemtablePostFlush appears to be blocked*
> At this point, the MemtablePostFlush completed-task count stops incrementing, 
> active stays at 1, and pending starts to rise.
> {noformat}
> MemtablePostFlush   1    1   3446   0   0
> {noformat}
>  
> The flame graph reveals that PostFlush.call is stuck.  I don't have the line 
> number, but I know we're stuck in 
> {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual 
> below:
> *!image-2024-04-16-13-43-11-064.png!*
> *Memtable flushing is now blocked.*
> All MemtableFlushWriter threads are Parked waiting on 
> {{OpOrder.Barrier.await}}. A wall clock 

[jira] [Updated] (CASSANDRA-19582) [Analytics] Consume new Sidecar client API to stream SSTables

2024-04-23 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRA-19582:
---
Status: Ready to Commit  (was: Review In Progress)

> [Analytics] Consume new Sidecar client API to stream SSTables
> -
>
> Key: CASSANDRA-19582
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19582
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Analytics Library
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A new client API was recently introduced in Sidecar to stream SSTables. 
> Cassandra Analytics needs to start consuming the new API in order to take 
> advantage of the fixes when streaming SSTables from a Cassandra installation 
> with more than one data directory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19582) [Analytics] Consume new Sidecar client API to stream SSTables

2024-04-23 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRA-19582:
---
Reviewers: Yifan Cai
   Status: Review In Progress  (was: Patch Available)

> [Analytics] Consume new Sidecar client API to stream SSTables
> -
>
> Key: CASSANDRA-19582
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19582
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Analytics Library
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A new client API was recently introduced in Sidecar to stream SSTables. 
> Cassandra Analytics needs to start consuming the new API in order to take 
> advantage of the fixes when streaming SSTables from a Cassandra installation 
> with more than one data directory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840173#comment-17840173
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

Sorry for the lack of clarity; today there's no deadline at all, so tasks can 
live in the system essentially forever, clogging the queues with busy work. I 
was intending to post a patch, but it is currently in my CI queue; otherwise 
it is ready to go.



I believe that with a 12-second default, users will only see an improvement and 
there will be no learning curve at all. The configurables are for people who 
understand their request lifetimes and want an even better profile.
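
To make the semantics concrete, here is a hedged, self-contained sketch of deadline-based shedding over a bounded queue (illustrative only, not the actual patch, whose integration with the native transport is more involved):

{code:java}
import java.util.concurrent.*;

// Illustrative sketch: requests go into a *bounded* queue along with their
// arrival time; a worker drops anything whose age exceeds the deadline
// instead of doing busy work on requests the client has given up on.
public final class DeadlineBoundedQueue
{
    private record Task(long enqueuedNanos, Runnable work) {}

    private final BlockingQueue<Task> queue;
    private final long deadlineNanos;

    public DeadlineBoundedQueue(int capacity, long deadlineMillis)
    {
        this.queue = new ArrayBlockingQueue<>(capacity); // bounded, unlike today's queue
        this.deadlineNanos = TimeUnit.MILLISECONDS.toNanos(deadlineMillis);
    }

    /** Sheds immediately (returns false) when the queue is full. */
    public boolean offer(Runnable work)
    {
        return queue.offer(new Task(System.nanoTime(), work));
    }

    /** Worker loop step: run in-deadline tasks, silently shed expired ones. */
    public void drainOne() throws InterruptedException
    {
        Task task = queue.take();
        if (System.nanoTime() - task.enqueuedNanos() > deadlineNanos)
            return; // past the deadline: shed rather than execute
        task.work().run();
    }
}
{code}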

 

 

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840171#comment-17840171
 ] 

Brandon Williams edited comment on CASSANDRA-19534 at 4/23/24 5:53 PM:
---

I think this all sounds good, though there may be a bit of a learning curve for 
users. Native request deadline is easy enough to understand, but things get a 
bit nuanced past that.

Regarding native_transport_timeout_in_ms:
bq. Default is 100 seconds, which is unreasonably high, but not unbounded. In 
practice, we should use at most 12 seconds.

Do you mean this currently exists at 100? If not, what is the rationale for 
that default?


was (Author: brandon.williams):
I think this all sounds good, though there may be a bit of a learning curve for 
users. Native request deadline is easy enough to understand, but things get a 
bit nuanced past that.

bq. Default is 100 seconds, which is unreasonably high, but not unbounded. In 
practice, we should use at most 12 seconds.

Do you mean this currently exists at 100? If not, what is the rationale for 
that default?

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840171#comment-17840171
 ] 

Brandon Williams commented on CASSANDRA-19534:
--

I think this all sounds good, though there may be a bit of a learning curve for 
users. Native request deadline is easy enough to understand, but things get a 
bit nuanced past that.

bq. Default is 100 seconds, which is unreasonably high, but not unbounded. In 
practice, we should use at most 12 seconds.

Do you mean this currently exists at 100? If not, what is the rationale for 
that default?

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRASC-123) Add missing method to retrieve the InetSocketAddress to DriverUtils

2024-04-23 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRASC-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRASC-123:
---
  Fix Version/s: 1.0
Source Control Link: 
https://github.com/apache/cassandra-sidecar/commit/77c815071a66fb53b97e9e07695417004dd88804
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> Add missing method to retrieve the InetSocketAddress to DriverUtils
> ---
>
> Key: CASSANDRASC-123
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-123
> Project: Sidecar for Apache Cassandra
>  Issue Type: Bug
>  Components: Rest API
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: pull-request-available
> Fix For: 1.0
>
>
> Sidecar introduced a shim layer to access the java driver in CASSANDRASC-79, 
> and later enhanced that access in CASSANDRASC-88. However, the 
> {{getInetSocketAddress}} method was missed in the shim layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-sidecar) branch trunk updated: CASSANDRASC-123: Add missing method to retrieve the InetSocketAddress to DriverUtils (#114)

2024-04-23 Thread frankgh
This is an automated email from the ASF dual-hosted git repository.

frankgh pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-sidecar.git


The following commit(s) were added to refs/heads/trunk by this push:
 new 77c8150  CASSANDRASC-123: Add missing method to retrieve the 
InetSocketAddress to DriverUtils (#114)
77c8150 is described below

commit 77c815071a66fb53b97e9e07695417004dd88804
Author: Francisco Guerrero 
AuthorDate: Tue Apr 23 10:44:28 2024 -0700

CASSANDRASC-123: Add missing method to retrieve the InetSocketAddress to 
DriverUtils (#114)

Patch by Francisco Guerrero; Reviewed by Yifan Cai for CASSANDRASC-123
---
 .../apache/cassandra/sidecar/common/utils/DriverUtils.java| 11 +++
 .../cassandra/sidecar/cluster/CassandraAdapterDelegate.java   |  2 +-
 .../cassandra/sidecar/cluster/SidecarLoadBalancingPolicy.java |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/common/src/main/java/org/apache/cassandra/sidecar/common/utils/DriverUtils.java b/common/src/main/java/org/apache/cassandra/sidecar/common/utils/DriverUtils.java
index b070637..aea1351 100644
--- a/common/src/main/java/org/apache/cassandra/sidecar/common/utils/DriverUtils.java
+++ b/common/src/main/java/org/apache/cassandra/sidecar/common/utils/DriverUtils.java
@@ -54,4 +54,15 @@ public class DriverUtils
     {
         return com.datastax.driver.core.DriverUtils.getHost(metadata, localNativeTransportAddress);
     }
+
+    /**
+     * Returns the address that the driver will use to connect to the node.
+     *
+     * @param host the host to which reconnect attempts will be made
+     * @return the address.
+     */
+    public InetSocketAddress getSocketAddress(Host host)
+    {
+        return host.getEndPoint().resolve();
+    }
 }
diff --git a/src/main/java/org/apache/cassandra/sidecar/cluster/CassandraAdapterDelegate.java b/src/main/java/org/apache/cassandra/sidecar/cluster/CassandraAdapterDelegate.java
index cc0a952..5a1628d 100644
--- a/src/main/java/org/apache/cassandra/sidecar/cluster/CassandraAdapterDelegate.java
+++ b/src/main/java/org/apache/cassandra/sidecar/cluster/CassandraAdapterDelegate.java
@@ -547,7 +547,7 @@ public class CassandraAdapterDelegate implements ICassandraAdapter, Host.StateLi
 
     private void runIfThisHost(Host host, Runnable runnable)
     {
-        if (this.localNativeTransportAddress.equals(host.getEndPoint().resolve()))
+        if (this.localNativeTransportAddress.equals(driverUtils.getSocketAddress(host)))
         {
             runnable.run();
         }
diff --git a/src/main/java/org/apache/cassandra/sidecar/cluster/SidecarLoadBalancingPolicy.java b/src/main/java/org/apache/cassandra/sidecar/cluster/SidecarLoadBalancingPolicy.java
index bd6ae95..7bbe6be 100644
--- a/src/main/java/org/apache/cassandra/sidecar/cluster/SidecarLoadBalancingPolicy.java
+++ b/src/main/java/org/apache/cassandra/sidecar/cluster/SidecarLoadBalancingPolicy.java
@@ -233,6 +233,6 @@ class SidecarLoadBalancingPolicy implements LoadBalancingPolicy
 
     private boolean isLocalHost(Host host)
     {
-        return localHostAddresses.contains(host.getEndPoint().resolve());
+        return localHostAddresses.contains(driverUtils.getSocketAddress(host));
     }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRASC-123) Add missing method to retrieve the InetSocketAddress to DriverUtils

2024-04-23 Thread Yifan Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRASC-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai updated CASSANDRASC-123:
--
Reviewers: Yifan Cai, Yifan Cai
   Status: Review In Progress  (was: Patch Available)

+1

> Add missing method to retrieve the InetSocketAddress to DriverUtils
> ---
>
> Key: CASSANDRASC-123
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-123
> Project: Sidecar for Apache Cassandra
>  Issue Type: Bug
>  Components: Rest API
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: pull-request-available
>
> Sidecar introduced a shim layer to access the java driver in CASSANDRASC-79, 
> and later enhanced that access in CASSANDRASC-88. However, the 
> {{getInetSocketAddress}} method was missed in the shim layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRASC-123) Add missing method to retrieve the InetSocketAddress to DriverUtils

2024-04-23 Thread Yifan Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRASC-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai updated CASSANDRASC-123:
--
Status: Ready to Commit  (was: Review In Progress)

> Add missing method to retrieve the InetSocketAddress to DriverUtils
> ---
>
> Key: CASSANDRASC-123
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-123
> Project: Sidecar for Apache Cassandra
>  Issue Type: Bug
>  Components: Rest API
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: pull-request-available
>
> Sidecar introduced a shim layer to access the java driver in CASSANDRASC-79, 
> and later enhanced that access in CASSANDRASC-88. However, the 
> {{getInetSocketAddress}} method was missed in the shim layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19582) [Analytics] Consume new Sidecar client API to stream SSTables

2024-04-23 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRA-19582:
---
Test and Documentation Plan: Update unit tests to use the new API
 Status: Patch Available  (was: In Progress)

PR: https://github.com/apache/cassandra-analytics/pull/54
CI: https://app.circleci.com/pipelines/github/frankgh/cassandra-analytics/178

> [Analytics] Consume new Sidecar client API to stream SSTables
> -
>
> Key: CASSANDRA-19582
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19582
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Analytics Library
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A new client API was recently introduced in Sidecar to stream SSTables. 
> Cassandra Analytics needs to start consuming the new API in order to take 
> advantage of the fixes when streaming SSTables from a Cassandra installation 
> with more than one data directory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRASC-123) Add missing method to retrieve the InetSocketAddress to DriverUtils

2024-04-23 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRASC-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRASC-123:
---
Authors: Francisco Guerrero
Test and Documentation Plan: Existing unit tests
 Status: Patch Available  (was: In Progress)

PR: https://github.com/apache/cassandra-sidecar/pull/114
CI: https://app.circleci.com/pipelines/github/frankgh/cassandra-sidecar/481

> Add missing method to retrieve the InetSocketAddress to DriverUtils
> ---
>
> Key: CASSANDRASC-123
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-123
> Project: Sidecar for Apache Cassandra
>  Issue Type: Bug
>  Components: Rest API
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: pull-request-available
>
> Sidecar introduced a shim layer to access the java driver in CASSANDRASC-79, 
> and later enhanced that access in CASSANDRASC-88. However, the 
> {{getInetSocketAddress}} method was missed in the shim layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRASC-123) Add missing method to retrieve the InetSocketAddress to DriverUtils

2024-04-23 Thread Francisco Guerrero (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRASC-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francisco Guerrero updated CASSANDRASC-123:
---
 Bug Category: Parent values: Correctness(12982)Level 1 values: 
Consistency(12989)
   Complexity: Low Hanging Fruit
Discovered By: User Report
 Severity: Normal
   Status: Open  (was: Triage Needed)

> Add missing method to retrieve the InetSocketAddress to DriverUtils
> ---
>
> Key: CASSANDRASC-123
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-123
> Project: Sidecar for Apache Cassandra
>  Issue Type: Bug
>  Components: Rest API
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>
> Sidecar introduced a shim layer to access the java driver in CASSANDRASC-79, 
> and later enhanced that access in CASSANDRASC-88. However, the 
> {{getInetSocketAddress}} method was missed in the shim layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRASC-123) Add missing method to retrieve the InetSocketAddress to DriverUtils

2024-04-23 Thread Francisco Guerrero (Jira)
Francisco Guerrero created CASSANDRASC-123:
--

 Summary: Add missing method to retrieve the InetSocketAddress to 
DriverUtils
 Key: CASSANDRASC-123
 URL: https://issues.apache.org/jira/browse/CASSANDRASC-123
 Project: Sidecar for Apache Cassandra
  Issue Type: Bug
  Components: Rest API
Reporter: Francisco Guerrero
Assignee: Francisco Guerrero


Sidecar introduced a shim layer to access the java driver in CASSANDRASC-79, 
and later enhanced that access in CASSANDRASC-88. However, the 
{{getInetSocketAddress}} method was missed in the shim layer.
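
For context, a hedged sketch of the shim pattern involved ({{HostLike}} and the class name are illustrative stand-ins, not Sidecar's real types; only getSocketAddress mirrors the method being added): funneling all driver access through one utility class keeps callers off the driver API and lets tests substitute it.

{code:java}
import java.net.InetSocketAddress;

// Illustrative shim: callers depend on this class instead of the driver,
// so a method missing here (as getSocketAddress was) forces callers to
// reach around the shim until it is added.
public class DriverUtilsShimSketch
{
    /** Stand-in for the driver's Host/EndPoint pair. */
    public interface HostLike
    {
        InetSocketAddress resolve();
    }

    public InetSocketAddress getSocketAddress(HostLike host)
    {
        return host.resolve();
    }
}
{code}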



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19577) Queries are not visible to the "system_views.queries" virtual table at the coordinator level

2024-04-23 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839848#comment-17839848
 ] 

Caleb Rackliffe edited comment on CASSANDRA-19577 at 4/23/24 4:29 PM:
--

The [4.1 patch|https://github.com/apache/cassandra/pull/3268] is up.

CI results are clean, modulo a couple of environment-specific things, OOMs, and 
known issues that have nothing to do w/ vtables or {{DebuggableTask}}.


was (Author: maedhroz):
The [4.1 patch|https://github.com/apache/cassandra/pull/3268] is up. CI results 
will be posted soon...

> Queries are not visible to the "system_views.queries" virtual table at the 
> coordinator level
> 
>
> Key: CASSANDRA-19577
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19577
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Virtual Tables
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.1
>
> Attachments: ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There appears to be a hole in the implementation of CASSANDRA-15241 where 
> {{DebuggableTasks}} at the coordinator are not preserved through the creation 
> of {{FutureTasks}} in {{TaskFactory}}. This means that {{QueriesTable}} can't 
> see them when it asks {{SharedExecutorPool}} for running tasks. It should be 
> possible to fix this in {{TaskFactory}} by making sure to propagate any 
> {{RunnableDebuggableTask}} we encounter. We already do this in 
> {{toExecute()}}, but it also needs to happen in the relevant {{toSubmit()}} 
> method(s).
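
For illustration, a self-contained sketch of the wrapping problem (all names here are hypothetical stand-ins, not Cassandra's actual classes): a plain {{FutureTask}} erases the {{DebuggableTask}} interface, so the fix shape is a wrapper that still implements it.

{code:java}
import java.util.concurrent.FutureTask;

// Hypothetical stand-ins for the real interfaces: the point is only that
// wrapping must preserve the marker interface for the monitoring view.
interface DebuggableTask
{
    long approxStartNanos();
    String description();
}

// A FutureTask that still exposes DebuggableTask by delegating to the
// wrapped task, so downcasts by the queries view keep working.
class DebuggableFutureTask<V> extends FutureTask<V> implements DebuggableTask
{
    private final DebuggableTask delegate;

    DebuggableFutureTask(Runnable runnable, V result, DebuggableTask delegate)
    {
        super(runnable, result);
        this.delegate = delegate;
    }

    public long approxStartNanos() { return delegate.approxStartNanos(); }
    public String description()    { return delegate.description(); }
}
{code}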



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19577) Queries are not visible to the "system_views.queries" virtual table at the coordinator level

2024-04-23 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19577:

Attachment: ci_summary.html

> Queries are not visible to the "system_views.queries" virtual table at the 
> coordinator level
> 
>
> Key: CASSANDRA-19577
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19577
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Virtual Tables
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.1
>
> Attachments: ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There appears to be a hole in the implementation of CASSANDRA-15241 where 
> {{DebuggableTasks}} at the coordinator are not preserved through the creation 
> of {{FutureTasks}} in {{TaskFactory}}. This means that {{QueriesTable}} can't 
> see them when it asks {{SharedExecutorPool}} for running tasks. It should be 
> possible to fix this in {{TaskFactory}} by making sure to propagate any 
> {{RunnableDebuggableTask}} we encounter. We already do this in 
> {{toExecute()}}, but it also needs to happen in the relevant {{toSubmit()}} 
> method(s).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18112) Add the feature of INDEX HINT for CQL

2024-04-23 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-18112:

Epic Link: CASSANDRA-19224

> Add the feature of INDEX HINT for CQL 
> --
>
> Key: CASSANDRA-18112
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18112
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL/Syntax, Feature/SAI, Legacy/CQL
>Reporter: Maxwell Guo
>Assignee: Maxwell Guo
>Priority: Normal
> Fix For: 5.0.x
>
>
> It seems that CQL does not have an INDEX HINT capability. For example, when a 
> data table has more than one secondary index and a query hits several of them, 
> the index with the higher estimated row count is chosen. If we want the query 
> executed the way we intend, a hint could specify an index to use or an index 
> to ignore.
> At first I wanted to open a Jira to add a general hint feature for CQL, but I 
> think that may be a gigantic task with no clear goal.
> Besides, I think a DISCUSS thread on the specific grammatical form is needed 
> before starting the work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19577) Queries are not visible to the "system_views.queries" virtual table at the coordinator level

2024-04-23 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19577:

Reviewers: Chris Lohfink

> Queries are not visible to the "system_views.queries" virtual table at the 
> coordinator level
> 
>
> Key: CASSANDRA-19577
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19577
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Virtual Tables
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There appears to be a hole in the implementation of CASSANDRA-15241 where 
> {{DebuggableTasks}} at the coordinator are not preserved through the creation 
> of {{FutureTasks}} in {{TaskFactory}}. This means that {{QueriesTable}} can't 
> see them when it asks {{SharedExecutorPool}} for running tasks. It should be 
> possible to fix this in {{TaskFactory}} by making sure to propagate any 
> {{RunnableDebuggableTask}} we encounter. We already do this in 
> {{toExecute()}}, but it also needs to happen in the relevant {{toSubmit()}} 
> method(s).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-19572) Test failure: org.apache.cassandra.db.ImportTest flakiness

2024-04-23 Thread Stefan Miklosovic (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic reassigned CASSANDRA-19572:
-

Assignee: Stefan Miklosovic

> Test failure: org.apache.cassandra.db.ImportTest flakiness
> --
>
> Key: CASSANDRA-19572
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19572
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Brandon Williams
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
> the following:
>  * testImportCorruptWithoutValidationWithCopying
>  * testImportInvalidateCache
>  * testImportCorruptWithCopying
>  * testImportCacheEnabledWithoutSrcDir
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19572) Test failure: org.apache.cassandra.db.ImportTest flakiness

2024-04-23 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840112#comment-17840112
 ] 

Stefan Miklosovic commented on CASSANDRA-19572:
---

I have a clean 5k run here: 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/4223/workflows/a82d0483-a0df-44ed-8127-088b303c78ba/jobs/225432/steps

I put SSTableReader.resetTidying() into "after test"; I noticed that in 
"after" there are still uncleared references, and cleaning them up seems to help.

I will prepare the other branches.
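
For reference, a minimal sketch of the change (assuming JUnit 4, and that the cleanup lands in the test class's after-test hook; exact placement in the patch may differ):

{code:java}
import org.junit.After;
import org.apache.cassandra.io.sstable.format.SSTableReader;

public class ImportTestCleanupSketch
{
    @After
    public void clearTidierState()
    {
        // Uncleared reader references left behind by one test can make the
        // next import flaky; reset the global tidying state between tests.
        SSTableReader.resetTidying();
    }
}
{code}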

> Test failure: org.apache.cassandra.db.ImportTest flakiness
> --
>
> Key: CASSANDRA-19572
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19572
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Brandon Williams
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
> the following:
>  * testImportCorruptWithoutValidationWithCopying
>  * testImportInvalidateCache
>  * testImportCorruptWithCopying
>  * testImportCacheEnabledWithoutSrcDir
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19498) Error reading data from credential file

2024-04-23 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840086#comment-17840086
 ] 

Brad Schoening commented on CASSANDRA-19498:


[~slavavrn] any suggestions regarding the test_legacy_auth.py?  No other pytest 
imports from bin.cqlsh.

> Error reading data from credential file
> ---
>
> Key: CASSANDRA-19498
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19498
> Project: Cassandra
>  Issue Type: Bug
>  Components: Documentation, Tool/cqlsh
>Reporter: Slava
>Priority: Normal
> Fix For: 4.1.x, 5.0.x, 5.x
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The pylib/cqlshlib/cqlshmain.py code reads data from the credentials file; 
> however, the value is immediately ignored.
> https://github.com/apache/cassandra/blob/c9625e0102dab66f41d3ef2338c54d499e73a8c5/pylib/cqlshlib/cqlshmain.py#L2070
> {code:java}
>     if not options.username:
>         credentials = configparser.ConfigParser()
>         if options.credentials is not None:
>             credentials.read(options.credentials)
>         # use the username from credentials file but fallback to cqlshrc
>         # if username is absent from the command line parameters
>         options.username = username_from_cqlshrc
>
>     if not options.password:
>         rawcredentials = configparser.RawConfigParser()
>         if options.credentials is not None:
>             rawcredentials.read(options.credentials)
>         # handling password in the same way as username, priority cli > credentials > cqlshrc
>         options.password = option_with_default(rawcredentials.get, 'plain_text_auth', 'password', password_from_cqlshrc)
>         options.password = password_from_cqlshrc{code}
> These corrections have been made in accordance with 
> https://issues.apache.org/jira/browse/CASSANDRA-16983 and 
> https://issues.apache.org/jira/browse/CASSANDRA-16456.
> The documentation does not indicate that AuthProviders can be used in the 
> cqlshrc and credentials files.
> I propose to return the ability to use the legacy option of specifying the 
> user and password in the credentials file in the [plain_text_auth] section.
> It is also required to describe the rules for using the credentials file in 
> the documentation.
> I can make a corresponding pull request.
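
For illustration, the precedence the ticket asks to restore, as a hedged sketch (cqlsh itself is Python; this only models the lookup order for both username and password, and all names are illustrative):

{code:java}
// Illustrative only: command line wins, then the credentials file's
// [plain_text_auth] section, then the legacy cqlshrc value.
public final class CredentialPrecedenceSketch
{
    static String resolve(String fromCli, String fromCredentialsFile, String fromCqlshrc)
    {
        if (fromCli != null)
            return fromCli;
        if (fromCredentialsFile != null)
            return fromCredentialsFile;
        return fromCqlshrc;
    }
}
{code}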



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-17667) Text value containing "/*" interpreted as multiline comment in cqlsh

2024-04-23 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-17667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840085#comment-17840085
 ] 

Brad Schoening commented on CASSANDRA-17667:


[~brandon.williams] thanks, it's not forgotten; it's next up after I close out 
CASSANDRA-19450 and 19498.

> Text value containing "/*" interpreted as multiline comment in cqlsh
> 
>
> Key: CASSANDRA-17667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17667
> Project: Cassandra
>  Issue Type: Bug
>  Components: CQL/Interpreter
>Reporter: ANOOP THOMAS
>Assignee: Brad Schoening
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> I use CQLSH command line utility to load some DDLs. The version of utility I 
> use is this:
> {noformat}
> [cqlsh 6.0.0 | Cassandra 4.0.0.47 | CQL spec 3.4.5 | Native protocol 
> v5]{noformat}
> Command that loads DDL.cql:
> {noformat}
> cqlsh -u username -p password cassandra.example.com 65503 --ssl -f DDL.cql
> {noformat}
> I have a line in CQL script that breaks the syntax.
> {noformat}
> INSERT into tablename (key,columnname1,columnname2) VALUES 
> ('keyName','value1','/value2/*/value3');{noformat}
> {{/*}} here is interpreted as the start of a multi-line comment. It used to 
> work on older versions of cqlsh. The error I see looks like this:
> {noformat}
> SyntaxException: line 4:2 mismatched input 'Update' expecting ')' 
> (...,'value1','/value2INSERT into tablename(INSERT into tablename 
> (key,columnname1,columnname2)) VALUES ('[Update]-...) SyntaxException: line 
> 1:0 no viable alternative at input '(' ([(]...)
> {noformat}
> The behavior is the same in interactive mode. {{/*}} inside a CQL statement 
> should not be interpreted as the start of a multi-line comment.
> With schema:
> {code:java}
> CREATE TABLE tablename ( key text primary key, columnname1 text, columnname2 
> text);{code}
>  
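
For what it's worth, the general shape of a fix is a string-aware comment scanner — a hedged sketch (cqlsh's real lexer is Python and handles far more cases):

{code:java}
// Hedged sketch: "/*" should only open a comment when the position is not
// inside a single-quoted CQL string literal. Toggling on every quote also
// handles the '' escape, since it closes and immediately reopens the string.
public final class CommentScanSketch
{
    static boolean startsComment(String cql, int i)
    {
        boolean inString = false;
        for (int p = 0; p < i; p++)
        {
            if (cql.charAt(p) == '\'')
                inString = !inString;
        }
        return !inString && cql.startsWith("/*", i);
    }
}
{code}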



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-19558:
---
Reviewers: Brandon Williams, Michael Semb Wever  (was: Brandon Williams)
   Status: Review In Progress  (was: Patch Available)

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840060#comment-17840060
 ] 

Michael Semb Wever edited comment on CASSANDRA-19558 at 4/23/24 10:55 AM:
--

(1) is committed. 

wrt (2).

Hit a few small problems, INFRA-25738 being one (though unrelated to the 
patch); needed to make it work for the <5 jobs, and against different possible 
python versions (>=3.6).

These are fixed. I'm ready to commit (2), which is now 
[https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/19558]
(excluding the throwaway commit, ofc)

 

Otherwise we have good test runs

in #3 to #7 in 
[https://ci-cassandra.apache.org/job/Cassandra-devbranch-before-5-artifacts/]

and in #588 to #594 in 
[https://ci-cassandra.apache.org/job/Cassandra-4.1-artifacts/] 

 

For (3) it is also needed ensuring it works when github is down (or goes down 
mid build)…

 


was (Author: michaelsembwever):
(1) is committed. 

wrt (2).

hit a few small problems.  INFRA-25738 being one (though unrelated to the 
patch). needed to make it work for the <5 jobs, and against different possible 
python versions (>=3.6).

these are fixed. i'm ready to commit (2) which is now 
https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/19558

 

Otherwise we have good test runs

in #3 to #7 in 
[https://ci-cassandra.apache.org/job/Cassandra-devbranch-before-5-artifacts/]

and in #588 to #594 in 
[https://ci-cassandra.apache.org/job/Cassandra-4.1-artifacts/] 

 

For (3) it is also needed ensuring it works when github is down (or goes down 
mid build)…

 

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-19558:
---
Status: Ready to Commit  (was: Review In Progress)

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840060#comment-17840060
 ] 

Michael Semb Wever edited comment on CASSANDRA-19558 at 4/23/24 10:51 AM:
--

(1) is committed. 

wrt (2).

hit a few small problems.  INFRA-25738 being one (though unrelated to the 
patch). needed to make it work for the <5 jobs, and against different possible 
python versions (>=3.6).

these are fixed. i'm ready to commit (2) which is now 
https://github.com/apache/cassandra-builds/compare/trunk...thelastpickle:cassandra-builds:mck/19558

 

Otherwise we have good test runs

in #3 to #7 in 
[https://ci-cassandra.apache.org/job/Cassandra-devbranch-before-5-artifacts/]

and in #588 to #594 in 
[https://ci-cassandra.apache.org/job/Cassandra-4.1-artifacts/] 

 

For (3) it is also needed ensuring it works when github is down (or goes down 
mid build)…

 


was (Author: michaelsembwever):
(1) is committed. 

wrt (2).

hit a few small problems.  INFRA-25738 being one (though unrelated to the 
patch). needed to make it work for the <5 jobs, and against different possible 
python versions (>=3.6).

these are fixed. i'm ready to commit (2).

 

Otherwise we have good test runs

in #3 to #7 in 
[https://ci-cassandra.apache.org/job/Cassandra-devbranch-before-5-artifacts/]

and in #588 to #594 in 
[https://ci-cassandra.apache.org/job/Cassandra-4.1-artifacts/] 

 

For (3) it is also needed ensuring it works when github is down (or goes down 
mid build)…

 

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840060#comment-17840060
 ] 

Michael Semb Wever edited comment on CASSANDRA-19558 at 4/23/24 10:50 AM:
--

(1) is committed. 

wrt (2).

hit a few small problems.  INFRA-25738 being one (though unrelated to the 
patch). needed to make it work for the <5 jobs, and against different possible 
python versions (>=3.6).

these are fixed. i'm ready to commit (2).

 

Otherwise we have good test runs

in #3 to #7 in 
[https://ci-cassandra.apache.org/job/Cassandra-devbranch-before-5-artifacts/]

and in #588 to #594 in 
[https://ci-cassandra.apache.org/job/Cassandra-4.1-artifacts/] 

 

For (3) it is also needed ensuring it works when github is down (or goes down 
mid build)…

 


was (Author: michaelsembwever):
(1) is committed. 

working on (2).

hit a few small problems.  INFRA-25738 being one (though unrelated to the 
patch). needed to make it work for the <5 jobs, and against different possible 
python versions (>=3.6), and ensuring it works when github is down (or goes 
down mid build)…

 

Otherwise we have good test runs

in #3 to #7 in 
[https://ci-cassandra.apache.org/job/Cassandra-devbranch-before-5-artifacts/]

and in #588 to #594 in 
[https://ci-cassandra.apache.org/job/Cassandra-4.1-artifacts/] 

 

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to login (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g just heap)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840060#comment-17840060
 ] 

Michael Semb Wever edited comment on CASSANDRA-19558 at 4/23/24 10:48 AM:
--

(1) is committed. 

working on (2).

hit a few small problems.  INFRA-25738 being one (though unrelated to the 
patch). needed to make it work for the <5 jobs, and against different possible 
python versions (>=3.6), and ensuring it works when github is down (or goes 
down mid build)…

 

Otherwise we have good test runs

in #3 to #7 in 
[https://ci-cassandra.apache.org/job/Cassandra-devbranch-before-5-artifacts/]

and in #588 to #594 in 
[https://ci-cassandra.apache.org/job/Cassandra-4.1-artifacts/] 

 


was (Author: michaelsembwever):
(1) is committed. 

Working on (2).

Hit a few small problems, INFRA-25738 being one (though unrelated to the 
patch). Needed to make it work for the pre-5.0 jobs, against the different 
possible Python versions (>=3.6), and to ensure it works when GitHub is down 
(or goes down mid-build)…

 

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to log in (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g for heap alone)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19558) Standalone jenkinsfile first round bug fixes

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840060#comment-17840060
 ] 

Michael Semb Wever commented on CASSANDRA-19558:


(1) is committed. 

Working on (2).

Hit a few small problems, INFRA-25738 being one (though unrelated to the 
patch). Needed to make it work for the pre-5.0 jobs, against the different 
possible Python versions (>=3.6), and to ensure it works when GitHub is down 
(or goes down mid-build)…

 

> Standalone jenkinsfile first round bug fixes
> 
>
> Key: CASSANDRA-19558
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19558
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 5.0.x, 5.x
>
> Attachments:  CASSANDRA-19558_50_#5_ci_summary.html,  
> CASSANDRA-19558_50_#5_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#13_ci_summary.html, 
> CASSANDRA-19558-5.0_#13_results_details.tar.xz, 
> CASSANDRA-19558-5.0_#16_ci_summary.html, 
> CASSANDRA-19558-5.0_#16_results_details.tar.xz, 
> CASSANDRA-19558_#8_ci_summary.html, CASSANDRA-19558_#8_results_details.tar.xz
>
>
> A few follow up improvements and bug fixes for the standalone jenkinsfile.
> - add at top a list of test failures in ci_summary.html
> - docker scripts always try to log in (as base images need to be pulled too)
> - move simulator-dtests to large containers (they need 8g for heap alone)
> - in ubuntu2004_test.docker make sure /home/cassandra exists and has correct 
> perms (from marcuse)
> - persist the jenkinsfile parameters from run to run (important for the 
> post-commit jobs to keep their non-default branch and profile values) (was 
> CASSANDRA-19536)
> - increase jvm-dtest splits from 8 to 12
> - when on ci-cassandra, replace use of copyArtifacts in Jenkinsfile 
> generateTestReports() with manual wget of test files, allowing the summary 
> phase to be run on any agent (copyArtifact would take >4hrs otherwise) (was 
> INFRA-25694)
> - copy ci_summary.html and results_details.tar.xz to nightlies



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18942) Repeatable java test runs on jenkins

2024-04-23 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839853#comment-17839853
 ] 

Michael Semb Wever edited comment on CASSANDRA-18942 at 4/23/24 10:43 AM:
--

dependencies are all done, re-opening.

 

[~bereng], can we rebase this and make the changes in the existing scripts? 
(I'll review and help ensure the changes don't break compatibility.)


was (Author: michaelsembwever):
dependencies are all done, re-opening.

> Repeatable java test runs on jenkins
> 
>
> Key: CASSANDRA-18942
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18942
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Build, CI
>Reporter: Berenguer Blasi
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 5.0, 5.0.x
>
> Attachments: jenkins_job.xml, testJava.txt, testJavaDocker.txt, 
> testJavaSplits.txt
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It is our policy to loop newly introduced tests to avoid introducing flakies. 
> We also want to add the possibility of repeating a test N times to 
> test robustness, debug flakies, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: Scenario 2 - QUEUE + Backpressure.jpg
Scenario 2 - QUEUE.jpg
Scenario 2 - Stock.jpg

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should shed load much more aggressively and 
> use a bounded queue for incoming work.  This is extremely evident when we 
> combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                  Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: Scenario 1 - QUEUE.jpg
Scenario 1 - QUEUE + Backpressure.jpg
Scenario 1 - Stock.jpg

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should shed load much more aggressively and 
> use a bounded queue for incoming work.  This is extremely evident when we 
> combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                  Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840058#comment-17840058
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

The main change is the introduction of a (currently implicit) configurable 
{_}native request deadline{_}. No request, read or write, will be allowed to 
prolong its execution beyond this deadline. Some of the hidden places that 
would allow requests to stay overdue were local executor runnables, 
replica-side writes, and hints. The default is 12 seconds, since this is how 
long the 3.x driver (which I believe is still the most used version in the 
community) waits before removing its handlers, after which any response from 
the server is simply ignored. Now there is an _option_ to enable expiration 
based on queue time, which will be _disabled_ by default to preserve existing 
semantics, but my tests have shown that enabling it has only positive effects. 
We will try it out cautiously in different clusters over the coming months and 
see whether the tests match up with real loads before we change any of the 
defaults.

So by default the behaviour will be as follows:
 # If a request has spent more than 12 seconds in the NATIVE queue, we throw an 
Overloaded exception back to the client. (This timeout used to be the max of 
the read/write/range/counter RPC timeouts.)
 # If a request has spent less than 12 seconds, it is allowed to execute; any 
request issued by the coordinator can live:
 ## _either_ for {{Verb.timeout}} milliseconds,
 ## _or_ up to the native request deadline, as measured from the time the 
request was admitted to the coordinator's NATIVE queue, whichever happens 
earlier.

Example 1, read timeout is 5 seconds:
 # Client sends a request; the request spends 6 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; two replicas respond within 3 
seconds
 # Coordinator responds to the client with success

Example 2, read timeout is 5 seconds:
 # Client sends a request; the request spends 6 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; one replica responds within 3 
seconds; the other replicas fail to respond within the 5-second read timeout
 # Coordinator responds to the client with a read timeout (preserving current 
behaviour)

Example 3, read timeout is 5 seconds:
 # Client sends a request; the request spends 10 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; all replicas fail to respond within 
the remaining 2 seconds
 # Coordinator responds to the client with a read timeout; if messages are 
still queued on replicas, they will be dropped before processing
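
To make the arithmetic concrete, here is a minimal standalone sketch of the deadline logic described above. It is my own illustration, not the patch's actual code; the constant names and the queue-admission timestamp plumbing are assumptions:
{noformat}
// Illustrative sketch only -- not the actual Cassandra implementation.
public final class DeadlineSketch
{
    static final long NATIVE_TRANSPORT_TIMEOUT_MILLIS = 12_000; // assumed default native request deadline
    static final long READ_TIMEOUT_MILLIS = 5_000;              // the verb (read) timeout from the examples

    /** Deadline budget (ms) left, measured from admission to the NATIVE queue. */
    static long remainingBudget(long queueAdmissionMillis, long nowMillis)
    {
        return (queueAdmissionMillis + NATIVE_TRANSPORT_TIMEOUT_MILLIS) - nowMillis;
    }

    /** How long replica requests may live: the verb timeout, capped by the native deadline. */
    static long replicaRequestTimeout(long queueAdmissionMillis, long nowMillis)
    {
        return Math.min(READ_TIMEOUT_MILLIS, remainingBudget(queueAdmissionMillis, nowMillis));
    }

    public static void main(String[] args)
    {
        long admitted = 0;
        // Example 3: the request sat 10s in the queue, so replicas only get the remaining 2s.
        System.out.println(replicaRequestTimeout(admitted, 10_000)); // 2000
        // Examples 1 and 2: the request sat 6s in the queue, so the 5s verb timeout binds.
        System.out.println(replicaRequestTimeout(admitted, 6_000));  // 5000
        // Past the deadline entirely: an Overloaded exception would go back to the client.
        System.out.println(remainingBudget(admitted, 13_000) <= 0);  // true
    }
}
{noformat}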

There will be a _new_ metric that shows how many of the timeouts would 
previously have been “blind timeouts”, i.e. the client _would_ register them as 
timeouts, but we as server-side operators would be oblivious to them. This 
metric will keep us collectively motivated even if we see a slight uptick in 
timeouts after committing the patch.

Lastly, there is an option to limit how much of the 12 seconds client requests 
are allowed to spend in the native queue. For example, if a client request has 
spent 80% of its 12-second maximum in the native queue, we start applying 
backpressure to the client socket (or throwing an Overloaded exception, 
depending on the value of {{{}native_transport_throw_on_overload{}}}). We have 
to be careful with enabling this one: my tests have shown that while we see 
fewer timeouts server-side, clients see more timeouts, because part of what 
they consider “request time” is now spent somewhere in TCP queues, which we 
cannot account for.
h3. New Configuration Params
h3. cql_start_time

Configures what is considered the base for the replica-side timeout. This 
option actually existed before; it is now safe to enable. It still defaults to 
{{REQUEST}} (processing start time is taken as the timeout base), and the 
alternative is {{QUEUE}} (queue admission time is taken as the timeout base). 
Unfortunately, there is no consistent view of the timeout base in the 
community: some people think that server-side read/write timeouts are how much 
time _replicas_ have to respond to the coordinator; others believe they mean 
how much time the _coordinator_ has to respond to the client. This patch is 
agnostic to these beliefs.
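
A tiny illustration of how the two bases differ (again my own sketch with assumed numbers, not the patch's code):
{noformat}
// Sketch: how the timeout base (cql_start_time) changes the effective deadline.
public final class TimeoutBaseSketch
{
    public static void main(String[] args)
    {
        long queueAdmissionMillis = 0;      // request entered the NATIVE queue
        long processingStartMillis = 6_000; // request left the queue 6s later
        long verbTimeoutMillis = 5_000;     // e.g. a read timeout

        // REQUEST (default): the base is processing start.
        long requestBaseDeadline = processingStartMillis + verbTimeoutMillis;
        // QUEUE: the base is queue admission.
        long queueBaseDeadline = queueAdmissionMillis + verbTimeoutMillis;

        System.out.println(requestBaseDeadline); // 11000 -- replicas still get the full 5s
        System.out.println(queueBaseDeadline);   // 5000  -- already expired before processing began
    }
}
{noformat}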
h3. native_transport_throw_on_overload

Whether we should apply backpressure to the client (i.e. stop reading from the 
socket) or throw an Overloaded exception. The default is socket backpressure, 
and this is probably fine for now. In principle, this can also be set by the 
client on a per-connection basis via protocol options. However, the 3.x series 
of the driver does not implement this addition, so in practice it is not really 
used. If used, the client's setting takes precedence.
h3. native_transport_timeout_in_ms

The absolute maximum amount of time the server has to respond to 

[jira] [Updated] (CASSANDRA-19572) Test failure: org.apache.cassandra.db.ImportTest flakiness

2024-04-23 Thread Stefan Miklosovic (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic updated CASSANDRA-19572:
--
Description: 
As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
the following:
 * testImportCorruptWithoutValidationWithCopying
 * testImportInvalidateCache
 * testImportCorruptWithCopying
 * testImportCacheEnabledWithoutSrcDir

[https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]

  was:
As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
the following:
 * testImportCorruptWithoutValidationWithCopying
 * testImportInvalidateCache
 * testImportCorruptWithCopying
 * testImportCacheEnabledWithoutSrcDir
 * testImportInvalidateCache

[https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]


> Test failure: org.apache.cassandra.db.ImportTest flakiness
> --
>
> Key: CASSANDRA-19572
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19572
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Brandon Williams
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
> the following:
>  * testImportCorruptWithoutValidationWithCopying
>  * testImportInvalidateCache
>  * testImportCorruptWithCopying
>  * testImportCacheEnabledWithoutSrcDir
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19190) ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-19190:

Status: Ready to Commit  (was: Review In Progress)

> ForceSnapshot transformations should not be persisted in the local log table
> 
>
> Key: CASSANDRA-19190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19190
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-2.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per its inline comments, ForceSnapshot is a synthetic transformation whose 
> purpose is to enable the local log to jump missing epochs. A common use for 
> this is when replaying persisted events from the metadata log at startup. The 
> log is initialised with {{Epoch.EMPTY}}, but rather than replaying every 
> single entry since the beginning of history, we select the most recent 
> snapshot held locally and start the replay from that point. Likewise, when 
> catching up from a peer, a node may receive a snapshot plus subsequent log 
> entries. In order to bring local metadata to the same state as the snapshot, 
> a {{ForceSnapshot}} with the same epoch as the snapshot is inserted into the 
> {{LocalLog}} and enacted like any other transformation. These synthetic 
> transformations should not be persisted in the `system.local_metadata_log`, 
> as they do not exist in the distributed metadata log. We _should_ persist the 
> snapshot itself in {{system.metadata_snapshots}} so that we can avoid having 
> to re-fetch remote snapshots (i.e. if a node were to restart shortly after 
> receiving a catchup from a peer).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19190) ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-19190:

Reviewers: Marcus Eriksson
   Status: Review In Progress  (was: Patch Available)

> ForceSnapshot transformations should not be persisted in the local log table
> 
>
> Key: CASSANDRA-19190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19190
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-2.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per its inline comments, ForceSnapshot is a synthetic transformation whose 
> purpose is to enable the local log to jump missing epochs. A common use for 
> this is when replaying persisted events from the metadata log at startup. The 
> log is initialised with {{Epoch.EMPTY}}, but rather than replaying every 
> single entry since the beginning of history, we select the most recent 
> snapshot held locally and start the replay from that point. Likewise, when 
> catching up from a peer, a node may receive a snapshot plus subsequent log 
> entries. In order to bring local metadata to the same state as the snapshot, 
> a {{ForceSnapshot}} with the same epoch as the snapshot is inserted into the 
> {{LocalLog}} and enacted like any other transformation. These synthetic 
> transformations should not be persisted in the `system.local_metadata_log`, 
> as they do not exist in the distributed metadata log. We _should_ persist the 
> snapshot itself in {{system.metadata_snapshots}} so that we can avoid having 
> to re-fetch remote snapshots (i.e. if a node were to restart shortly after 
> receiving a catchup from a peer).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19190) ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-19190:

Source Control Link: 
https://github.com/apache/cassandra/commit/17ecece5437ab39aaeaa0eb4b42434cddd9960b5
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> ForceSnapshot transformations should not be persisted in the local log table
> 
>
> Key: CASSANDRA-19190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19190
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-2.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per its inline comments, ForceSnapshot is a synthetic transformation whose 
> purpose is to enable the local log to jump missing epochs. A common use for 
> this is when replaying persisted events from the metadata log at startup. The 
> log is initialised with {{Epoch.EMPTY}}, but rather than replaying every 
> single entry since the beginning of history, we select the most recent 
> snapshot held locally and start the replay from that point. Likewise, when 
> catching up from a peer, a node may receive a snapshot plus subsequent log 
> entries. In order to bring local metadata to the same state as the snapshot, 
> a {{ForceSnapshot}} with the same epoch as the snapshot is inserted into the 
> {{LocalLog}} and enacted like any other transformation. These synthetic 
> transformations should not be persisted in the `system.local_metadata_log`, 
> as they do not exist in the distributed metadata log. We _should_ persist the 
> snapshot itself in {{system.metadata_snapshots}} so that we can avoid having 
> to re-fetch remote snapshots (i.e. if a node were to restart shortly after 
> receiving a catchup from a peer).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra) branch trunk updated: ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread marcuse
This is an automated email from the ASF dual-hosted git repository.

marcuse pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/trunk by this push:
 new 17ecece543 ForceSnapshot transformations should not be persisted in the local log table
17ecece543 is described below

commit 17ecece5437ab39aaeaa0eb4b42434cddd9960b5
Author: Sam Tunnicliffe 
AuthorDate: Thu Dec 14 17:55:05 2023 +

ForceSnapshot transformations should not be persisted in the local log table

Patch by Sam Tunnicliffe; reviewed by marcuse for CASSANDRA-19190
---
 .../apache/cassandra/schema/DistributedSchema.java |  11 +-
 .../org/apache/cassandra/tcm/ClusterMetadata.java  |   2 +-
 .../cassandra/tcm/StubClusterMetadataService.java  |  83 -
 .../tcm/listeners/MetadataSnapshotListener.java|  10 +-
 .../org/apache/cassandra/tcm/log/LocalLog.java |   6 +-
 .../test/log/ClusterMetadataTestHelper.java|  19 ++-
 .../listeners/MetadataSnapshotListenerTest.java| 133 +
 .../org/apache/cassandra/tcm/log/LocalLogTest.java |  54 +
 8 files changed, 310 insertions(+), 8 deletions(-)

diff --git a/src/java/org/apache/cassandra/schema/DistributedSchema.java b/src/java/org/apache/cassandra/schema/DistributedSchema.java
index 86dd1d5117..a837b0773d 100644
--- a/src/java/org/apache/cassandra/schema/DistributedSchema.java
+++ b/src/java/org/apache/cassandra/schema/DistributedSchema.java
@@ -58,9 +58,16 @@ public class DistributedSchema implements MetadataValue<DistributedSchema>
         return new DistributedSchema(Keyspaces.none(), Epoch.EMPTY);
     }
 
-    public static DistributedSchema first()
+    public static DistributedSchema first(Set<String> knownDatacenters)
     {
-        return new DistributedSchema(Keyspaces.of(DistributedMetadataLogKeyspace.initialMetadata(Collections.singleton(DatabaseDescriptor.getLocalDataCenter()))), Epoch.FIRST);
+        if (knownDatacenters.isEmpty())
+        {
+            if (DatabaseDescriptor.getLocalDataCenter() != null)
+                knownDatacenters = Collections.singleton(DatabaseDescriptor.getLocalDataCenter());
+            else
+                knownDatacenters = Collections.singleton("DC1");
+        }
+        return new DistributedSchema(Keyspaces.of(DistributedMetadataLogKeyspace.initialMetadata(knownDatacenters)), Epoch.FIRST);
     }
 
     private final Keyspaces keyspaces;
diff --git a/src/java/org/apache/cassandra/tcm/ClusterMetadata.java b/src/java/org/apache/cassandra/tcm/ClusterMetadata.java
index 33886bec40..fdf4942c13 100644
--- a/src/java/org/apache/cassandra/tcm/ClusterMetadata.java
+++ b/src/java/org/apache/cassandra/tcm/ClusterMetadata.java
@@ -107,7 +107,7 @@ public class ClusterMetadata
     @VisibleForTesting
     public ClusterMetadata(IPartitioner partitioner, Directory directory)
     {
-        this(partitioner, directory, DistributedSchema.first());
+        this(partitioner, directory, DistributedSchema.first(directory.knownDatacenters()));
     }
 
 @VisibleForTesting
diff --git a/src/java/org/apache/cassandra/tcm/StubClusterMetadataService.java b/src/java/org/apache/cassandra/tcm/StubClusterMetadataService.java
index 475e8ef21b..8e191307d1 100644
--- a/src/java/org/apache/cassandra/tcm/StubClusterMetadataService.java
+++ b/src/java/org/apache/cassandra/tcm/StubClusterMetadataService.java
@@ -20,15 +20,24 @@ package org.apache.cassandra.tcm;
 
 import java.util.Collections;
 
+import com.google.common.collect.ImmutableMap;
+
 import org.apache.cassandra.config.DatabaseDescriptor;
+import org.apache.cassandra.dht.IPartitioner;
 import org.apache.cassandra.schema.DistributedMetadataLogKeyspace;
 import org.apache.cassandra.schema.DistributedSchema;
 import org.apache.cassandra.schema.KeyspaceMetadata;
 import org.apache.cassandra.schema.Keyspaces;
+import org.apache.cassandra.tcm.Commit.Replicator;
 import org.apache.cassandra.tcm.log.Entry;
 import org.apache.cassandra.tcm.log.LocalLog;
 import org.apache.cassandra.tcm.membership.Directory;
+import org.apache.cassandra.tcm.ownership.DataPlacements;
+import org.apache.cassandra.tcm.ownership.PlacementProvider;
+import org.apache.cassandra.tcm.ownership.TokenMap;
 import org.apache.cassandra.tcm.ownership.UniformRangePlacement;
+import org.apache.cassandra.tcm.sequences.InProgressSequences;
+import org.apache.cassandra.tcm.sequences.LockedRanges;
 
 public class StubClusterMetadataService extends ClusterMetadataService
 {
@@ -73,12 +82,24 @@ public class StubClusterMetadataService extends ClusterMetadataService
   .withInitialState(initial)
   .createLog(),
   new StubProcessor(),
-  Commit.Replicator.NO_OP,
+  Replicator.NO_OP,
   false);
 this.metadata = initial;
 this.log().readyUnchecked();
 }
 
+private StubClusterMetadataService(PlacementProvider 

[jira] [Commented] (CASSANDRA-19190) ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread Marcus Eriksson (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840045#comment-17840045
 ] 

Marcus Eriksson commented on CASSANDRA-19190:
-

Attaching a new CI run: two failures, CASSANDRA-17339 and a counter mismatch, 
so I'm +1 here; will get it committed.

> ForceSnapshot transformations should not be persisted in the local log table
> 
>
> Key: CASSANDRA-19190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19190
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-2.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per its inline comments, ForceSnapshot is a synthetic transformation whose 
> purpose is to enable the local log to jump missing epochs. A common use for 
> this is when replaying persisted events from the metadata log at startup. The 
> log is initialised with {{Epoch.EMPTY}}, but rather than replaying every 
> single entry since the beginning of history, we select the most recent 
> snapshot held locally and start the replay from that point. Likewise, when 
> catching up from a peer, a node may receive a snapshot plus subsequent log 
> entries. In order to bring local metadata to the same state as the snapshot, 
> a {{ForceSnapshot}} with the same epoch as the snapshot is inserted into the 
> {{LocalLog}} and enacted like any other transformation. These synthetic 
> transformations should not be persisted in the `system.local_metadata_log`, 
> as they do not exist in the distributed metadata log. We _should_ persist the 
> snapshot itself in {{system.metadata_snapshots}} so that we can avoid having 
> to re-fetch remote snapshots (i.e. if a node were to restart shortly after 
> receiving a catchup from a peer).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19190) ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-19190:

Attachment: (was: ci_summary-1.html)

> ForceSnapshot transformations should not be persisted in the local log table
> 
>
> Key: CASSANDRA-19190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19190
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-2.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per its inline comments, ForceSnapshot is a synthetic transformation whose 
> purpose is to enable the local log to jump missing epochs. A common use for 
> this is when replaying persisted events from the metadata log at startup. The 
> log is initialised with {{Epoch.EMPTY}}, but rather than replaying every 
> single entry since the beginning of history, we select the most recent 
> snapshot held locally and start the replay from that point. Likewise, when 
> catching up from a peer, a node may receive a snapshot plus subsequent log 
> entries. In order to bring local metadata to the same state as the snapshot, 
> a {{ForceSnapshot}} with the same epoch as the snapshot is inserted into the 
> {{LocalLog}} and enacted like any other transformation. These synthetic 
> transformations should not be persisted in the `system.local_metadata_log`, 
> as they do not exist in the distributed metadata log. We _should_ persist the 
> snapshot itself in {{system.metadata_snapshots}} so that we can avoid having 
> to re-fetch remote snapshots (i.e. if a node were to restart shortly after 
> receiving a catchup from a peer).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19190) ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-19190:

Attachment: (was: ci_summary.html)

> ForceSnapshot transformations should not be persisted in the local log table
> 
>
> Key: CASSANDRA-19190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19190
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-2.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per its inline comments, ForceSnapshot is a synthetic transformation whose 
> purpose is to enable the local log to jump missing epochs. A common use for 
> this is when replaying persisted events from the metadata log at startup. The 
> log is initialised with {{Epoch.EMPTY}}, but rather than replaying every 
> single entry since the beginning of history, we select the most recent 
> snapshot held locally and start the replay from that point. Likewise, when 
> catching up from a peer, a node may receive a snapshot plus subsequent log 
> entries. In order to bring local metadata to the same state as the snapshot, 
> a {{ForceSnapshot}} with the same epoch as the snapshot is inserted into the 
> {{LocalLog}} and enacted like any other transformation. These synthetic 
> transformations should not be persisted in the `system.local_metadata_log`, 
> as they do not exist in the distributed metadata log. We _should_ persist the 
> snapshot itself in {{system.metadata_snapshots}} so that we can avoid having 
> to re-fetch remote snapshots (i.e. if a node were to restart shortly after 
> receiving a catchup from a peer).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19190) ForceSnapshot transformations should not be persisted in the local log table

2024-04-23 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-19190:

Attachment: ci_summary-2.html

> ForceSnapshot transformations should not be persisted in the local log table
> 
>
> Key: CASSANDRA-19190
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19190
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-2.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Per its inline comments, ForceSnapshot is a synthetic transformation whose 
> purpose is to enable the local log to jump missing epochs. A common use for 
> this is when replaying persisted events from the metadata log at startup. The 
> log is initialised with {{Epoch.EMPTY}}, but rather than replaying every 
> single entry since the beginning of history, we select the most recent 
> snapshot held locally and start the replay from that point. Likewise, when 
> catching up from a peer, a node may receive a snapshot plus subsequent log 
> entries. In order to bring local metadata to the same state as the snapshot, 
> a {{ForceSnapshot}} with the same epoch as the snapshot is inserted into the 
> {{LocalLog}} and enacted like any other transformation. These synthetic 
> transformations should not be persisted in the `system.local_metadata_log`, 
> as they do not exist in the distributed metadata log. We _should_ persist the 
> snapshot itself in {{system.metadata_snapshots}} so that we can avoid having 
> to re-fetch remote snapshots (i.e. if a node were to restart shortly after 
> receiving a catchup from a peer).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-website) branch asf-staging updated (916d1569 -> a9e9af59)

2024-04-23 Thread git-site-role
This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git


 discard 916d1569 generate docs for cc1c7113
 new a9e9af59 generate docs for cc1c7113

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (916d1569)
\
 N -- N -- N   refs/heads/asf-staging (a9e9af59)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 site-ui/build/ui-bundle.zip | Bin 4883646 -> 4883646 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19572) Test failure: org.apache.cassandra.db.ImportTest flakiness

2024-04-23 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840013#comment-17840013
 ] 

Stefan Miklosovic commented on CASSANDRA-19572:
---

[~marcuse] There is already something about "releasing" SSTables here (1); I 
wonder what your thought process was behind that, since that is the part of 
the functionality where it is failing. What's the context?

(1) 
https://github.com/apache/cassandra/blob/cassandra-4.0/test/unit/org/apache/cassandra/db/ImportTest.java#L235

> Test failure: org.apache.cassandra.db.ImportTest flakiness
> --
>
> Key: CASSANDRA-19572
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19572
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Brandon Williams
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
> the following:
>  * testImportCorruptWithoutValidationWithCopying
>  * testImportInvalidateCache
>  * testImportCorruptWithCopying
>  * testImportCacheEnabledWithoutSrcDir
>  * testImportInvalidateCache
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19572) Test failure: org.apache.cassandra.db.ImportTest flakiness

2024-04-23 Thread Marcus Eriksson (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839994#comment-17839994
 ] 

Marcus Eriksson commented on CASSANDRA-19572:
-

sorry, don't remember seeing these errors

> Test failure: org.apache.cassandra.db.ImportTest flakiness
> --
>
> Key: CASSANDRA-19572
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19572
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/bulk load
>Reporter: Brandon Williams
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> As discovered on CASSANDRA-19401, the tests in this class are flaky, at least 
> the following:
>  * testImportCorruptWithoutValidationWithCopying
>  * testImportInvalidateCache
>  * testImportCorruptWithCopying
>  * testImportCacheEnabledWithoutSrcDir
>  * testImportInvalidateCache
> [https://app.circleci.com/pipelines/github/instaclustr/cassandra/4199/workflows/a70b41d8-f848-4114-9349-9a01ac082281/jobs/223621/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19191) Optimisations to PlacementForRange, improve lookup on r/w path

2024-04-23 Thread Marcus Eriksson (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-19191:

Source Control Link: 
https://github.com/apache/cassandra/commit/34d999c47a4da6d43a67910354fb9888184b23ab
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

and committed, thanks

> Optimisations to PlacementForRange, improve lookup on r/w path
> --
>
> Key: CASSANDRA-19191
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19191
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html, result_details.tar.gz
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The lookup used when selecting the appropriate replica group for a range or 
> token while performing reads and writes is extremely simplistic and 
> inefficient. There is plenty of scope to improve {{PlacementsForRange}} by 
> replacing the current naive iteration with a more efficient lookup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra) branch trunk updated: Optimisations to PlacementForRange, improve lookup on r/w path

2024-04-23 Thread marcuse
This is an automated email from the ASF dual-hosted git repository.

marcuse pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/trunk by this push:
 new 34d999c47a Optimisations to PlacementForRange, improve lookup on r/w path
34d999c47a is described below

commit 34d999c47a4da6d43a67910354fb9888184b23ab
Author: Marcus Eriksson 
AuthorDate: Wed Mar 20 15:53:50 2024 +0100

Optimisations to PlacementForRange, improve lookup on r/w path

Patch by marcuse and Sam Tunnicliffe; reviewed by Sam Tunnicliffe for 
CASSANDRA-19191

Co-authored-by: Sam Tunnicliffe 
Co-authored-by: Marcus Eriksson 
---
 .../apache/cassandra/locator/LocalStrategy.java|   6 +-
 .../cassandra/locator/NetworkTopologyStrategy.java |   6 +-
 .../apache/cassandra/locator/SimpleStrategy.java   |   6 +-
 .../org/apache/cassandra/tcm/ClusterMetadata.java  |  16 +-
 .../cassandra/tcm/ownership/DataPlacement.java |  68 +++
 .../cassandra/tcm/ownership/DataPlacements.java|  14 +-
 .../{PlacementForRange.java => ReplicaGroups.java} | 206 +
 .../org/apache/cassandra/tcm/sequences/Move.java   |   4 +-
 .../cassandra/tcm/sequences/RemoveNodeStreams.java |   4 +-
 .../cassandra/distributed/shared/ClusterUtils.java |   4 +-
 .../test/log/MetadataChangeSimulationTest.java |  26 +--
 .../test/log/OperationalEquivalenceTest.java   |   4 +-
 .../distributed/test/log/SimulatedOperation.java   |   4 +-
 .../distributed/test/ring/RangeVersioningTest.java |   4 +-
 .../test/microbench/ReplicaGroupsBench.java| 138 ++
 .../tcm/compatibility/GossipHelperTest.java|   6 +-
 .../tcm/ownership/UniformRangePlacementTest.java   |  68 ---
 .../InProgressSequenceCancellationTest.java|  18 +-
 .../cassandra/tcm/sequences/SequencesUtils.java|   2 +-
 19 files changed, 392 insertions(+), 212 deletions(-)

diff --git a/src/java/org/apache/cassandra/locator/LocalStrategy.java b/src/java/org/apache/cassandra/locator/LocalStrategy.java
index 69193090c4..4032ce1594 100644
--- a/src/java/org/apache/cassandra/locator/LocalStrategy.java
+++ b/src/java/org/apache/cassandra/locator/LocalStrategy.java
@@ -26,7 +26,7 @@ import org.apache.cassandra.dht.Token;
 import org.apache.cassandra.tcm.ClusterMetadata;
 import org.apache.cassandra.tcm.Epoch;
 import org.apache.cassandra.tcm.ownership.DataPlacement;
-import org.apache.cassandra.tcm.ownership.PlacementForRange;
+import org.apache.cassandra.tcm.ownership.ReplicaGroups;
 import org.apache.cassandra.tcm.ownership.VersionedEndpoints;
 import org.apache.cassandra.utils.FBUtilities;
 
@@ -65,7 +65,7 @@ public class LocalStrategy extends SystemStrategy
 {
     public static final Range<Token> entireRange = new Range<>(DatabaseDescriptor.getPartitioner().getMinimumToken(), DatabaseDescriptor.getPartitioner().getMinimumToken());
     public static final EndpointsForRange localReplicas = EndpointsForRange.of(new Replica(FBUtilities.getBroadcastAddressAndPort(), entireRange, true));
-    public static final DataPlacement placement = new DataPlacement(PlacementForRange.builder().withReplicaGroup(VersionedEndpoints.forRange(Epoch.FIRST, localReplicas)).build(),
-                                                                    PlacementForRange.builder().withReplicaGroup(VersionedEndpoints.forRange(Epoch.FIRST, localReplicas)).build());
+    public static final DataPlacement placement = new DataPlacement(ReplicaGroups.builder().withReplicaGroup(VersionedEndpoints.forRange(Epoch.FIRST, localReplicas)).build(),
+                                                                    ReplicaGroups.builder().withReplicaGroup(VersionedEndpoints.forRange(Epoch.FIRST, localReplicas)).build());
     }
 }
diff --git a/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java b/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java
index d48ee31610..05bfcfb9ed 100644
--- a/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java
+++ b/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java
@@ -49,7 +49,7 @@ import org.apache.cassandra.tcm.membership.Directory;
 import org.apache.cassandra.tcm.membership.Location;
 import org.apache.cassandra.tcm.membership.NodeId;
 import org.apache.cassandra.tcm.ownership.DataPlacement;
-import org.apache.cassandra.tcm.ownership.PlacementForRange;
+import org.apache.cassandra.tcm.ownership.ReplicaGroups;
 import org.apache.cassandra.tcm.ownership.TokenMap;
 import org.apache.cassandra.tcm.ownership.VersionedEndpoints;
 import org.apache.cassandra.utils.FBUtilities;
@@ -194,7 +194,7 @@ public class NetworkTopologyStrategy extends AbstractReplicationStrategy
  Directory directory,
  TokenMap tokenMap)
 {
-PlacementForRange.Builder 

Re: [PR] Minor fix to unit test [cassandra-java-driver]

2024-04-23 Thread via GitHub


absurdfarce merged PR #1930:
URL: https://github.com/apache/cassandra-java-driver/pull/1930


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-java-driver) branch 4.x updated: Initial fix to unit tests

2024-04-23 Thread absurdfarce
This is an automated email from the ASF dual-hosted git repository.

absurdfarce pushed a commit to branch 4.x
in repository https://gitbox.apache.org/repos/asf/cassandra-java-driver.git


The following commit(s) were added to refs/heads/4.x by this push:
 new 07265b4a6 Initial fix to unit tests
07265b4a6 is described below

commit 07265b4a6830a47752bf31eb4f631b9917863da2
Author: absurdfarce 
AuthorDate: Tue Apr 23 00:38:48 2024 -0500

Initial fix to unit tests

patch by Bret McGuire; reviewed by Bret McGuire for PR 1930
---
 .../oss/driver/internal/core/session/DefaultSession.java  | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/core/src/main/java/com/datastax/oss/driver/internal/core/session/DefaultSession.java b/core/src/main/java/com/datastax/oss/driver/internal/core/session/DefaultSession.java
index cb1271c9c..6f063ae9a 100644
--- a/core/src/main/java/com/datastax/oss/driver/internal/core/session/DefaultSession.java
+++ b/core/src/main/java/com/datastax/oss/driver/internal/core/session/DefaultSession.java
@@ -39,6 +39,7 @@ import com.datastax.oss.driver.internal.core.metadata.MetadataManager;
 import com.datastax.oss.driver.internal.core.metadata.MetadataManager.RefreshSchemaResult;
 import com.datastax.oss.driver.internal.core.metadata.NodeStateEvent;
 import com.datastax.oss.driver.internal.core.metadata.NodeStateManager;
+import com.datastax.oss.driver.internal.core.metrics.NodeMetricUpdater;
 import com.datastax.oss.driver.internal.core.metrics.SessionMetricUpdater;
 import com.datastax.oss.driver.internal.core.pool.ChannelPool;
 import com.datastax.oss.driver.internal.core.util.Loggers;
@@ -549,10 +550,11 @@ public class DefaultSession implements CqlSession {
 
   // clear metrics to prevent memory leak
   for (Node n : metadataManager.getMetadata().getNodes().values()) {
-((DefaultNode) n).getMetricUpdater().clearMetrics();
+NodeMetricUpdater updater = ((DefaultNode) n).getMetricUpdater();
+if (updater != null) updater.clearMetrics();
   }
 
-  DefaultSession.this.metricUpdater.clearMetrics();
+  if (metricUpdater != null) metricUpdater.clearMetrics();
 
   List<CompletionStage<Void>> childrenCloseStages = new ArrayList<>();
   for (AsyncAutoCloseable closeable : internalComponentsToClose()) {
@@ -575,10 +577,11 @@ public class DefaultSession implements CqlSession {
 
   // clear metrics to prevent memory leak
   for (Node n : metadataManager.getMetadata().getNodes().values()) {
-((DefaultNode) n).getMetricUpdater().clearMetrics();
+NodeMetricUpdater updater = ((DefaultNode) n).getMetricUpdater();
+if (updater != null) updater.clearMetrics();
   }
 
-  DefaultSession.this.metricUpdater.clearMetrics();
+  if (metricUpdater != null) metricUpdater.clearMetrics();
 
   if (closeWasCalled) {
 // onChildrenClosed has already been scheduled


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[PR] Minor fix to unit test [cassandra-java-driver]

2024-04-23 Thread via GitHub


absurdfarce opened a new pull request, #1930:
URL: https://github.com/apache/cassandra-java-driver/pull/1930

   The recent metrics changes to prevent session leakage ([this 
PR](https://github.com/apache/cassandra-java-driver/pull/1916)) introduced a 
small issue in one of the unit tests.  This PR addresses that issue.
   
   A combo branch containing this fix + [the fix for 
CASSANDRA-19292](https://github.com/apache/cassandra-java-driver/pull/1924) 
passed all unit and integration tests in a local run using Cassandra 4.1.
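   
   For reference, the change boils down to a null guard around the metric updaters; a simplified standalone sketch of the pattern (the interface declared here is a stand-in I wrote for this example, not the driver's actual type):
   
   ```java
   // Simplified sketch of the defensive pattern applied by the fix.
   // NodeMetricUpdater stands in for the driver's internal interface.
   interface NodeMetricUpdater { void clearMetrics(); }
   
   public final class MetricsCleanupSketch {
     static void clearSafely(NodeMetricUpdater updater) {
       // The updater may be null in some paths (assumed: e.g. partially
       // initialized sessions in unit tests), so guard before clearing.
       if (updater != null) updater.clearMetrics();
     }
   
     public static void main(String[] args) {
       clearSafely(null); // no-op instead of a NullPointerException
       clearSafely(() -> System.out.println("cleared")); // prints "cleared"
     }
   }
   ```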


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org