[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145048#comment-17145048 ]

Sergio Bossa commented on CASSANDRA-15700:
--

[~aleksey] apologies for this late reply. I've pushed the recommended changes, please have a look when you have a moment.

> Performance regression on internode messaging
> -
>
> Key: CASSANDRA-15700
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Internode
> Reporter: Sergio Bossa
> Assignee: Sergio Bossa
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0-beta
>
> Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, oss40_system.log
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [~jasonstack] and I have been investigating a performance regression affecting 4.0 during a 3-node, RF 3 write throughput test with a timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
> It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> It turns out the two are related, as the [on expired callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462] creates a huge number of strings in the {{id()}} call.
> The next question is what causes all those message expirations; we thoroughly reviewed the internode messaging code, and the only issue we could find so far is related to the "batch pruning" calls [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81] and [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]: it _seems_ too much time is spent on those, causing the event loop to fall behind in processing the rest of the messages, which then end up expired. This is supported by analysis of the collapsed stacks (after fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which spends half of its time pruning. But only a tiny portion of that pruning time is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to understand: what's the reason to prune so often, rather than just when polling messages during delivery? If there's a reason I'm missing, let's talk about how to optimize pruning; otherwise, let's get rid of it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
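The pruning hot path described in the ticket can be pictured with a toy model (a sketch for illustration only: `Message` and `PrunableQueue` here are simplified stand-ins, not Cassandra's actual `PrunableArrayQueue` API). The point it shows is that `prune()` walks the whole queue and hands every expired message to a callback, so if that callback allocates (for example by building an id string for logging, as the real `onExpired` callback did via `id()`), a burst of expirations translates directly into GC pressure:

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.function.Consumer;

// Toy message with an absolute expiry time.
final class Message {
    final long id;
    final long expiresAtNanos;
    Message(long id, long expiresAtNanos) { this.id = id; this.expiresAtNanos = expiresAtNanos; }
}

// Toy prunable queue: prune() drops expired messages and notifies a callback.
final class PrunableQueue {
    private final ArrayDeque<Message> queue = new ArrayDeque<>();

    void add(Message m) { queue.add(m); }

    int size() { return queue.size(); }

    // Removes every expired message, invoking the callback once per message;
    // returns how many were pruned. The callback should stay allocation-free
    // (defer string building), since it may fire for thousands of messages.
    int prune(long nowNanos, Consumer<Message> onExpired) {
        int pruned = 0;
        for (Iterator<Message> it = queue.iterator(); it.hasNext(); ) {
            Message m = it.next();
            if (m.expiresAtNanos <= nowNanos) {
                it.remove();
                onExpired.accept(m);
                pruned++;
            }
        }
        return pruned;
    }
}
```

Note also that this walk is O(queue length) per call, which is why calling it frequently from the event loop, rather than only when polling, shows up so heavily in the sampled stacks.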
[jira] [Commented] (CASSANDRA-15665) StreamManager should clearly differentiate between "initiator" and "receiver" sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133127#comment-17133127 ]

Sergio Bossa commented on CASSANDRA-15665:
--

[~jasonstack], [~blerer], apologies for this late reply. I'm afraid I'll be busy until at least next week. Feel free to go ahead without my review.

> StreamManager should clearly differentiate between "initiator" and "receiver" sessions
> -
>
> Key: CASSANDRA-15665
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15665
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: ZhaoYang
> Priority: Normal
> Fix For: 4.0
>
> {{StreamManager}} currently does a suboptimal job of differentiating between stream sessions (in the form of {{StreamResultFuture}}) which have been either initiated or "received", for the following reasons:
> 1) Naming is IMO confusing: a "receiver" session could actually both send and receive files, so technically an initiator is also a receiver.
> 2) {{StreamManager#findSession()}} assumes we should first look into "initiator" sessions, then into "receiver" ones: this is a dangerous assumption, in particular for test environments where the same process could work as both an initiator and a receiver.
> I would recommend the following changes:
> 1) Rename "receiver" to "follower" everywhere the former is used.
> 2) Introduce a new flag into {{StreamMessageHeader}} to signal whether the message comes from an initiator or follower session, in order to correctly differentiate and look up sessions in {{StreamManager}}.
> While my arguments above might seem trivial, I believe these changes will improve clarity and save us from potential bugs/headaches at testing time, and now that we're revamping streaming for 4.0 is the right time to make them.
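The second recommendation in the ticket (a header flag selecting initiator vs. follower sessions instead of probing both maps in order) could be sketched roughly like this; all names and types here are illustrative, not the real {{StreamManager}}/{{StreamMessageHeader}} API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: sessions we initiated and sessions we follow for a remote initiator
// live in separate maps, and a flag carried by the message header selects
// which map to consult, removing the ambiguous "try initiator first" probe.
final class StreamManagerSketch {
    private final Map<String, Object> initiatorSessions = new ConcurrentHashMap<>();
    private final Map<String, Object> followerSessions = new ConcurrentHashMap<>();

    void registerInitiator(String planId, Object future) { initiatorSessions.put(planId, future); }
    void registerFollower(String planId, Object future)  { followerSessions.put(planId, future); }

    // fromFollower would come from the proposed StreamMessageHeader flag.
    // A message sent by a follower targets our initiator session, and vice versa.
    Object findSession(String planId, boolean fromFollower) {
        return fromFollower ? initiatorSessions.get(planId) : followerSessions.get(planId);
    }
}
```

This matters exactly in the scenario the ticket calls out: a process acting as both initiator and follower for the same plan id would otherwise always resolve to the initiator session.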
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125211#comment-17125211 ]

Sergio Bossa commented on CASSANDRA-15700:
--

[~aleksey] thanks for reviewing: getting back to this after two months, and having to rebuild the whole thing in my mind, made me realize how complex this is and that I should have added a couple more comments, so apologies, and kudos for getting through it all alone :)

That said, to answer your question directly: the race you point out is absolutely correct, and was unfortunately overlooked by me, but there should be an easy fix. Before going into that, I should first clarify why I introduced the new {{nextExpirationDeadline}} variable alongside {{earliestExpiryTime}}. Simply put, if we only track time via {{earliestExpiryTime}} during both {{add}} and {{prune}}, we risk a race where we accumulate an unbounded number of messages until the next expiration, as in the following scenario:

1. {{add}} is called N times, with {{earliestExpiryTime}} set to the minimum time among the added messages; you can't adjust by the current time in this case, otherwise you would never expire (as the time would keep shifting).
2. {{prune}} is called, and {{earliestExpiryTime}} is set to the minimum between the minimum time among the pruned messages and the current expiry time.
3. This means any message that arrived between the start and end of the pruning, whose expiry time was not the minimum but still less than the minimum expiry time among the pruned messages, would be "ignored" and remain in the queue.

At this point you might say it really sounds like the race you discovered, and in a way it is :) Although in the above case the race window would be much larger (the whole {{prune}}).

Anyway, as I said, there should be an easy fix: the deadline should be updated only if it's actually the minimum value (adjusted by the current time), and I've sent a new commit with that fix. I don't see a way to fix this kind of race while keeping a single variable, but let me know if you find one.
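The "update only if it's actually the minimum" fix described above can be sketched as a lower-only compare-and-set loop (a minimal model for illustration, not the actual {{OutboundMessageQueue}} code; names are illustrative): a concurrent {{add}} that publishes an earlier deadline can never be overwritten by a {{prune}} publishing a later one, because a later candidate simply loses the CAS race.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a monotonically-lowering deadline: proposeDeadline() installs the
// candidate only while it is strictly earlier than the current value, so
// racing writers can only ever move the deadline earlier, never later.
final class ExpirationDeadline {
    private final AtomicLong nextExpirationDeadline = new AtomicLong(Long.MAX_VALUE);

    // Returns true if the candidate became the new deadline.
    boolean proposeDeadline(long candidateNanos) {
        long current;
        while (candidateNanos < (current = nextExpirationDeadline.get())) {
            if (nextExpirationDeadline.compareAndSet(current, candidateNanos))
                return true;
        }
        return false; // an earlier (or equal) deadline was already in place
    }

    long get() { return nextExpirationDeadline.get(); }
}
```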
[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17087966#comment-17087966 ]

Sergio Bossa commented on CASSANDRA-15666:
--

Good to merge for me.

> Race condition when completing stream sessions
> -
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: ZhaoYang
> Priority: Normal
> Labels: pull-request-available
> Fix For: 4.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, asynchronously from the IO thread: this opens up race conditions between the sending of the {{PrepareSynAckMessage}} and the call to {{StreamSession#maybeCompleted()}}. I.e., the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice, from both the IO thread and the thread at #1, closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels, after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, which is overall a win anyway.
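The proposed handshake above (only the follower sends {{CompleteMessage}}, only the initiator closes the session after receiving it) can be modeled with a tiny sketch; the names and state fields are hypothetical, not the actual {{StreamSession}} code:

```java
// Sketch: role-asymmetric completion removes the race in which both peers
// try to close the session concurrently. The follower only ever sends
// COMPLETE; the initiator only ever tears the session down on receipt.
enum Role { INITIATOR, FOLLOWER }

final class SessionSketch {
    final Role role;
    boolean allFilesExchanged;
    boolean closed;

    SessionSketch(Role role) { this.role = role; }

    // Called when all local transfers and receives have finished.
    boolean shouldSendComplete() {
        return role == Role.FOLLOWER && allFilesExchanged && !closed;
    }

    // Called when a CompleteMessage arrives from the peer.
    void onCompleteReceived() {
        if (role == Role.INITIATOR)
            closed = true; // only the initiator closes the session and channels
    }
}
```

With this asymmetry there is exactly one sender and one closer per session, so the "send into a closed channel" failure mode described in 4b cannot occur.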
[jira] [Assigned] (CASSANDRA-13606) Improve handling of 2i initialization failures
[ https://issues.apache.org/jira/browse/CASSANDRA-13606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Bossa reassigned CASSANDRA-13606:
--

Assignee: (was: Sergio Bossa)

> Improve handling of 2i initialization failures
> -
>
> Key: CASSANDRA-13606
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13606
> Project: Cassandra
> Issue Type: Improvement
> Components: Feature/2i Index
> Reporter: Sergio Bossa
> Priority: Normal
> Fix For: 4.0
>
> CASSANDRA-10130 fixes the 2i build management, but initialization failures are still not properly handled, most notably because:
> * Initialization failures make the index non-queryable, but it can still be written to.
> * Initialization failures can be recovered from via full rebuilds.
> Both points above are probably suboptimal, because the initialization logic could be more complex than just an index build, hence it shouldn't be made recoverable via a simple rebuild, and a failure could leave the index fully unavailable, not just for reads but for writes as well.
> So, we should better handle initialization failures by:
> * Allowing the index implementation to specify whether it is unavailable for reads, writes, or both.
> * Providing a proper method to recover, distinct from index rebuilds.
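The two proposals above could take roughly this shape; this is a hypothetical contract for illustration, not the real secondary index API, and all names are invented:

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch: after an initialization failure, the index implementation declares
// which operations it can no longer serve, instead of a blanket
// "non-queryable but still writable" state.
enum IndexOperation { READ, WRITE }

final class FailedIndexState {
    private final Set<IndexOperation> unavailable;

    FailedIndexState(Set<IndexOperation> unavailable) { this.unavailable = unavailable; }

    boolean isQueryable() { return !unavailable.contains(IndexOperation.READ); }
    boolean isWritable()  { return !unavailable.contains(IndexOperation.WRITE); }

    // Recovery would be a dedicated hook, distinct from a full rebuild,
    // since initialization may do more than build the index.
}
```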
[jira] [Updated] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Bossa updated CASSANDRA-10130:
--

Reviewers: Paulo Motta, Sergio Bossa (was: Paulo Motta)

> Node failure during 2i update after streaming can have incomplete 2i when restarted
> -
>
> Key: CASSANDRA-10130
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10130
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Coordination
> Reporter: Yuki Morishita
> Assignee: Andres de la Peña
> Priority: Low
> Fix For: 4.0
>
> Since MV/2i updates happen after SSTables are received, node failure during an MV/2i update can leave received SSTables live when restarted, while the MV/2i are only partially up to date.
> We can add some kind of tracking mechanism to automatically rebuild at startup, or at least warn the user when the node restarts.
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083146#comment-17083146 ]

Sergio Bossa commented on CASSANDRA-15667:
--

[~maxtomassi], [~e.dimitrova], thanks for your PRs. I'm +1 on both.

> StreamResultFuture check for completeness is inconsistent, leading to races
> -
>
> Key: CASSANDRA-15667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: Massimiliano Tomassi
> Priority: Normal
> Fix For: 4.0
>
> Attachments: log_bootstrap_resumable
>
> {{StreamResultFuture#maybeComplete()}} uses {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are completed, but then accesses each session's state via {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the former relies on the actual {{StreamSession}} state, while the latter relies on the {{SessionInfo}} state, and the two are concurrently updated with no coordination whatsoever.
> This leads to races, e.g. apparent in some spurious dtest failures, such as {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc [~e.dimitrova].
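The fix direction implied by the ticket, deriving both "all sessions done" and "any session failed" from a single snapshot of per-session state rather than two independently-updated views, can be sketched minimally (illustrative names, not the real {{StreamResultFuture}}/{{StreamCoordinator}} API):

```java
import java.util.List;

// Sketch: one state snapshot answers both questions, so a session can never
// look "complete" by one view while looking "not failed" by another.
enum SessionState { PREPARING, STREAMING, COMPLETE, FAILED }

final class ResultFutureSketch {
    // Decides the overall outcome from a single consistent list of states.
    static String maybeComplete(List<SessionState> states) {
        boolean allDone = states.stream()
                                .allMatch(s -> s == SessionState.COMPLETE || s == SessionState.FAILED);
        if (!allDone)
            return "pending"; // some session is still active; decide later
        boolean anyFailed = states.stream().anyMatch(s -> s == SessionState.FAILED);
        return anyFailed ? "failed" : "success";
    }
}
```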
[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079527#comment-17079527 ]

Sergio Bossa commented on CASSANDRA-15666:
--

[~jasonstack] thanks for the followups, sent another round of comments.
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079258#comment-17079258 ]

Sergio Bossa commented on CASSANDRA-15667:
--

Thanks [~e.dimitrova] for chiming in.

{quote}From what I recall the bootstrap was sometimes completing too fast before the streaming is really interrupted from the byteman code and we didn't really have an instrument to control that.
{quote}

I went through the "resumable bootstrap" test, and unfortunately I don't see how the bootstrap could ever complete before the byteman script is invoked: that script [makes node 1 fail before it starts to stream files|https://github.com/ekaterinadimitrova2/cassandra-dtest/blob/b56887d67c353d6d69cd60cfd74685/byteman/4.0/stream_failure.btm#L10], which means there's no way for node 3 to finish bootstrapping before it has received all files from both nodes, which will never happen due to said script causing node 1 to fail.

So why did the test fail? I believe it's because of this very issue: in other words, node 3 was correctly seeing its streaming session as completed (after node 1 finished streaming with an error) but *not* as failed; this is because the "completed" state is read through the actual session state, while the "failed" state is read through the {{SessionInfo}} state, which is what we're fixing here.

That said, I would propose we still re-introduce the original {{resumable_bootstrap_test}}: resumable bootstrap is an important enough feature to deserve its own test, and that test uses 3 nodes, which increases the chances of detecting errors/races. Thoughts?
[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Bossa updated CASSANDRA-15700:
--

Source Control Link: https://github.com/apache/cassandra/pull/531
[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergio Bossa updated CASSANDRA-15700:
--

Test and Documentation Plan: See attached performance test plots. Also see unit tests.
Status: Patch Available (was: In Progress)
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079208#comment-17079208 ] Sergio Bossa commented on CASSANDRA-15700: -- Added some more thorough unit tests in the meantime. > Performance regression on internode messaging > - > > Key: CASSANDRA-15700 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15700 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Internode >Reporter: Sergio Bossa >Assignee: Sergio Bossa >Priority: Normal > Labels: pull-request-available > Fix For: 4.0-beta > > Attachments: Oss40patchedvsOss311.png, Oss40vsOss311.png, oss40.gc, > oss40_nogc.tar.xz, oss40_system.log > > Time Spent: 10m > Remaining Estimate: 0h > > Me and [~jasonstack] have been investigating a performance regression > affecting 4.0 during a 3 nodes, RF 3 write throughput test with a timeseries > like workload, as shown in this plot, where blue is 3.11 and orange is 4.0: > !Oss40vsOss311.png|width=389,height=214! > It's been a bit of a long investigation, but two clues ended up standing out: > 1) An abnormal number of expired messages on 4.0 (as shown in the attached > system log), while 3.11 has almost none. > 2) An abnormal GC activity (as shown in the attached gc log). > Turns out the two are related, as the [on expired > callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462] > creates a huge amount of strings in the {{id()}} call. 
[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15700: - Authors: Sergio Bossa (was: Sergio Bossa, ZhaoYang)
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079102#comment-17079102 ] Sergio Bossa commented on CASSANDRA-15700: -- Thank you both.
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078510#comment-17078510 ] Sergio Bossa commented on CASSANDRA-15700: -- {quote}Honestly though it sounds like there is not much between our proposals now, which is reassuring, so perhaps we should focus on the common ground we have. {quote} Indeed, I think we agree on the necessity to fix this in the least risky way, which seems to mean keeping the current pruning implementation and accuracy, but avoiding running it *at every single delivery* (although maybe you dispute this being the problem, the collapsed stacks speak soundly about it). In the spirit of that, I've fixed the message queue algorithm to compute the expiration deadline in a way that can be used to actually run the pruning task _only after such deadline_. I'll give it another review on my own and possibly add more unit tests (please note the current implementation seems to have had none at all), but performance tests now look much better (orange is patched 4.0): !Oss40patchedvsOss311.png|width=556,height=299! Here's the branch: [https://github.com/sbtourist/cassandra/commits/CASSANDRA-15700] I understand there's no _panic_ to fix it, and you'll be away for the next couple of weeks, but this means realistically postponing this issue for at least 3 weeks, which will add to the delay we're already accumulating in 4.0 for other reasons, so maybe you could delegate this to [~aleksey] as you mentioned? 
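The fix described above — tracking an expiration deadline so the pruning pass runs only once that deadline has actually elapsed — can be illustrated with a minimal sketch. The types and names here (`DeadlineGatedQueue`, `maybePrune`) are hypothetical; the real change lives in the linked branch:

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Queue;

// Sketch: enqueue only does cheap deadline bookkeeping; the O(n) prune
// pass is skipped entirely until the earliest known deadline has passed.
final class DeadlineGatedQueue {
    private final Queue<long[]> queue = new ArrayDeque<>(); // (expiresAtNanos, messageId)
    private long earliestDeadline = Long.MAX_VALUE;

    void enqueue(long expiresAtNanos, long id) {
        queue.add(new long[]{ expiresAtNanos, id });
        // no pruning on the hot enqueue path, just min-tracking
        earliestDeadline = Math.min(earliestDeadline, expiresAtNanos);
    }

    /** Prunes only if the earliest known deadline has passed; returns the expired count. */
    int maybePrune(long nowNanos) {
        if (nowNanos < earliestDeadline)
            return 0;                      // nothing can be expired yet: skip the pass
        int expired = 0;
        long nextDeadline = Long.MAX_VALUE;
        for (Iterator<long[]> it = queue.iterator(); it.hasNext(); ) {
            long[] m = it.next();
            if (m[0] <= nowNanos) { it.remove(); expired++; }
            else nextDeadline = Math.min(nextDeadline, m[0]);
        }
        earliestDeadline = nextDeadline;   // recompute the gate for the next call
        return expired;
    }
}
```

The point of the gate is that callers can invoke `maybePrune` as often as they like (every delivery run, say) and still pay the O(n) scan only when something could genuinely have expired.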
[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15700: - Attachment: Oss40patchedvsOss311.png
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078042#comment-17078042 ] Sergio Bossa commented on CASSANDRA-15700: -- {quote}This would be a regression from 3.0 unfortunately {quote} Oh that's a good point, forgot we do that in 3.0+ too. We can keep the {{enqueue()}} pruning, as that's not the worst offender (see collapsed stacks). {quote}I assume you mean to maintain a guess of the number of items we _on average_ have in the queue? {quote} No, I meant to compute the "next expire time" as an approximation of the expire time of the processed messages, rather than relying on exactly computing it via the pruner at every event loop delivery run. {quote}One of the features running through this work is that we make strong guarantees, and this would weaken that. I would prefer to be able to stipulate that (e.g.) we never reject an {{enqueue}} if more than 20% of the queue is already expired. {quote} How is that related to what we're discussing? Enqueuing new messages is regulated by memory limiting, so that's what gives us strong guarantees; also, we do already prune the backlog if memory limits are met. {quote}messing with important control flow semantics is something I would prefer to avoid. {quote} I'd still like to understand how the pruning approach we're discussing here is important to the control flow semantics at all, as I don't think I've got a clear answer yet, although it might be me missing the point. What I've heard/understood is: 1) It protects us against saturating memory upon network stalls. 2) It protects us against saturating memory upon too many expired messages. AFAIU, none of those is accurate, as the current implementation doesn't satisfy #1, and #2 is covered by the memory limits implementation. {quote}I think the least risky approach is to change how we compute and select the expiration time we use for triggering an expiration. 
This also has the benefit of maintaining well-defined guarantees, is simple, and modifies no behaviours besides the selection of this value. {quote} This is something that can definitely be tried first, to reduce the amount of pruner runs. I will test this next. {quote}More specifically (e.g.), you pick a time {{t}}, such that you expect the set of messages with expiration less than {{t}}, say {{M(
[jira] [Updated] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15667: - Reviewers: Sergio Bossa, ZhaoYang, Sergio Bossa (was: Sergio Bossa, ZhaoYang) Status: Review In Progress (was: Patch Available) +1 [~maxtomassi] (and [~e.dimitrova]), I think we should reintroduce {{TestBootstrap.resumable_bootstrap_test}} and check if it was actually failing due to this race: what do you think? > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova].
[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077478#comment-17077478 ] Sergio Bossa commented on CASSANDRA-15666: -- {quote}Let's see what [~blerer] has to say {quote} Let's not delay this fix further, unless [~blerer] really wants to chime in? > Race condition when completing stream sessions > -- > > Key: CASSANDRA-15666 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15666 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: ZhaoYang >Priority: Normal > Fix For: 4.0 > > > {{StreamSession#prepareAsync()}} executes, as the name implies, > asynchronously from the IO thread: this opens up for race conditions between > the sending of the {{PrepareSynAckMessage}} and the call to > {{StreamSession#maybeCompleted()}}. I.e., the following could happen: > 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread. > 2) Node B receives it and starts streaming. > 3) Node A receives the streamed file and sends {{ReceivedMessage}}. > 4) At this point, if this was the only file to stream, both nodes are ready > to close the session via {{maybeCompleted()}}, but: > a) Node A will call it twice from both the IO thread and the thread at #1, > closing the session and its channels. > b) Node B will attempt to send a {{CompleteMessage}}, but will fail because > the session has been closed in the meantime. > There are other subtle variations of the pattern above, depending on the > order of concurrently sent/received messages. > I believe the best fix would be to modify the message exchange so that: > 1) Only the "follower" is allowed to send the {{CompleteMessage}}. > 2) Only the "initiator" is allowed to close the session and its channels > after receiving the {{CompleteMessage}}. > By doing so, the message exchange logic would be easier to reason about, > which is overall a win anyway. 
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077469#comment-17077469 ] Sergio Bossa commented on CASSANDRA-15700: -- {quote}So, I'm actually misremembering: though I had thought we scheduled exactly what you suggest {quote} It's ok, I do also forget most stuff past the 24hrs cut-off :) On a more serious note, I get your point about the need to be precise VS amortising costs, but wondering what data we have to help with such decision, if any. We definitely have evidence that the _current_ implementation massively affects performance, and given the need to stabilise the codebase upon the 4.0 release, I would personally opt for a fix which doesn't increase overall complexity, and rather tries to strike a balance between said complexity and desired accuracy and performance, along the lines of: 1) Remove pruning and expire time tracking from enqueue. 2) Remove pruning from the event loop delivery run. 3) During event loop delivery: a) Compute a weighted average of each message expire time. b) If potentially stalling (flushing over the high water mark), schedule a background pruning using the computed expire time. c) If resuming from stalling before the pruning deadline, cancel it (because we'll keep pruning as we process). This should be reasonably easy to implement, have decent accuracy, and perform well, as the expire deadline would be computed on the single threaded event loop (no contention) and the pruner would only run if actually stalling (which should *not* be the case most of the time). How does that sound? Am I missing anything? 
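One way to read step 3a — "compute a weighted average of each message expire time" on the single-threaded event loop — is an exponential moving average of observed expire times, usable as an approximate deadline for the background pruning task in step 3b. This is an assumption about the intended weighting, not the actual patch; `ExpireTimeEstimate` and `ALPHA` are invented for this sketch:

```java
// Sketch of step 3a (hypothetical): an exponentially weighted average of
// observed message expire times, maintained with no contention because it
// is only ever touched from the event loop's delivery thread.
final class ExpireTimeEstimate {
    private static final double ALPHA = 0.2;   // smoothing weight (assumed value)
    private double avgExpiresAt = Double.NaN;  // NaN until the first observation

    void observe(long expiresAtNanos) {
        avgExpiresAt = Double.isNaN(avgExpiresAt)
                     ? expiresAtNanos
                     : ALPHA * expiresAtNanos + (1 - ALPHA) * avgExpiresAt;
    }

    /** Approximate deadline at which a background prune is worth scheduling. */
    long deadlineNanos() {
        return (long) avgExpiresAt;
    }
}
```

Being an approximation, this trades the exact per-message accuracy of the current pruner for a single cheap scalar update per delivered message, which is precisely the accuracy-vs-performance balance the comment argues for.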
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077309#comment-17077309 ] Sergio Bossa commented on CASSANDRA-15700: -- {quote}Delivery will only be scheduled at enqueue if it is not already processing outstanding work, and though iirc the enqueue operation will itself forcefully prune in some circumstances, there are anyway no guarantees that new messages will be arriving to this connection. {quote} Correct. {quote}Delivery will only reschedule itself once the outbound TCP channel has room, and so in a network outage that does not terminate the connection or stalled recipient process (for example), where it is making no (or very slow) progress, it will not be woken up to process further pending messages. {quote} Isn't this a problem regardless, given the current implementation? That is, we currently prune upon:
1) Receiving a new message.
2) Finishing processing messages in an event loop run.
As we said, we have no guarantee we'll keep receiving messages, which rules out #1, and if the event loop is stalling (i.e. due to network issues or whatever), we won't run the message processing loop anyway, which rules out #2 and leaves us with the same problem of leaving expired messages in the queue. If this really is the problem we want to solve (a bit of an edge case, but fine), something as easy as scheduling a background pruning at the earliest expiry date before "stalling" (i.e. before marking the connection as unwritable) would do it, possibly using a hashed wheel timer, because the Netty scheduler is a bit weak at scheduled tasks.
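The "schedule a background pruning at the earliest expiry date before stalling" idea could be sketched like this. Purely illustrative: the class name and the priority-queue bookkeeping are assumptions, not Cassandra code; the returned delay is what would be handed to a one-shot timer before marking the connection unwritable.

```java
import java.util.PriorityQueue;

// Minimal deadline tracker: record the expiry deadline of each queued
// message and, just before stalling, compute how long a one-shot timer
// (e.g. a hashed wheel timer) should wait before running a prune.
public class ExpiryDeadlines
{
    private final PriorityQueue<Long> deadlines = new PriorityQueue<>();

    public void onEnqueue(long expiresAtNanos) { deadlines.add(expiresAtNanos); }

    public void onDequeue(long expiresAtNanos) { deadlines.remove(expiresAtNanos); }

    // Delay until the first queued message expires; 0 if one already has,
    // Long.MAX_VALUE if nothing is queued (no timer needed).
    public long pruneDelayNanos(long nowNanos)
    {
        Long earliest = deadlines.peek();
        if (earliest == null)
            return Long.MAX_VALUE;
        return Math.max(0, earliest - nowNanos);
    }
}
```

The key property is that the timer only exists while the connection is stalled; during normal delivery the tracker is consulted but no task is ever scheduled.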
[jira] [Commented] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077179#comment-17077179 ] Sergio Bossa commented on CASSANDRA-15700: -- {quote}This is the only way (presently) to know what deadline the eventLoop needs to wake up on, to ensure we do not let expiring messages accumulate indefinitely{quote} I'm not sure I follow; it would be great if you could elaborate, maybe with some pointers to the code. AFAIU, the "event loop" (assuming you mean the {{EventLoopDelivery}} class) is potentially scheduled at every new message, then reschedules itself based on the number of pending bytes and the writable state; this means that as long as there are pending bytes, and even if no new messages arrive, it will keep rescheduling itself and getting rid of expired messages along the way, so I don't see how they could accumulate indefinitely, although I might very well be missing something. Or do you just mean such "rescheduling to expire" is simply inefficient?
[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15700: - Authors: Sergio Bossa, ZhaoYang (was: Sergio Bossa)
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15700) Performance regression on internode messaging
[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15700: - Fix Version/s: 4.0
[jira] [Created] (CASSANDRA-15700) Performance regression on internode messaging
Sergio Bossa created CASSANDRA-15700: Summary: Performance regression on internode messaging Key: CASSANDRA-15700 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700 Project: Cassandra Issue Type: Bug Components: Messaging/Internode Reporter: Sergio Bossa Assignee: Sergio Bossa Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, oss40_system.log
[~jasonstack] and I have been investigating a performance regression affecting 4.0 during a 3-node, RF 3 write throughput test with a timeseries-like workload, as shown in this plot, where blue is 3.11 and orange is 4.0: !Oss40vsOss311.png|width=389,height=214!
It's been a bit of a long investigation, but two clues ended up standing out:
1) An abnormal number of expired messages on 4.0 (as shown in the attached system log), while 3.11 has almost none.
2) Abnormal GC activity (as shown in the attached gc log).
It turns out the two are related, as the [on expired callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462] creates a huge amount of strings in the {{id()}} call. The next question is what causes all those message expirations; we thoroughly reviewed the internode messaging code, and the only issue we could find so far is related to the "batch pruning" calls [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81] and [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]: it _seems_ too much time is spent on those, causing the event loop to fall behind in processing the rest of the messages, which then end up expired.
This is supported by the analysis of the collapsed stacks (after fixing the GC issue):
{noformat}
(tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
org/apache/cassandra/net/PrunableArrayQueue:prune 1621
org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
{noformat}
Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which spends half of its time pruning. But only a tiny portion of that pruning time is spent actually expiring:
{noformat}
(tprint (top-aggregated-calls oss40nogc "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
org/apache/cassandra/net/PrunableArrayQueue:prune 1894
org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
org/apache/cassandra/net/OutboundConnection:onExpired 147
{noformat}
And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
{noformat}
(tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
org/apache/cassandra/net/PrunableArrayQueue:prune 1718
org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
java/util/concurrent/ConcurrentHashMap:replaceNode 19
java/util/concurrent/ConcurrentLinkedQueue:offer 16
java/util/concurrent/LinkedBlockingQueue:offer 15
{noformat}
That said, before proceeding with a PR to fix those issues, I'd like to understand: what's the reason to prune so often, rather than just when polling a message during delivery? If there's a reason I'm missing, let's talk about how to optimize pruning; otherwise, let's get rid of it.
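For illustration, the pruning pattern under discussion boils down to something like the following toy queue. This is a simplification, not Cassandra's {{PrunableArrayQueue}}: prune() walks the whole queue and hands each expired message to a callback, so if that callback allocates per message (as the {{id()}} string building above does), every large prune produces a matching burst of garbage.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.function.Consumer;

// Toy model of a prunable outbound queue: each entry carries an expiry
// deadline; prune(now) removes expired entries and reports them to a
// callback. Simplified illustration only.
public class ToyPrunableQueue<T>
{
    public static final class Entry<T>
    {
        final T payload;
        final long expiresAtNanos;
        Entry(T payload, long expiresAtNanos)
        {
            this.payload = payload;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ArrayDeque<Entry<T>> queue = new ArrayDeque<>();

    public void add(T payload, long expiresAtNanos)
    {
        queue.add(new Entry<>(payload, expiresAtNanos));
    }

    // Remove every expired entry, oldest first; returns how many were
    // pruned. If onExpired allocates per message (e.g. formats an id string
    // for logging), the allocation rate scales with the prune size.
    public int prune(long nowNanos, Consumer<T> onExpired)
    {
        int pruned = 0;
        for (Iterator<Entry<T>> it = queue.iterator(); it.hasNext(); )
        {
            Entry<T> e = it.next();
            if (e.expiresAtNanos <= nowNanos)
            {
                it.remove();
                onExpired.accept(e.payload);
                pruned++;
            }
        }
        return pruned;
    }

    public int size() { return queue.size(); }
}
```

This makes the profile above easy to read: the O(n) walk is the {{prune}} self time, while the callback column ({{onExpired}}) is only a small slice of it.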
[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073831#comment-17073831 ] Sergio Bossa commented on CASSANDRA-15666: -- [~jasonstack] your fix looks good, I've just left a few minor comments on the PR. Regarding the changes to the {{CompleteMessage}} exchange, I still think that'd be a win even if the race is fixed in a different way, as the current implementation makes it harder to reason about its correctness (which means it could be prone to other races), but I also understand we want to limit the scope of changes to 4.0, so I'm not pushing hard for it.
> Race condition when completing stream sessions
> --
>
> Key: CASSANDRA-15666
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15666
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: ZhaoYang
> Priority: Normal
> Fix For: 4.0
>
> {{StreamSession#prepareAsync()}} executes, as the name implies, asynchronously from the IO thread: this opens up race conditions between the sending of the {{PrepareSynAckMessage}} and the call to {{StreamSession#maybeCompleted()}}. I.e., the following could happen:
> 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread.
> 2) Node B receives it and starts streaming.
> 3) Node A receives the streamed file and sends {{ReceivedMessage}}.
> 4) At this point, if this was the only file to stream, both nodes are ready to close the session via {{maybeCompleted()}}, but:
> a) Node A will call it twice, from both the IO thread and the thread at #1, closing the session and its channels.
> b) Node B will attempt to send a {{CompleteMessage}}, but will fail because the session has been closed in the meantime.
> There are other subtle variations of the pattern above, depending on the order of concurrently sent/received messages.
> I believe the best fix would be to modify the message exchange so that:
> 1) Only the "follower" is allowed to send the {{CompleteMessage}}.
> 2) Only the "initiator" is allowed to close the session and its channels after receiving the {{CompleteMessage}}.
> By doing so, the message exchange logic would be easier to reason about, which is overall a win anyway.
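The proposed exchange can be modelled as a toy state machine; this is illustrative only (the real {{StreamSession}} has far more states and transitions, and these class and method names are invented): the follower is the only side that may send the {{CompleteMessage}}, and the initiator is the only side that may close, after receiving it.

```java
// Toy model of the proposed completion exchange: follower sends
// CompleteMessage but never closes; initiator closes only after receiving
// the CompleteMessage. Names are hypothetical, not Cassandra's.
public class ToySession
{
    public enum Role { INITIATOR, FOLLOWER }

    private final Role role;
    private boolean sentComplete;
    private boolean closed;

    public ToySession(Role role) { this.role = role; }

    // Follower finished streaming: send CompleteMessage (at most once),
    // but never close the session or its channels itself.
    public boolean trySendComplete()
    {
        if (role != Role.FOLLOWER || sentComplete)
            return false;
        sentComplete = true;
        return true;
    }

    // Initiator received the CompleteMessage: now, and only now, close.
    public boolean onCompleteReceived()
    {
        if (role != Role.INITIATOR || closed)
            return false;
        closed = true;
        return true;
    }

    public boolean isClosed() { return closed; }
}
```

With the send and the close each permitted on exactly one side, the double-{{maybeCompleted()}} race described above cannot close a session out from under a pending {{CompleteMessage}}.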
[jira] [Updated] (CASSANDRA-15666) Race condition when completing stream sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15666: - Reviewers: Benjamin Lerer, Sergio Bossa, Sergio Bossa (was: Benjamin Lerer, Sergio Bossa) Status: Review In Progress (was: Patch Available)
[jira] [Updated] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15667: - Reviewers: Sergio Bossa, ZhaoYang (was: ZhaoYang)
> StreamResultFuture check for completeness is inconsistent, leading to races
> ---
>
> Key: CASSANDRA-15667
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15667
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Streaming and Messaging
> Reporter: Sergio Bossa
> Assignee: Benjamin Lerer
> Priority: Normal
> Fix For: 4.0
>
> {{StreamResultFuture#maybeComplete()}} uses {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are completed, but then accesses each session's state via {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the former relies on the actual {{StreamSession}} state, while the latter relies on the {{SessionInfo}} state, and the two are concurrently updated with no coordination whatsoever.
> This leads to races, e.g. apparent in some spurious dtest failures, such as {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc [~e.dimitrova].
[jira] [Commented] (CASSANDRA-15666) Race condition when completing stream sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068663#comment-17068663 ] Sergio Bossa commented on CASSANDRA-15666: -- bq. it's still possible to send 2 CompleteMessage by follower when maybeComplete() in prepareAsync() is delayed and race with maybeComplete() in taskCompleted(). Correct. I agree the overall thread safety of {{StreamSession}} should be fixed.
[jira] [Updated] (CASSANDRA-15375) Remove BackPressureStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15375: - Reviewers: Sergio Bossa, Sergio Bossa (was: Sergio Bossa) Status: Review In Progress (was: Patch Available) [~benedict], apologies for this late review. I've added some comments to your commit here: https://github.com/belliottsmith/cassandra/commit/3584b4305ab87f836fdc94d4d5b28bd102642ccb
> Remove BackPressureStrategy
> ---
>
> Key: CASSANDRA-15375
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15375
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client, Observability/Logging
> Reporter: Jon Haddad
> Assignee: Jon Haddad
> Priority: Low
>
> This is odd:
> {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - Back-pressure is disabled with strategy org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, flow=FAST}.}}
> When I saw that, I wasn't sure if back pressure was actually disabled, or if I was really using {{RateBasedBackPressure}}.
> This should change to output either:
> {{Back-pressure is disabled}}
> or
> {{Back-pressure is enabled with strategy org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, flow=FAST}.}}
[jira] [Updated] (CASSANDRA-15375) Remove BackPressureStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15375: - Fix Version/s: 4.0 > Remove BackPressureStrategy > --- > > Key: CASSANDRA-15375 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15375 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Low > Fix For: 4.0 > > > This is odd: > {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - > Back-pressure is disabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}} > When I saw that, I wasn't sure if back pressure was actually disabled, or if > I was really using {{RateBasedBackPressure.}} > This should change to output either: > {{Back-pressure is disabled}} > {{or}} > {{Back-pressure is enabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}}{{}} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15375) Remove BackPressureStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15375: - Reviewers: Sergio Bossa > Remove BackPressureStrategy > --- > > Key: CASSANDRA-15375 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15375 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Low > > This is odd: > {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - > Back-pressure is disabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}} > When I saw that, I wasn't sure if back pressure was actually disabled, or if > I was really using {{RateBasedBackPressure.}} > This should change to output either: > {{Back-pressure is disabled}} > {{or}} > {{Back-pressure is enabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}}{{}} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15667: - Fix Version/s: 4.0 > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter relies on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, as apparent in some spurious dtest failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova].
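The remedy implied above is to derive both "any session still active?" and "did anything fail?" from a single state source and a single snapshot, instead of consulting {{StreamSession}} state for one question and {{SessionInfo}} state for the other. A hypothetical sketch (not the actual {{StreamResultFuture}} code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class StreamResultCheck {
    public enum State { STREAMING, COMPLETE, FAILED }

    // One authoritative per-session state, updated by whoever owns the session.
    private final Map<UUID, State> sessions = new ConcurrentHashMap<>();

    public void update(UUID id, State s) { sessions.put(id, s); }

    /** Empty while any session is active; otherwise true iff all completed.
     *  Both questions are answered from the same consistent snapshot. */
    public Optional<Boolean> result() {
        Map<UUID, State> snapshot = new HashMap<>(sessions);
        if (snapshot.containsValue(State.STREAMING))
            return Optional.empty(); // still active: don't decide yet
        return Optional.of(!snapshot.containsValue(State.FAILED));
    }
}
```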
[jira] [Created] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
Sergio Bossa created CASSANDRA-15667: Summary: StreamResultFuture check for completeness is inconsistent, leading to races Key: CASSANDRA-15667 URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 Project: Cassandra Issue Type: Bug Components: Legacy/Streaming and Messaging Reporter: Sergio Bossa {{StreamResultFuture#maybeComplete()}} uses {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are completed, but then accesses each session state via {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the former relies on the actual {{StreamSession}} state, while the latter relies on the {{SessionInfo}} state, and the two are concurrently updated with no coordination whatsoever. This leads to races, as apparent in some spurious dtest failures, such as {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc [~e.dimitrova].
[jira] [Commented] (CASSANDRA-14754) Add verification of state machine in StreamSession
[ https://issues.apache.org/jira/browse/CASSANDRA-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068579#comment-17068579 ] Sergio Bossa commented on CASSANDRA-14754: -- Thanks [~jasonstack]. Also see CASSANDRA-15667. > Add verification of state machine in StreamSession > -- > > Key: CASSANDRA-14754 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14754 > Project: Cassandra > Issue Type: Task > Components: Legacy/Streaming and Messaging >Reporter: Jason Brown >Assignee: ZhaoYang >Priority: Normal > Fix For: 4.x > > > {{StreamSession}} contains an implicit state machine, but we have no > verification of the safety of the transitions between states. For example, we > have no checks to ensure we cannot leave the final states (COMPLETED, FAILED). > I propose we add some program logic in {{StreamSession}}, tests, and > documentation to ensure the correctness of the state transitions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14754) Add verification of state machine in StreamSession
[ https://issues.apache.org/jira/browse/CASSANDRA-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067826#comment-17067826 ] Sergio Bossa commented on CASSANDRA-14754: -- I think just validating state transitions doesn't really add much: for example, moving from {{INITIALIZED}} to {{COMPLETE}} is [valid|https://github.com/apache/cassandra/pull/480/files#diff-b397cc5438b1a4be20f026c9ec9ecd6eR215], but only if there's nothing to actually stream, which is not expressed in the current validation (IIUIC, correct me if I'm wrong). In other words, I think it would be best to implement a proper state machine, but is it really necessary for 4.0? OTOH, I think we should focus on other {{StreamSession}} bugs such as CASSANDRA-15665 and CASSANDRA-15666. Thoughts? > Add verification of state machine in StreamSession > -- > > Key: CASSANDRA-14754 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14754 > Project: Cassandra > Issue Type: Task > Components: Legacy/Streaming and Messaging >Reporter: Jason Brown >Assignee: ZhaoYang >Priority: Normal > Fix For: 4.0 > > > {{StreamSession}} contains an implicit state machine, but we have no > verification of the safety of the transitions between states. For example, we > have no checks to ensure we cannot leave the final states (COMPLETED, FAILED). > I propose we add some program logic in {{StreamSession}}, tests, and > documentation to ensure the correctness of the state transitions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
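The point in the comment above, that a plain transition table cannot express "{{INITIALIZED}} to {{COMPLETE}} is valid only if there is nothing to stream", can be made concrete with a small sketch: a transition table plus an explicit guard. The state set and the {{nothingToStream}} guard are illustrative, not the real {{StreamSession}} states.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class StreamStateMachine {
    public enum State { INITIALIZED, PREPARING, STREAMING, COMPLETE, FAILED }

    private static final Map<State, EnumSet<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.INITIALIZED, EnumSet.of(State.PREPARING, State.COMPLETE, State.FAILED));
        ALLOWED.put(State.PREPARING,   EnumSet.of(State.STREAMING, State.COMPLETE, State.FAILED));
        ALLOWED.put(State.STREAMING,   EnumSet.of(State.COMPLETE, State.FAILED));
        ALLOWED.put(State.COMPLETE,    EnumSet.noneOf(State.class)); // final state: no exits
        ALLOWED.put(State.FAILED,      EnumSet.noneOf(State.class)); // final state: no exits
    }

    private State state = State.INITIALIZED;

    public State state() { return state; }

    public void transition(State next, boolean nothingToStream) {
        if (!ALLOWED.get(state).contains(next))
            throw new IllegalStateException(state + " -> " + next + " is not a legal transition");
        // A pure table cannot express data-dependent rules, hence this guard
        // on top of it: skipping straight to COMPLETE is legal only when
        // nothing was requested to stream.
        if (state == State.INITIALIZED && next == State.COMPLETE && !nothingToStream)
            throw new IllegalStateException("cannot complete before streaming requested files");
        state = next;
    }
}
```

The guard is exactly the part that simple transition validation misses, which is why a "proper state machine" ends up being table plus predicates.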
[jira] [Created] (CASSANDRA-15666) Race condition when completing stream sessions
Sergio Bossa created CASSANDRA-15666: Summary: Race condition when completing stream sessions Key: CASSANDRA-15666 URL: https://issues.apache.org/jira/browse/CASSANDRA-15666 Project: Cassandra Issue Type: Bug Components: Legacy/Streaming and Messaging Reporter: Sergio Bossa {{StreamSession#prepareAsync()}} executes, as the name implies, asynchronously from the IO thread: this opens up for race conditions between the sending of the {{PrepareSynAckMessage}} and the call to {{StreamSession#maybeCompleted()}}. I.e., the following could happen: 1) Node A sends {{PrepareSynAckMessage}} from the {{prepareAsync()}} thread. 2) Node B receives it and starts streaming. 3) Node A receives the streamed file and sends {{ReceivedMessage}}. 4) At this point, if this was the only file to stream, both nodes are ready to close the session via {{maybeCompleted()}}, but: a) Node A will call it twice from both the IO thread and the thread at #1, closing the session and its channels. b) Node B will attempt to send a {{CompleteMessage}}, but will fail because the session has been closed in the meantime. There are other subtle variations of the pattern above, depending on the order of concurrently sent/received messages. I believe the best fix would be to modify the message exchange so that: 1) Only the "follower" is allowed to send the {{CompleteMessage}}. 2) Only the "initiator" is allowed to close the session and its channels after receiving the {{CompleteMessage}}. By doing so, the message exchange logic would be easier to reason about, which is overall a win anyway. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15665) StreamManager should clearly differentiate between "initiator" and "receiver" sessions
[ https://issues.apache.org/jira/browse/CASSANDRA-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-15665: - Fix Version/s: 4.0 > StreamManager should clearly differentiate between "initiator" and "receiver" > sessions > -- > > Key: CASSANDRA-15665 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15665 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Priority: Normal > Fix For: 4.0 > > > {{StreamManager}} currently does a suboptimal job of differentiating between > stream sessions (in the form of {{StreamResultFuture}}) which have been either > initiated or "received", for the following reasons: > 1) Naming is IMO confusing: a "receiver" session could actually both send and > receive files, so technically an initiator is also a receiver. > 2) {{StreamManager#findSession()}} assumes we should first look into > "initiator" sessions, then into "receiver" ones: this is a dangerous > assumption, in particular for test environments where the same process could > work as both an initiator and a receiver. > I would recommend the following changes: > 1) Rename "receiver" to "follower" everywhere the former is used. > 2) Introduce a new flag into {{StreamMessageHeader}} to signal if the message > comes from an initiator or follower session, in order to correctly > differentiate and look for sessions in {{StreamManager}}. > While my arguments above might seem trivial, I believe they will improve > clarity and save us from potential bugs/headaches at testing time, and now > that we're revamping streaming for 4.0 seems the right time to make such changes.
[jira] [Created] (CASSANDRA-15665) StreamManager should clearly differentiate between "initiator" and "receiver" sessions
Sergio Bossa created CASSANDRA-15665: Summary: StreamManager should clearly differentiate between "initiator" and "receiver" sessions Key: CASSANDRA-15665 URL: https://issues.apache.org/jira/browse/CASSANDRA-15665 Project: Cassandra Issue Type: Bug Components: Legacy/Streaming and Messaging Reporter: Sergio Bossa {{StreamManager}} currently does a suboptimal job of differentiating between stream sessions (in the form of {{StreamResultFuture}}) which have been either initiated or "received", for the following reasons: 1) Naming is IMO confusing: a "receiver" session could actually both send and receive files, so technically an initiator is also a receiver. 2) {{StreamManager#findSession()}} assumes we should first look into "initiator" sessions, then into "receiver" ones: this is a dangerous assumption, in particular for test environments where the same process could work as both an initiator and a receiver. I would recommend the following changes: 1) Rename "receiver" to "follower" everywhere the former is used. 2) Introduce a new flag into {{StreamMessageHeader}} to signal if the message comes from an initiator or follower session, in order to correctly differentiate and look for sessions in {{StreamManager}}. While my arguments above might seem trivial, I believe they will improve clarity and save us from potential bugs/headaches at testing time, and now that we're revamping streaming for 4.0 seems the right time to make such changes.
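Point 2 above could look roughly like the following sketch, where a header flag makes session lookup deterministic instead of "try initiator first, then receiver". The flag and map names are hypothetical, not the real {{StreamMessageHeader}} fields.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class SessionRegistry {
    /** Header carrying the proposed flag; field names are illustrative. */
    public static class Header {
        public final UUID planId;
        public final boolean sentByInitiator;
        public Header(UUID planId, boolean sentByInitiator) {
            this.planId = planId;
            this.sentByInitiator = sentByInitiator;
        }
    }

    // Sessions tracked separately by role (String stands in for the session).
    private final Map<UUID, String> initiatorSessions = new ConcurrentHashMap<>();
    private final Map<UUID, String> followerSessions = new ConcurrentHashMap<>();

    public void registerInitiator(UUID planId, String session) { initiatorSessions.put(planId, session); }
    public void registerFollower(UUID planId, String session) { followerSessions.put(planId, session); }

    /** A message sent by the remote initiator must be handled by our follower
     *  session (and vice versa), so the flag selects exactly one map; even a
     *  process acting as both roles for the same plan resolves unambiguously. */
    public String findSession(Header header) {
        return (header.sentByInitiator ? followerSessions : initiatorSessions).get(header.planId);
    }
}
```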
[jira] [Commented] (CASSANDRA-15375) Remove BackPressureStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051099#comment-17051099 ] Sergio Bossa commented on CASSANDRA-15375: -- bq. I'd like to disclaim any desire to discredit the original work Of course, and as I said it was experimental and meant to evolve once widely used, which never happened. Do you want me to review your patch? > Remove BackPressureStrategy > --- > > Key: CASSANDRA-15375 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15375 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Low > > This is odd: > {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - > Back-pressure is disabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}} > When I saw that, I wasn't sure if back pressure was actually disabled, or if > I was really using {{RateBasedBackPressure.}} > This should change to output either: > {{Back-pressure is disabled}} > {{or}} > {{Back-pressure is enabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}}{{}} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15375) Remove BackPressureStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050482#comment-17050482 ] Sergio Bossa commented on CASSANDRA-15375: -- As the original author of that implementation, I regard it as experimental and never proved to be widely used, so I agree with removing it. Dead code is bad code (regardless of its actual quality or past merits). > Remove BackPressureStrategy > --- > > Key: CASSANDRA-15375 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15375 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Low > > This is odd: > {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - > Back-pressure is disabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}} > When I saw that, I wasn't sure if back pressure was actually disabled, or if > I was really using {{RateBasedBackPressure.}} > This should change to output either: > {{Back-pressure is disabled}} > {{or}} > {{Back-pressure is enabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}}{{}} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15375) Remove BackPressureStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050482#comment-17050482 ] Sergio Bossa edited comment on CASSANDRA-15375 at 3/3/20 7:01 PM: -- As the original author of that implementation, I regard it as experimental and never proven to be widely used, so I agree with removing it. Dead code is bad code (regardless of its actual quality or past merits). was (Author: sbtourist): As the original author of that implementation, I regard it as experimental and never proved to be widely used, so I agree with removing it. Dead code is bad code (regardless of its actual quality or past merits). > Remove BackPressureStrategy > --- > > Key: CASSANDRA-15375 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15375 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client, Observability/Logging >Reporter: Jon Haddad >Assignee: Jon Haddad >Priority: Low > > This is odd: > {{INFO [main] 2019-10-25 10:33:07,985 DatabaseDescriptor.java:803 - > Back-pressure is disabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}} > When I saw that, I wasn't sure if back pressure was actually disabled, or if > I was really using {{RateBasedBackPressure.}} > This should change to output either: > {{Back-pressure is disabled}} > {{or}} > {{Back-pressure is enabled with strategy > org.apache.cassandra.net.RateBasedBackPressure\{high_ratio=0.9, factor=5, > flow=FAST}.}}{{}} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-14878) Race condition when setting bootstrap flags
Sergio Bossa created CASSANDRA-14878: Summary: Race condition when setting bootstrap flags Key: CASSANDRA-14878 URL: https://issues.apache.org/jira/browse/CASSANDRA-14878 Project: Cassandra Issue Type: Bug Reporter: Sergio Bossa Assignee: Sergio Bossa Fix For: 3.0.x, 3.11.x, 4.x {{StorageService#bootstrap()}} is supposed to wait for bootstrap to finish, but Guava calls the future listeners [after|https://github.com/google/guava/blob/ec2dedebfa359991cbcc8750dc62003be63ec6d3/guava/src/com/google/common/util/concurrent/AbstractFuture.java#L890] unparking its waiters, which causes a race on when the {{bootstrapFinished()}} will be executed, making it non-deterministic. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
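The general remedy for this class of bug is to block on something the listener itself releases, rather than on the future, since a waiter can be unparked before the listeners run. A sketch with {{java.util.concurrent}} (this models the pattern, it is not the actual patch and not Guava's API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

public class ListenerRace {
    /** Returns the future's value only after finishWork (the listener, e.g. a
     *  bootstrapFinished()-style callback) has actually executed. Waiting on
     *  future.get() alone would not give that guarantee. */
    public static int waitForListener(CompletableFuture<Integer> future, Runnable finishWork)
            throws InterruptedException {
        CountDownLatch listenerRan = new CountDownLatch(1);
        future.whenComplete((value, error) -> {
            finishWork.run();        // the work the listener must do
            listenerRan.countDown(); // only now may the waiter proceed
        });
        listenerRan.await();         // deterministic: implies the listener ran
        return future.join();        // cannot block: latch implies completion
    }
}
```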
[jira] [Updated] (CASSANDRA-8272) 2ndary indexes can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-8272: Reviewer: (was: Sergio Bossa) > 2ndary indexes can return stale data > > > Key: CASSANDRA-8272 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8272 > Project: Cassandra > Issue Type: Bug >Reporter: Sylvain Lebresne >Assignee: Andrés de la Peña >Priority: Major > Fix For: 3.0.x > > > When replica return 2ndary index results, it's possible for a single replica > to return a stale result and that result will be sent back to the user, > potentially failing the CL contract. > For instance, consider 3 replicas A, B and C, and the following situation: > {noformat} > CREATE TABLE test (k int PRIMARY KEY, v text); > CREATE INDEX ON test(v); > INSERT INTO test(k, v) VALUES (0, 'foo'); > {noformat} > with every replica up to date. Now, suppose that the following queries are > done at {{QUORUM}}: > {noformat} > UPDATE test SET v = 'bar' WHERE k = 0; > SELECT * FROM test WHERE v = 'foo'; > {noformat} > then, if A and B acknowledge the insert but C respond to the read before > having applied the insert, then the now stale result will be returned (since > C will return it and A or B will return nothing). > A potential solution would be that when we read a tombstone in the index (and > provided we make the index inherit the gcGrace of it's parent CF), instead of > skipping that tombstone, we'd insert in the result a corresponding range > tombstone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
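A toy last-write-wins reconciliation shows why the stale result survives the merge, and why the proposed tombstone-based fix works: a replica that returns nothing carries no timestamp, so it cannot veto a stale row, whereas a tombstone with a newer timestamp can. This is an illustration only, not Cassandra's actual read path.

```java
public class IndexReconcile {
    /** One replica's answer for an indexed lookup. value == null models a
     *  (range) tombstone at that timestamp; a wholly absent answer is passed
     *  to reconcile() as a null Answer. */
    public static class Answer {
        public final String value;
        public final long timestamp;
        public Answer(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    /** Last-write-wins merge of replica answers. */
    public static String reconcile(Answer... answers) {
        Answer winner = null;
        for (Answer a : answers) {
            if (a == null) continue; // "no result": no information, vetoes nothing
            if (winner == null || a.timestamp > winner.timestamp) winner = a;
        }
        return winner == null ? null : winner.value; // null too when a tombstone wins
    }
}
```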
[jira] [Commented] (CASSANDRA-14507) OutboundMessagingConnection backlog is not fully written in case of race conditions
[ https://issues.apache.org/jira/browse/CASSANDRA-14507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506886#comment-16506886 ] Sergio Bossa commented on CASSANDRA-14507: -- bq. Wouldn't they be picked up by the MessageOutHandler::channelWritabilityChanged and then get drained? That is assuming the writability changes before the timeout window, which might very well not be the case, unless I'm misunderstanding your question? Also, {{channelWritabilityChanged()}} is race-prone by itself as mentioned above. > OutboundMessagingConnection backlog is not fully written in case of race > conditions > --- > > Key: CASSANDRA-14507 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14507 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging >Reporter: Sergio Bossa >Priority: Major > > The {{OutboundMessagingConnection}} writes into a backlog queue before the > connection handshake is successfully completed, and then writes such backlog > to the channel as soon as the successful handshake moves the channel state to > {{READY}}. > This is unfortunately race prone, as the following could happen: > 1) One or more writer threads see the channel state as {{NOT_READY}} in > {{#sendMessage()}} and are about to enqueue to the backlog, but they get > descheduled by the OS. > 2) The handshake thread is scheduled by the OS and moves the channel state to > {{READY}}, emptying the backlog. > 3) The writer threads are scheduled back and add to the backlog, but the > channel state is {{READY}} at this point, so those writes would sit in the > backlog and expire. 
> Please note a similar race condition exists between > {{OutboundMessagingConnection#sendMessage()}} and > {{MessageOutHandler#channelWritabilityChanged()}}, which is way more serious > as the channel writability could frequently change, luckily it looks like > {{ChannelWriter#write()}} never gets invoked with {{checkWritability}} at > {{true}} (so writes never go to the backlog when the channel is not writable). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
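A standard fix for step 3 of the race described above is a double-check: after enqueueing to the backlog, re-read the state and drain again, so a message added just after the {{READY}} transition cannot be stranded. A hypothetical single-threaded sketch of that pattern (names are illustrative, and a plain queue stands in for the Netty channel):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

public class OutboundBacklog {
    public enum State { NOT_READY, READY }

    private final AtomicReference<State> state = new AtomicReference<>(State.NOT_READY);
    private final Queue<String> backlog = new ConcurrentLinkedQueue<>();
    private final Queue<String> channel = new ConcurrentLinkedQueue<>(); // stands in for the channel

    public void sendMessage(String msg) {
        if (state.get() == State.READY) { channel.add(msg); return; }
        backlog.add(msg);
        // Double-check: the handshake thread may have flipped to READY (and
        // drained the backlog) between our state check and our add, in which
        // case we drain again ourselves instead of stranding the message.
        if (state.get() == State.READY)
            drainBacklog();
    }

    public void handshakeComplete() {
        state.set(State.READY);
        drainBacklog();
    }

    private void drainBacklog() {
        for (String m; (m = backlog.poll()) != null; )
            channel.add(m); // ConcurrentLinkedQueue makes concurrent drains safe
    }

    public int delivered() { return channel.size(); }
    public int stranded() { return backlog.size(); }
}
```

Under this ordering, whichever thread observes {{READY}} last performs the final drain, so every enqueued message is eventually written.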
[jira] [Created] (CASSANDRA-14507) OutboundMessagingConnection backlog is not fully written in case of race conditions
Sergio Bossa created CASSANDRA-14507: Summary: OutboundMessagingConnection backlog is not fully written in case of race conditions Key: CASSANDRA-14507 URL: https://issues.apache.org/jira/browse/CASSANDRA-14507 Project: Cassandra Issue Type: Bug Components: Streaming and Messaging Reporter: Sergio Bossa The {{OutboundMessagingConnection}} writes into a backlog queue before the connection handshake is successfully completed, and then writes such backlog to the channel as soon as the successful handshake moves the channel state to {{READY}}. This is unfortunately race prone, as the following could happen: 1) One or more writer threads see the channel state as {{NOT_READY}} in {{#sendMessage()}} and are about to enqueue to the backlog, but they get descheduled by the OS. 2) The handshake thread is scheduled by the OS and moves the channel state to {{READY}}, emptying the backlog. 3) The writer threads are scheduled back and add to the backlog, but the channel state is {{READY}} at this point, so those writes would sit in the backlog and expire. Please note a similar race condition exists between {{OutboundMessagingConnection#sendMessage()}} and {{MessageOutHandler#channelWritabilityChanged()}}, which is way more serious as the channel writability could frequently change, luckily it looks like {{ChannelWriter#write()}} never gets invoked with {{checkWritability}} at {{true}} (so writes never go to the backlog when the channel is not writable). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-14503) Internode connection management is race-prone
Sergio Bossa created CASSANDRA-14503: Summary: Internode connection management is race-prone Key: CASSANDRA-14503 URL: https://issues.apache.org/jira/browse/CASSANDRA-14503 Project: Cassandra Issue Type: Bug Components: Streaming and Messaging Reporter: Sergio Bossa Following CASSANDRA-8457, internode connection management has been rewritten to rely on Netty, but the new implementation in {{OutboundMessagingConnection}} seems quite race prone to me, in particular on those two cases: * {{#finishHandshake()}} racing with {{#close()}}: i.e. in such case the former could run into an NPE if the latter nulls the {{channelWriter}} (but this is just an example, other conflicts might happen). * Connection timeout and retry racing with state changing methods: {{connectionRetryFuture}} and {{connectionTimeoutFuture}} are cancelled when handshaking or closing, but there's no guarantee those will be actually cancelled (as they might be already running), so they might end up changing the connection state concurrently with other methods (i.e. by unexpectedly closing the channel or clearing the backlog). Overall, the thread safety of {{OutboundMessagingConnection}} is very difficult to assess given the current implementation: I would suggest to refactor it into a single-thread model, where all connection state changing actions are enqueued on a single threaded scheduler, so that state transitions can be clearly defined and checked. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
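The single-thread model suggested above can be sketched as follows: every state-changing action is submitted to one serial executor, so {{finishHandshake()}} and {{close()}} serialize instead of racing, and no nulled field can be observed mid-transition. Names are illustrative; for demo simplicity the methods block on the result, where real code would return the future.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SerialConnectionState {
    public enum State { CREATED, HANDSHAKING, READY, CLOSED }

    // All state transitions run here, one at a time.
    private final ExecutorService stateExecutor = Executors.newSingleThreadExecutor();
    private State state = State.CREATED; // confined to stateExecutor: no locks needed

    private State call(Callable<State> transition) {
        try { return stateExecutor.submit(transition).get(); }
        catch (Exception e) { throw new RuntimeException(e); }
    }

    /** Runs after close() or before it, never concurrently with it. */
    public State finishHandshake() {
        return call(() -> {
            if (state == State.CLOSED)
                return state; // close already happened: observe it and bail out
            return state = State.READY;
        });
    }

    public State close() {
        return call(() -> state = State.CLOSED);
    }

    public void shutdown() { stateExecutor.shutdown(); }
}
```

Cancellation of timeout/retry tasks gets the same treatment: since the tasks themselves run on the serial executor, "already running concurrently with another transition" is impossible by construction.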
[jira] [Updated] (CASSANDRA-13798) Disallow filtering on non-primary-key base column for MV
[ https://issues.apache.org/jira/browse/CASSANDRA-13798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13798: - Reviewer: Paulo Motta > Disallow filtering on non-primary-key base column for MV > > > Key: CASSANDRA-13798 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13798 > Project: Cassandra > Issue Type: Bug > Components: Materialized Views >Reporter: ZhaoYang >Assignee: ZhaoYang > > We should probably consider disallow filtering conditions on non-primary-key > base column for Materialized View which is introduced in CASSANDRA-10368. > The main problem is that the liveness of view row is now depending on > multiple base columns (multiple filtered non-pk base column + base column > used in view pk) and this semantic could not be properly support without > drastic storage format changes. (SEE CASSANDRA-11500, > [background|https://issues.apache.org/jira/browse/CASSANDRA-11500?focusedCommentId=16119823&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16119823]) > We should step back and re-consider the non-primary-key filtering feature > together with supporting multiple non-PK cols in MV clustering key in > CASSANDRA-10226. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13640) CQLSH error when using 'login' to switch users
[ https://issues.apache.org/jira/browse/CASSANDRA-13640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13640: - Reviewer: ZhaoYang > CQLSH error when using 'login' to switch users > -- > > Key: CASSANDRA-13640 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13640 > Project: Cassandra > Issue Type: Bug > Components: CQL >Reporter: Andrés de la Peña >Assignee: Andrés de la Peña >Priority: Minor > Fix For: 3.0.x > > > Using {{PasswordAuthenticator}} and {{CassandraAuthorizer}}: > {code} > bin/cqlsh -u cassandra -p cassandra > Connected to Test Cluster at 127.0.0.1:9042. > [cqlsh 5.0.1 | Cassandra 3.0.14-SNAPSHOT | CQL spec 3.4.0 | Native protocol > v4] > Use HELP for help. > cassandra@cqlsh> create role super with superuser = true and password = 'p' > and login = true; > cassandra@cqlsh> login super; > Password: > super@cqlsh> list roles; > 'Row' object has no attribute 'values' > {code} > When we initialize the Shell, we configure certain settings on the session > object such as > {code} > self.session.default_timeout = request_timeout > self.session.row_factory = ordered_dict_factory > self.session.default_consistency_level = cassandra.ConsistencyLevel.ONE > {code} > However, once we perform a LOGIN cmd, which calls do_login(..), we create a > new cluster/session object but actually never set those settings on the new > session. > It isn't failing on 3.x. > As a workaround, it is possible to logout and log back in and things work > correctly. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13464) Failed to create Materialized view with a specific token range
[ https://issues.apache.org/jira/browse/CASSANDRA-13464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13464: - Reviewer: ZhaoYang > Failed to create Materialized view with a specific token range > -- > > Key: CASSANDRA-13464 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13464 > Project: Cassandra > Issue Type: Improvement >Reporter: Natsumi Kojima >Assignee: Krishna Dattu Koneru >Priority: Minor > Labels: materializedviews > > Failed to create Materialized view with a specific token range. > Example : > {code:java} > $ ccm create "MaterializedView" -v 3.0.13 > $ ccm populate -n 3 > $ ccm start > $ ccm status > Cluster: 'MaterializedView' > --- > node1: UP > node3: UP > node2: UP > $ccm node1 cqlsh > Connected to MaterializedView at 127.0.0.1:9042. > [cqlsh 5.0.1 | Cassandra 3.0.13 | CQL spec 3.4.0 | Native protocol v4] > Use HELP for help. > cqlsh> CREATE KEYSPACE test WITH replication = {'class':'SimpleStrategy', > 'replication_factor':3}; > cqlsh> CREATE TABLE test.test ( id text PRIMARY KEY , value1 text , value2 > text, value3 text); > $ccm node1 ring test > Datacenter: datacenter1 > == > AddressRackStatus State LoadOwns > Token > > 3074457345618258602 > 127.0.0.1 rack1 Up Normal 64.86 KB100.00% > -9223372036854775808 > 127.0.0.2 rack1 Up Normal 86.49 KB100.00% > -3074457345618258603 > 127.0.0.3 rack1 Up Normal 89.04 KB100.00% > 3074457345618258602 > $ ccm node1 cqlsh > cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('aaa', > 'aaa', 'aaa' ,'aaa'); > cqlsh> INSERT INTO test.test (id, value1 , value2, value3 ) VALUES ('bbb', > 'bbb', 'bbb' ,'bbb'); > cqlsh> SELECT token(id),id,value1 FROM test.test; > system.token(id) | id | value1 > --+-+ > -4737872923231490581 | aaa |aaa > -3071845237020185195 | bbb |bbb > (2 rows) > cqlsh> CREATE MATERIALIZED VIEW test.test_view AS SELECT value1, id FROM > test.test WHERE id IS NOT NULL AND value1 IS NOT NULL AND TOKEN(id) > > 
-9223372036854775808 AND TOKEN(id) < -3074457345618258603 PRIMARY KEY(value1, > id) WITH CLUSTERING ORDER BY (id ASC); > ServerError: java.lang.ClassCastException: > org.apache.cassandra.cql3.TokenRelation cannot be cast to > org.apache.cassandra.cql3.SingleColumnRelation > {code} > Stacktrace : > {code:java} > INFO [MigrationStage:1] 2017-04-19 18:32:48,131 ColumnFamilyStore.java:389 - > Initializing test.test > WARN [SharedPool-Worker-1] 2017-04-19 18:44:07,263 FBUtilities.java:337 - > Trigger directory doesn't exist, please create it and try again. > ERROR [SharedPool-Worker-1] 2017-04-19 18:46:10,072 QueryMessage.java:128 - > Unexpected error during query > java.lang.ClassCastException: org.apache.cassandra.cql3.TokenRelation cannot > be cast to org.apache.cassandra.cql3.SingleColumnRelation > at > org.apache.cassandra.db.view.View.relationsToWhereClause(View.java:275) > ~[apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.cql3.statements.CreateViewStatement.announceMigration(CreateViewStatement.java:219) > ~[apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.cql3.statements.SchemaAlteringStatement.execute(SchemaAlteringStatement.java:93) > ~[apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:206) > ~[apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:237) > ~[apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:222) > ~[apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115) > ~[apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513) > [apache-cassandra-3.0.13.jar:3.0.13] > at > org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407) > [apache-cassandra-3.0.13.jar:3.0.13] > at > 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > [netty-all-4.0.44.Final.jar:4.0.44.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) > [netty-all-4.0.44.Final.jar:4.0.44.Final] > at > io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) > [netty
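The {{ClassCastException}} above comes from {{View.relationsToWhereClause}} casting every relation to {{SingleColumnRelation}}, which a {{TOKEN(...)}} restriction is not. A minimal, self-contained sketch of the defensive direction a fix could take (all types below are toy stand-ins, not the real {{org.apache.cassandra.cql3}} classes): check the relation type first and surface a meaningful request error instead of a server-side cast failure. Actually supporting token restrictions would need a dedicated rendering path.

```java
import java.util.List;

// Toy stand-ins for the CQL relation hierarchy, for illustration only.
class RelationGuard {
    interface Relation {}
    static class SingleColumnRelation implements Relation {
        final String column;
        SingleColumnRelation(String column) { this.column = column; }
    }
    static class TokenRelation implements Relation {}

    // The 3.0.13 behaviour blind-casts each relation; guarding the type
    // first turns the ClassCastException into a proper request error.
    static String toWhereClause(List<Relation> relations) {
        StringBuilder sb = new StringBuilder();
        for (Relation r : relations) {
            if (!(r instanceof SingleColumnRelation)) // guard instead of blind cast
                throw new IllegalArgumentException("token restrictions are not supported in a view's WHERE clause");
            sb.append(((SingleColumnRelation) r).column).append(" IS NOT NULL ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toWhereClause(List.of(new SingleColumnRelation("id"))));
        try {
            toWhereClause(List.of(new SingleColumnRelation("id"), new TokenRelation()));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```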
[jira] [Updated] (CASSANDRA-13737) Node start can fail if the base table of a materialized view is not found
[ https://issues.apache.org/jira/browse/CASSANDRA-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13737: - Reviewer: T Jake Luciani > Node start can fail if the base table of a materialized view is not found > - > > Key: CASSANDRA-13737 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13737 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata, Materialized Views >Reporter: Andrés de la Peña >Assignee: Andrés de la Peña > Fix For: 3.0.x, 3.11.x, 4.x > > > Node start can fail if the base table of a materialized view is not found, > which is something that can happen under certain circumstances. There is a > dtest reproducing the problem: > {code} > cluster = self.cluster > cluster.populate(3) > cluster.start() > node1, node2, node3 = self.cluster.nodelist() > session = self.patient_cql_connection(node1, > consistency_level=ConsistencyLevel.QUORUM) > create_ks(session, 'ks', 3) > session.execute('CREATE TABLE users (username varchar PRIMARY KEY, state > varchar)') > node3.stop(wait_other_notice=True) > # create a materialized view only in nodes 1 and 2 > session.execute(('CREATE MATERIALIZED VIEW users_by_state AS ' > 'SELECT * FROM users WHERE state IS NOT NULL AND username IS > NOT NULL ' > 'PRIMARY KEY (state, username)')) > node1.stop(wait_other_notice=True) > node2.stop(wait_other_notice=True) > # drop the base table only in node 3 > node3.start(wait_for_binary_proto=True) > session = self.patient_cql_connection(node3, > consistency_level=ConsistencyLevel.QUORUM) > session.execute('DROP TABLE ks.users') > cluster.stop() > cluster.start() # Fails > {code} > This is the error during node start: > {code} > java.lang.IllegalArgumentException: Unknown CF > 958ebc30-76e4-11e7-869a-9d8367a71c76 > at > org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:215) > ~[main/:na] > at > org.apache.cassandra.db.view.ViewManager.addView(ViewManager.java:143) > ~[main/:na] > at > 
org.apache.cassandra.db.view.ViewManager.reload(ViewManager.java:113) > ~[main/:na] > at org.apache.cassandra.schema.Schema.alterKeyspace(Schema.java:618) > ~[main/:na] > at org.apache.cassandra.schema.Schema.lambda$merge$18(Schema.java:591) > ~[main/:na] > at > java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet.lambda$entryConsumer$0(Collections.java:1575) > ~[na:1.8.0_131] > at java.util.HashMap$EntrySet.forEach(HashMap.java:1043) ~[na:1.8.0_131] > at > java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet.forEach(Collections.java:1580) > ~[na:1.8.0_131] > at org.apache.cassandra.schema.Schema.merge(Schema.java:591) ~[main/:na] > at > org.apache.cassandra.schema.Schema.mergeAndAnnounceVersion(Schema.java:564) > ~[main/:na] > at > org.apache.cassandra.schema.MigrationTask$1.response(MigrationTask.java:89) > ~[main/:na] > at > org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53) > ~[main/:na] > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72) > ~[main/:na] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_131] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[na:1.8.0_131] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_131] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_131] > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > [main/:na] > at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-12952) AlterTableStatement propagates base table and affected MV changes inconsistently
[ https://issues.apache.org/jira/browse/CASSANDRA-12952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-12952: - Reviewer: ZhaoYang > AlterTableStatement propagates base table and affected MV changes > inconsistently > > > Key: CASSANDRA-12952 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12952 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata, Materialized Views >Reporter: Aleksey Yeschenko >Assignee: Andrés de la Peña > Fix For: 3.0.x, 3.11.x, 4.x > > > In {{AlterTableStatement}}, when renaming columns or changing their types, we > also keep track of all affected MVs - ones that also need column renames or > type changes. Then in the end we announce the migration for the table change, > and afterwards, separately, one for each affected MV. > This creates a window in which view definitions and base table definition are > not in sync with each other. If a node fails in between receiving those > pushes, it's likely to have startup issues. > The fix is trivial: table change and affected MV change should be pushed as a > single schema mutation.
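The proposed fix — pushing the base-table change and all affected view changes as one schema mutation — can be modelled in a few lines. The {{announce}} method and the change strings below are purely illustrative, not the real migration API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model contrasting N separate schema pushes (the buggy behaviour)
// with a single combined push (the proposed fix). Each announce() call
// represents one migration a peer could observe in isolation.
class AtomicSchemaPush {
    static final List<List<String>> pushes = new ArrayList<>();

    static void announce(List<String> changes) {
        pushes.add(changes);
    }

    public static void main(String[] args) {
        List<String> changes = List.of(
            "ALTER TABLE base RENAME c TO d",
            "ALTER VIEW mv1 RENAME c TO d",
            "ALTER VIEW mv2 RENAME c TO d");

        // Buggy: three separate announcements leave two windows where a
        // view definition disagrees with its base table.
        for (String c : changes)
            announce(List.of(c));
        System.out.println("separate pushes: " + pushes.size()); // 3

        // Fixed: one announcement carrying everything -> no window.
        pushes.clear();
        announce(changes);
        System.out.println("single push: " + pushes.size()); // 1
    }
}
```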
[jira] [Updated] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13725: - Reviewer: Andrés de la Peña > Secondary indexes are always rebuilt at startup > --- > > Key: CASSANDRA-13725 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13725 > Project: Cassandra > Issue Type: Bug > Components: Secondary Indexes >Reporter: Sergio Bossa >Assignee: Sergio Bossa >Priority: Critical > Fix For: 4.0 > > > Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be > rebuilt at startup, even if such index is already built.
[jira] [Updated] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13725: - Status: Patch Available (was: Open) > Secondary indexes are always rebuilt at startup > --- > > Key: CASSANDRA-13725 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13725 > Project: Cassandra > Issue Type: Bug > Components: Secondary Indexes >Reporter: Sergio Bossa >Assignee: Sergio Bossa >Priority: Critical > Fix For: 4.0 > > > Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be > rebuilt at startup, even if such index is already built.
[jira] [Updated] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup
[ https://issues.apache.org/jira/browse/CASSANDRA-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13725: - This is caused by calling {{SIM#markIndexesBuilding}} when creating the index during column family initialization, which marks the index as "not built" and causes the index initialization task to rebuild it. Given there's no need to mark the index when a new column family is created (as the index will be "not built" by definition and there can't be any concurrent indexing), we can just pass a boolean up to {{createIndex()}} to distinguish between index creation at different times, i.e. when a column family is [created|https://github.com/sbtourist/cassandra/blob/CASSANDRA-13725/src/java/org/apache/cassandra/db/Keyspace.java#L394] or [reloaded|https://github.com/sbtourist/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L129]. Such a solution is implemented in the following patch, with a new dtest verifying it: |[trunk|https://github.com/apache/cassandra/pull/135]|[dtest|https://github.com/apache/cassandra-dtest/pull/2]| Test runs are in progress on our internal CI and I will report results as soon as they're ready. > Secondary indexes are always rebuilt at startup > --- > > Key: CASSANDRA-13725 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13725 > Project: Cassandra > Issue Type: Bug > Components: Secondary Indexes >Reporter: Sergio Bossa >Assignee: Sergio Bossa >Priority: Critical > Fix For: 4.0 > > > Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be > rebuilt at startup, even if such index is already built.
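The fix described above can be reduced to a minimal model. Note this is a sketch of the idea only — the class and method names below are illustrative, not the real {{SecondaryIndexManager}} API: marking an index as "building" while its brand-new table is being created is what forced the spurious rebuild at the next startup, so the creation path skips the mark.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the boolean-flag fix: only reloads of an existing table
// mark the index "not built" (to guard a real concurrent build); creating
// an index together with a new table leaves no mark, so restart does not
// trigger a pointless full rebuild.
class IndexLifecycle {
    private final Set<String> notBuilt = new HashSet<>();

    void createIndex(String index, boolean isNewTable) {
        if (!isNewTable)
            notBuilt.add(index); // existing table: rebuild really may be needed
        // new table: nothing to build yet, no mark recorded
    }

    boolean willRebuildAtStartup(String index) {
        return notBuilt.contains(index);
    }

    public static void main(String[] args) {
        IndexLifecycle m = new IndexLifecycle();
        m.createIndex("idx_new", true);     // index created with a new table
        m.createIndex("idx_reload", false); // index seen while reloading
        System.out.println(m.willRebuildAtStartup("idx_new"));    // false
        System.out.println(m.willRebuildAtStartup("idx_reload")); // true
    }
}
```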
[jira] [Created] (CASSANDRA-13725) Secondary indexes are always rebuilt at startup
Sergio Bossa created CASSANDRA-13725: Summary: Secondary indexes are always rebuilt at startup Key: CASSANDRA-13725 URL: https://issues.apache.org/jira/browse/CASSANDRA-13725 Project: Cassandra Issue Type: Bug Components: Secondary Indexes Reporter: Sergio Bossa Assignee: Sergio Bossa Priority: Critical Fix For: 4.0 Following CASSANDRA-10130, a bug has been introduced that causes a 2i to be rebuilt at startup, even if such index is already built.
[jira] [Updated] (CASSANDRA-13547) Filtered materialized views missing data
[ https://issues.apache.org/jira/browse/CASSANDRA-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13547: - Reviewer: ZhaoYang > Filtered materialized views missing data > > > Key: CASSANDRA-13547 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13547 > Project: Cassandra > Issue Type: Bug > Components: Materialized Views > Environment: Official Cassandra 3.10 Docker image (ID 154b919bf8ce). >Reporter: Craig Nicholson >Assignee: Krishna Dattu Koneru >Priority: Blocker > Labels: materializedviews > Fix For: 3.11.x > > > When creating a materialized view against a base table the materialized view > does not always reflect the correct data. > Using the following test schema: > {code:title=Schema|language=sql} > DROP KEYSPACE IF EXISTS test; > CREATE KEYSPACE test > WITH REPLICATION = { >'class' : 'SimpleStrategy', >'replication_factor' : 1 > }; > CREATE TABLE test.table1 ( > id int, > name text, > enabled boolean, > foo text, > PRIMARY KEY (id, name)); > CREATE MATERIALIZED VIEW test.table1_mv1 AS SELECT id, name, foo > FROM test.table1 > WHERE id IS NOT NULL > AND name IS NOT NULL > AND enabled = TRUE > PRIMARY KEY ((name), id); > CREATE MATERIALIZED VIEW test.table1_mv2 AS SELECT id, name, foo, enabled > FROM test.table1 > WHERE id IS NOT NULL > AND name IS NOT NULL > AND enabled = TRUE > PRIMARY KEY ((name), id); > {code} > When I insert a row into the base table the materialized views are updated > appropriately. 
(+) > {code:title=Insert row|language=sql} > cqlsh> INSERT INTO test.table1 (id, name, enabled, foo) VALUES (1, 'One', > TRUE, 'Bar'); > cqlsh> SELECT * FROM test.table1; > id | name | enabled | foo > +--+-+- > 1 | One |True | Bar > (1 rows) > cqlsh> SELECT * FROM test.table1_mv1; > name | id | foo > --++- > One | 1 | Bar > (1 rows) > cqlsh> SELECT * FROM test.table1_mv2; > name | id | enabled | foo > --++-+- > One | 1 |True | Bar > (1 rows) > {code} > Updating the record in the base table and setting enabled to FALSE will > filter the record from both materialized views. (+) > {code:title=Disable the row|language=sql} > cqlsh> UPDATE test.table1 SET enabled = FALSE WHERE id = 1 AND name = 'One'; > cqlsh> SELECT * FROM test.table1; > id | name | enabled | foo > +--+-+- > 1 | One | False | Bar > (1 rows) > cqlsh> SELECT * FROM test.table1_mv1; > name | id | foo > --++- > (0 rows) > cqlsh> SELECT * FROM test.table1_mv2; > name | id | enabled | foo > --++-+- > (0 rows) > {code} > However, a further update to the base table setting enabled to TRUE should > include the record in both materialized views, yet only one view > (table1_mv2) gets updated. (-) > It appears that only the view (table1_mv2) that returns the filtered column > (enabled) is updated. (-) > Additionally, columns that are not part of the partition or clustering key are > not updated. You can see that the foo column has a null value in table1_mv2.
> (-) > {code:title=Enable the row|language=sql} > cqlsh> UPDATE test.table1 SET enabled = TRUE WHERE id = 1 AND name = 'One'; > cqlsh> SELECT * FROM test.table1; > id | name | enabled | foo > +--+-+- > 1 | One |True | Bar > (1 rows) > cqlsh> SELECT * FROM test.table1_mv1; > name | id | foo > --++- > (0 rows) > cqlsh> SELECT * FROM test.table1_mv2; > name | id | enabled | foo > --++-+-- > One | 1 |True | null > (1 rows) > {code}
[jira] [Updated] (CASSANDRA-13627) Index queries are rejected on COMPACT tables
[ https://issues.apache.org/jira/browse/CASSANDRA-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13627: - Reviewer: Andrés de la Peña (was: Benjamin Lerer) > Index queries are rejected on COMPACT tables > > > Key: CASSANDRA-13627 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13627 > Project: Cassandra > Issue Type: Bug > Components: CQL >Reporter: Benjamin Lerer >Assignee: Benjamin Lerer > > Since {{3.0}}, {{compact}} tables are using under the hood {{static}} > columns. Due to that {{SELECT}} queries using secondary indexes get rejected > with the following error: > {{Queries using 2ndary indexes don't support selecting only static columns}}. > This problem can be reproduced using the following unit test: > {code}@Test > public void testIndicesOnCompactTable() throws Throwable > { > createTable("CREATE TABLE %s (pk int PRIMARY KEY, v int) WITH COMPACT > STORAGE"); > createIndex("CREATE INDEX ON %s(v)"); > execute("INSERT INTO %S (pk, v) VALUES (?, ?)", 1, 1); > execute("INSERT INTO %S (pk, v) VALUES (?, ?)", 2, 1); > execute("INSERT INTO %S (pk, v) VALUES (?, ?)", 3, 3); > assertRows(execute("SELECT pk, v FROM %s WHERE v = 1"), >row(1, 1), >row(2, 1)); > }{code}
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060619#comment-16060619 ] Sergio Bossa commented on CASSANDRA-10130: -- Excellent job everyone! +1 > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Updated] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-10130: - Component/s: (was: Materialized Views) > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059725#comment-16059725 ] Sergio Bossa commented on CASSANDRA-10130: -- Excellent, can we have another utests/dtests run before the final +1? > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination, Materialized Views >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Updated] (CASSANDRA-13512) SASI full-text search queries using standard analyzer do not work in multi-node environments
[ https://issues.apache.org/jira/browse/CASSANDRA-13512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13512: - Reviewer: Andrés de la Peña > SASI full-text search queries using standard analyzer do not work in > multi-node environments > - > > Key: CASSANDRA-13512 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13512 > Project: Cassandra > Issue Type: Bug >Reporter: Alex Petrov >Assignee: Alex Petrov > > SASI full-text search queries using standard analyzer do not work in > multi-node environments. Standard Analyzer will rewind the buffer and search > term will be empty for any node other than coordinator, so will return no > results.
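The root cause described above — an analyzer consuming a shared buffer so that later readers see an empty term — can be reproduced in isolation with plain {{java.nio}}. This is an illustration of the buffer semantics only, not the SASI code itself:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BufferReuse {
    // Reading a ByteBuffer advances its position to the limit, so a second
    // reader of the same buffer sees zero remaining bytes.
    static String consume(ByteBuffer b) {
        byte[] out = new byte[b.remaining()];
        b.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer term = ByteBuffer.wrap("cassandra".getBytes(StandardCharsets.UTF_8));
        System.out.println(consume(term));           // first reader gets the term
        System.out.println(consume(term).isEmpty()); // second reader gets nothing: true

        // Reading through duplicate() leaves the shared buffer's position
        // untouched, so every consumer sees the full term.
        ByteBuffer term2 = ByteBuffer.wrap("cassandra".getBytes(StandardCharsets.UTF_8));
        System.out.println(consume(term2.duplicate()));
        System.out.println(consume(term2)); // still "cassandra"
    }
}
```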
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059129#comment-16059129 ] Sergio Bossa commented on CASSANDRA-10130: -- [~adelapena], the latest commit looks good, just two minor notes about tests: 1) Shouldn't we set the [throwable|https://github.com/adelapena/cassandra/compare/ca7da30f621e742f85b6a7b1f66d320ba224a6a4...adelapena:0f6972eacdab6b0c81e00d8c0c59968106d3f462#diff-0ccfd193eadf8f1c2a48cf846b4eb791R549] to null on {{TestingIndex#clear()}}? 2) Shouldn't we test the stability inspector on index creation too? > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054296#comment-16054296 ] Sergio Bossa commented on CASSANDRA-10130: -- bq. I think we should probably run exceptions during 2i rebuild failure (logAndMarkIndexesFailed) via the JVMStabilityInspector Agreed. +1 otherwise, excellent job everyone :) > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053861#comment-16053861 ] Sergio Bossa commented on CASSANDRA-10130: -- bq. I really think it makes the code cleaner and less prone to errors to update SecondaryIndexManager state (needsFullRebuild, queryableIndexes and inProgressBuilds) exclusively in the mark* methods rather than in many places throughout the code. I know that would be cleaner, but bear in mind that's also a change from previous behaviour *and* would cause the index to be considered queryable even if the initialization task failed and later a non-full rebuild succeeded. I'm not a fan of that, but I'll leave it to you guys. bq. I also removed the marking of index as failed before fetching the initialization task on this commit. Looks good, and also properly marks the index failed (which we forgot to do before). bq. Regarding the exception being passed as the last parameter to the logger, that's not a bug since sl4j will properly handle that case Apologies, I misread the diff. I'd like to finally highlight two further changes from previous behaviour that just crossed my mind (unrelated to Paulo's patches): 1) If the index initialization fails because of a non-build error, and the index was previously built, it will be rebuilt again from scratch anyway. 2) If a single sstable build fails, the index will be (potentially) fully rebuilt at restart; while theoretically correct, this could have a non-trivial performance impact at startup. That said, #1 is probably better addressed by CASSANDRA-13606, while #2 is what worries me most; I personally do not think that automatically rebuilding the indexes at startup gives much of an advantage, as it's an async operation anyway, so I'd propose adding a {{cassandra.yaml}} option to explicitly enable automatic rebuilds, and/or a NodeTool flag to rebuild indexes _only if_ not already built. Thoughts? 
> Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051990#comment-16051990 ] Sergio Bossa commented on CASSANDRA-10130: -- bq. Update queryableIndexes only on markIndexBuilt and markIndexRemoved I think we should preserve the original behaviour; it is not perfect, but CASSANDRA-13606 should provide a proper fix. bq. Remove unnecessary marking of index in need of full rebuild on index creation I do not think that was unnecessary, because `createIndex` could fail due to an exception in {{Index#getInitializationTask()}}, hence should be considered as failed. bq. Warn user to run full rebuild after index failure and log failure stack trace if present +1 (but {{logAndMarkIndexesFailed}} seems to have a bug [here|https://github.com/pauloricardomg/cassandra/commit/19b6f3c68d1b00895de2d3da5aec54102b44ce9f#diff-3f2c8994c4ff8748c3faf7e70958520dR627]) Other than that, I'm +1 on the latest changes, I think we're almost done :) > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050347#comment-16050347 ] Sergio Bossa commented on CASSANDRA-10130: -- Also, a style note about indentation for lambdas and anonymous classes; I see [some|https://github.com/adelapena/cassandra/blob/eb0316d651bbdec70896be041ab1981cc9349a18/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L453] are deeply indented, but I think that makes the code harder to read, and I prefer [standard 4 spaces indent|https://github.com/adelapena/cassandra/blob/eb0316d651bbdec70896be041ab1981cc9349a18/src/java/org/apache/cassandra/index/SecondaryIndexManager.java#L492]: thoughts? Whatever the choice, we should make the patch consistent (but possibly avoid changing other untouched code) > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050313#comment-16050313 ] Sergio Bossa commented on CASSANDRA-10130: -- As a side note, I opened CASSANDRA-13606 to further improve initialization failures handling. > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at the > startup, or at least warn user when the node restarts.
[jira] [Created] (CASSANDRA-13606) Improve handling of 2i initialization failures
Sergio Bossa created CASSANDRA-13606: Summary: Improve handling of 2i initialization failures Key: CASSANDRA-13606 URL: https://issues.apache.org/jira/browse/CASSANDRA-13606 Project: Cassandra Issue Type: Improvement Reporter: Sergio Bossa Assignee: Sergio Bossa Fix For: 4.0 CASSANDRA-10130 fixes the 2i build management, but initialization failures are still not properly handled, most notably because: * Initialization failures make the index non-queryable, but it can still be written to. * Initialization failures can be recovered via full rebuilds. Both points above are probably suboptimal because the initialization logic could be more complex than just an index build, hence it shouldn't be made recoverable via a simple rebuild, and could cause the index to be fully unavailable not just for reads, but for writes as well. So, we should better handle initialization failures by: * Allowing the index implementation to specify if unavailable for reads, writes, or both. * Providing a proper method to recover, distinct from index rebuilds. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
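The read/write availability split proposed in this ticket can be illustrated with a tiny sketch; the {{Unavailability}} enum and the helper names below are assumptions invented for the example, not an actual Cassandra API:

```java
public class Main
{
    // Hedged sketch: an index declares which operations an initialization
    // failure should make unavailable; names are invented for illustration.
    enum Unavailability { NONE, READS, WRITES, READS_AND_WRITES }

    static boolean isQueryable(Unavailability u)
    {
        // reads allowed unless reads are among the blocked operations
        return u == Unavailability.NONE || u == Unavailability.WRITES;
    }

    static boolean isWritable(Unavailability u)
    {
        // writes allowed unless writes are among the blocked operations
        return u == Unavailability.NONE || u == Unavailability.READS;
    }

    public static void main(String[] args)
    {
        // e.g. the index lost its queryable state but can still safely accept writes
        Unavailability afterFailure = Unavailability.READS;
        System.out.println(isQueryable(afterFailure)); // false: reads blocked
        System.out.println(isWritable(afterFailure));  // true: writes still accepted
    }
}
```

The point of the sketch is only that availability becomes a per-operation decision of the index implementation, rather than an all-or-nothing consequence of a failed build.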
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050253#comment-16050253 ] Sergio Bossa commented on CASSANDRA-10130: -- [~adelapena], [~pauloricardomg], excellent additions overall, just a few comments: * I mistakenly removed {{queryableIndexes}} in my refactor (as noted); [this|https://github.com/adelapena/cassandra/commit/c5604ffd5ae36cc34d13af6c4c4eadba98458fa3] commit added it back, but it's unfortunately not enough, as if the initialization task fails, the index will never be made queryable: we need to fix this, and we need a test for such failure case. * Nit/OCD: I would change the [rebuildFromSSTablesBlocking|https://github.com/adelapena/cassandra/commit/a945358a36f97632e94e9103650d72c04735354d#diff-3f2c8994c4ff8748c3faf7e70958520dR306] signature to have the {{sstables}} first, consistently with {{buildIndexesBlocking}}. * In {{buildIndexesBlocking}}, any exceptions thrown by {{flushIndexesBlocking}} in the {{finally}} block will hide previous exceptions: we should accumulate them. * Nit/OCD: In {{buildIndexesBlocking}}, we should put an additional comment before invoking {{flushIndexesBlocking}}. * If {{rebuildFromSSTablesBlocking}} is not used elsewhere, I would strongly recommend to make it private/protected and move its comment to {{rebuildIndexesBlocking}}: the [general rule|https://image.slidesharecdn.com/effectivejava-2ndedition-chapter4-151203200429-lva1-app6891/95/effective-java-chapter-4-classes-and-interfaces-4-638.jpg?cb=1449173201] is methods (as well as attributes) should have the most restricted visibility allowed by working code. * Regarding the {{test_standalone_scrub}} failure, I _believe_ it's due to the fact we only manage counters in our marking methods when {{DatabaseDescriptor.isDaemonInitialized()}}, which is obviously not the case with the scrubber; if so, it's probably an easy fix. 
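The {{finally}}-block concern above (a failing {{flushIndexesBlocking}} masking the original build failure) maps directly onto {{Throwable#addSuppressed}}; the method names below are placeholders for the sketch, not the real SecondaryIndexManager code:

```java
public class Main
{
    // Stand-ins for the build and flush steps; both fail in this sketch.
    static void build() { throw new RuntimeException("build failed"); }
    static void flush() { throw new RuntimeException("flush failed"); }

    static void buildThenFlush()
    {
        RuntimeException primary = null;
        try
        {
            build();
        }
        catch (RuntimeException t)
        {
            primary = t;
            throw t;
        }
        finally
        {
            try
            {
                flush();
            }
            catch (RuntimeException t)
            {
                if (primary != null)
                    primary.addSuppressed(t); // accumulate: don't hide the build failure
                else
                    throw t; // flush is the only failure, let it propagate
            }
        }
    }

    public static void main(String[] args)
    {
        try
        {
            buildThenFlush();
        }
        catch (RuntimeException t)
        {
            System.out.println(t.getMessage());                    // build failed
            System.out.println(t.getSuppressed()[0].getMessage()); // flush failed
        }
    }
}
```

Without the inner catch, the exception thrown from the {{finally}} block would replace the build exception entirely and the root cause would be lost.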
[jira] [Comment Edited] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049090#comment-16049090 ] Sergio Bossa edited comment on CASSANDRA-10130 at 6/14/17 1:31 PM: --- [~adelapena], I reviewed your latest patch and found the concurrent lifecycle implementation a bit hard to understand and maintain. I tried to explain it and give suggestions in writing, but in the end I found it easier and more productive to try an alternative implementation by myself, which you can find [here|https://github.com/sbtourist/cassandra/commit/efe452846d29173e8b0d522651e7dc9f49f2597e]; in a nutshell, this implementation doesn't require any additional locks and manages a fully thread safe lifecycle whose cases I tried to extensively test. Let me know if you (and [~pauloricardomg]) have any feedback, and feel free to pull it in your branch (just note the patch is implemented against [this|https://github.com/adelapena/cassandra/commit/c002d26255eaf53070a840d7232788e33f852e19] commit in your branch, that is the one previous to your latest changes). was (Author: sbtourist): [~adelapena], I reviewed your latest patch and found the concurrent lifecycle implementation a bit hard to understand and maintain. I tried to explain it and give suggestions in writing, but in the end I found it easier and more productive to try an alternative implementation by myself, which you can find [here|https://github.com/sbtourist/cassandra/commit/e1e316eae4d45a01aff7c86e1e21dc3a555e9ca5]; in a nutshell, this implementation doesn't require any additional locks and manages a fully thread safe lifecycle whose cases I tried to extensively test. 
Let me know if you (and [~pauloricardomg]) have any feedback, and feel free to pull it in your branch (just note the patch is implemented against [this|https://github.com/adelapena/cassandra/commit/c002d26255eaf53070a840d7232788e33f852e19] commit in your branch, that is the one previous to your latest changes).
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038692#comment-16038692 ] Sergio Bossa commented on CASSANDRA-10130: -- bq. Sorry for being picky here but while we are fixing the original limitation, we are introducing a new limitation that if there's ever a non-fatal index build failure, a successful full index rebuild will not mark the index as built until the node is restarted and the index is unnecessarily rebuilt. Excellent point, you're not being picky at all. There's actually a related problem: if a single sstable indexing fails, we restart the node, and try to load a *new* sstable, the index will be marked as built, even if there's an sstable whose indexing failed. In other words, it seems to me we should mark the index as built _only_ if: 1) It is a full rebuild. 2) It is an initialization task (which should be considered as the initial full build). 3) It is a single/group sstable(s) indexing, and the index was *already built*: that is, if we initiate an sstable indexing, but the index was *not* marked as built, we should preserve such state (that implementation-wise probably means to just keep the counters at 0 and avoid any marking). Other than that, I have a concern about [flushing indexes in the future transformation|https://github.com/apache/cassandra/compare/trunk...adelapena:10130-trunk#diff-3f2c8994c4ff8748c3faf7e70958520dR399], which would cause such blocking activity to happen in the compaction thread, a departure from previous behaviour and probably an unwanted one. 
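The three marking conditions above can be condensed into a small decision helper; {{BuildKind}} and {{shouldMarkBuilt}} are invented names for this sketch, not SecondaryIndexManager code:

```java
public class Main
{
    // Hedged sketch of the three conditions under which a completed indexing
    // task should mark the index as built; names are illustrative only.
    enum BuildKind { FULL_REBUILD, INITIALIZATION, SSTABLE_BATCH }

    static boolean shouldMarkBuilt(BuildKind kind, boolean wasAlreadyBuilt)
    {
        switch (kind)
        {
            case FULL_REBUILD:   // 1) a full rebuild (re)establishes the built state
            case INITIALIZATION: // 2) the initialization task is the initial full build
                return true;
            case SSTABLE_BATCH:  // 3) incremental indexing must not upgrade a non-built index
                return wasAlreadyBuilt;
            default:
                throw new AssertionError(kind);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(shouldMarkBuilt(BuildKind.FULL_REBUILD, false));  // true
        System.out.println(shouldMarkBuilt(BuildKind.SSTABLE_BATCH, false)); // false: preserve non-built state
        System.out.println(shouldMarkBuilt(BuildKind.SSTABLE_BATCH, true));  // true
    }
}
```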
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034963#comment-16034963 ] Sergio Bossa commented on CASSANDRA-10130: -- bq. It's still possible that an index is created after the SSTableAddedNotification but before all sstables are added to the tracker That made me also think of an ABA-like race, where a new index is registered _and built_ in-between the two notifications, causing it to miss the new-to-be-added sstables. Weird stuff, and overall excellent points [~pauloricardomg]. Given that most of the races come from the 2-phase event-based solution, I'd suggest simplifying it in a way that still solves at least the original issue: that is, keep just the {{SSTableAddedNotification}}, and mark the index building/built around it; this will not protect against failures happening _while_ adding SSTables, but it would at least solve the original problem of the index itself failing after adding the SSTables and missing the rebuild. Then, we can think about a more complete solution (i.e. one that doesn't rebuild the whole index because of a single failed stream) in another ticket. Thoughts?
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034482#comment-16034482 ] Sergio Bossa commented on CASSANDRA-10130: -- bq. I have replaced SSTableLoadedNotification by a SSTableBeforeAddedNotification Nit: {{SSTableBeforeAddedNotification}} would probably read better as {{SSTableBeforeAddNotification}}. bq. It's not clear that users of buildIndexesBlocking should mark the indexes as building beforehand, so we should make that explicit via an assertion and probably also add a comment A comment is a good idea, but I think we shouldn't go further than that: {{buildIndexesBlocking}} is private so we can assume callers know what they're doing. bq. There's still a slight chance that an index is created between an SSTableBeforeAddedNotification and SSTableAddedNotification and we won't have marked it as building I think in such case the new {{Index}} initialization task would take care of indexing, so it doesn't really matter if the new index misses the sstable notification. So, to cover against the specific race you mentioned, we can just filter out in {{SIM#handleNotification()}}, when receiving the {{SSTableAddedNotification}}, all indexes not marked as building, as we can assume those missed the first notification because not yet registered (and being just registered, the initialization task will eventually take care of any initial indexing). bq. we could probably get rid of the pendingBuilds counter complexity since we wouldn't race with manual individual index rebuilds on the BUILT_INDEXES table I think the race above can be solved easily without adding columns to the system table. Other than that, let's not forget the {{pendingBuilds}} counter was needed to protect us not just against concurrent building of multiple indexes, but also against concurrent building of multiple sstables "batches" for the same index. Thoughts? 
I'll now proceed with reviewing the latest code changes.
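The filtering suggested above for {{SIM#handleNotification()}} could look roughly like this; the sets and names are stand-ins for the real registration/marking state, not actual SecondaryIndexManager fields:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Main
{
    // Hedged sketch of the suggested filter: on SSTableAddedNotification, only
    // indexes already marked as building are asked to index the new sstables.
    // A just-registered index skipped here is still covered by its own
    // initialization task, which indexes everything anyway.
    static final Set<String> registered = new HashSet<>();
    static final Set<String> building = new HashSet<>();

    static List<String> indexesToNotify()
    {
        List<String> targets = new ArrayList<>();
        for (String index : registered)
            if (building.contains(index)) // otherwise it missed the "before add" notification
                targets.add(index);
        return targets;
    }

    public static void main(String[] args)
    {
        registered.add("old_index");
        building.add("old_index");   // saw SSTableBeforeAddedNotification
        registered.add("new_index"); // registered in-between the two notifications
        System.out.println(indexesToNotify()); // [old_index]
    }
}
```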
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031771#comment-16031771 ] Sergio Bossa commented on CASSANDRA-10130: -- [~pauloricardomg], excellent suggestion. [~adelapena], I personally like the "memtable" version more: logically speaking, if an sstable is added together with a memtable, it means it has also been indexed, while in all other cases the sstable(s) have been externally loaded and need indexing. I do have a few concerns, though: 1) There are too many overloads of {{addSSTable(s)}}, and some of them do not make much sense and exist just to support self calls (i.e. the version with many sstables and a single memtable, I believe), so can we clean that up? 2) Who actually calls the {{addSSTable}} method with the memtable? I think I'm missing where it's actually used... 3) Why did you reduce the test coverage in {{indexCorrectlyMarkedAsBuildAndRemoved}}?
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015593#comment-16015593 ] Sergio Bossa commented on CASSANDRA-10130: -- [~adelapena], the approach looks good to me; here is some specific feedback: * Why execute the [preBuildTask|https://github.com/apache/cassandra/compare/trunk...adelapena:10130-trunk#diff-3f2c8994c4ff8748c3faf7e70958520dR389] in another thread via {{executeBlocking}}, rather than just calling {{Runnable#run}}? * Currently, {{buildIndexesBlocking}} will end up rebuilding *all* indexes even if just a single one failed, because {{markIndexBuilt}} is called in bulk at the very end; I know this is in line with the previous behaviour, but wouldn't it make sense to improve it in this issue? * Do we have tests checking: ** Index status pre and post (re)building actions (create index, rebuild index). ** Index status upon index removal. ** Index status and automatic rebuild in case of failures.
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012006#comment-16012006 ] Sergio Bossa commented on CASSANDRA-10130: -- Some more feedback by my side too. I think there still is a subtle concurrency issue, due to the fact we have some methods managing the "index status" in the system keyspace via the counter, and other methods directly messing with the system keyspace and/or the pending builds map; i.e., what happens if the index is unregistered while it is being rebuilt? Probably an [NPE|https://github.com/apache/cassandra/compare/trunk...adelapena:10130-trunk#diff-3f2c8994c4ff8748c3faf7e70958520dR424]. I would suggest to clean it up a bit by avoiding to keep "pending builds" in the map after they're all done, that is after the index has been marked as built: by doing so, you can just remove the pending build inside {{markIndexBuilt}} if the counter reaches 0 (you'd have to guard the {{mark*}} methods rather than the counter ones), and hence keep the "build management" code inside the two "mark building/built" methods only. I'm kinda unsure if we should actually call {{markIndexBuilding}} inside {{createIndex}}: what if the user forgets to call {{markIndexBuilt}} inside the initialization task? There's no such contract forcing the user to do that. So, as a corollary, we should probably accept calls to {{markIndexBuilt}} even without a previous {{markIndexBuilding}} call (that is, making it idempotent). Also, {{buildAllIndexesBlocking}} doesn't follow the "mark building/built" pattern: isn't that a weird anomaly? I know the {{StreamReceiveTask}} does that, but what about other callers? Regarding {{PendingIndexBuildsCounter}}, it isn't really just a counter, I would rename it (i.e. just {{PendingBuild}}), and rename its methods as well for further clarity; or we could ditch it altogether as suggested by [~pauloricardomg]. Finally regarding: bq. 
the user will expect the index NOT to be rebuilt on restart if there was a subsequent successful rebuild. Is that what [~adelapena] actually meant? I do not think the index is actually rebuilt on restart if successfully rebuilt manually, or am I missing the point?
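Going back to the pending-build bookkeeping discussed above (keep build management inside the two mark methods, drop the map entry once the counter reaches zero, and make {{markIndexBuilt}} idempotent), here is a minimal sketch; all names are assumptions, not the actual patch:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class Main
{
    // Hedged sketch: the pending-build counter lives only inside the two mark
    // methods; the map entry is removed when the counter returns to zero, and
    // markIndexBuilt is accepted even without a prior markIndexBuilding call.
    static final ConcurrentMap<String, AtomicInteger> pendingBuilds = new ConcurrentHashMap<>();

    static void markIndexBuilding(String index)
    {
        pendingBuilds.computeIfAbsent(index, k -> new AtomicInteger()).incrementAndGet();
    }

    // returns true when the index should now be marked built in the system table
    static boolean markIndexBuilt(String index)
    {
        AtomicInteger pending = pendingBuilds.get(index);
        if (pending == null)
            return true; // idempotent: no build in flight, accept the call
        if (pending.decrementAndGet() == 0)
        {
            pendingBuilds.remove(index, pending); // last batch done, nothing lingers in the map
            return true;
        }
        return false; // other sstable batches are still building this index
    }

    public static void main(String[] args)
    {
        markIndexBuilding("idx");
        markIndexBuilding("idx"); // two concurrent sstable batches
        System.out.println(markIndexBuilt("idx")); // false: one batch still pending
        System.out.println(markIndexBuilt("idx")); // true: counter hit zero, entry removed
        System.out.println(markIndexBuilt("idx")); // true: idempotent without prior building
    }
}
```

This also covers the concurrent-batches case: the counter protects both multiple indexes and multiple sstable "batches" building the same index.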
[jira] [Assigned] (CASSANDRA-8273) Allow filtering queries can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa reassigned CASSANDRA-8273: --- Assignee: Andrés de la Peña > Allow filtering queries can return stale data > - > > Key: CASSANDRA-8273 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8273 > Project: Cassandra > Issue Type: Bug >Reporter: Sylvain Lebresne >Assignee: Andrés de la Peña > > Data filtering is done replica side. That means that a single replica with > stale data may make the whole query return that stale data. > For instance, consider 3 replicas A, B and C, and the following situation: > {noformat} > CREATE TABLE test (k int PRIMARY KEY, v1 text, v2 int); > CREATE INDEX ON test(v1); > INSERT INTO test(k, v1, v2) VALUES (0, 'foo', 1); > {noformat} > with every replica up to date. Now, suppose that the following queries are > done at {{QUORUM}}: > {noformat} > UPDATE test SET v2 = 2 WHERE k = 0; > SELECT * FROM test WHERE v1 = 'foo' AND v2 = 1; > {noformat} > then, if A and B acknowledge the insert but C respond to the read before > having applied the insert, then the now stale result will be returned. Let's > note that this is a problem related to filtering, not 2ndary indexes. > This issue share similarity with CASSANDRA-8272 but contrarily to that former > issue, I'm not sure how to fix it. Obviously, moving the filtering to the > coordinator would remove that problem, but doing so would, on top of not > being trivial to implmenent, have serious performance impact since we can't > know in advance how much data will be filtered and we may have to redo query > to replica multiple times. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010605#comment-16010605 ] Sergio Bossa commented on CASSANDRA-8272: - bq. Here is a draft patch showing the approach. Excellent, I like it too. > 2ndary indexes can return stale data > > > Key: CASSANDRA-8272 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8272 > Project: Cassandra > Issue Type: Bug >Reporter: Sylvain Lebresne >Assignee: Andrés de la Peña > Fix For: 3.0.x > > > When replica return 2ndary index results, it's possible for a single replica > to return a stale result and that result will be sent back to the user, > potentially failing the CL contract. > For instance, consider 3 replicas A, B and C, and the following situation: > {noformat} > CREATE TABLE test (k int PRIMARY KEY, v text); > CREATE INDEX ON test(v); > INSERT INTO test(k, v) VALUES (0, 'foo'); > {noformat} > with every replica up to date. Now, suppose that the following queries are > done at {{QUORUM}}: > {noformat} > UPDATE test SET v = 'bar' WHERE k = 0; > SELECT * FROM test WHERE v = 'foo'; > {noformat} > then, if A and B acknowledge the insert but C respond to the read before > having applied the insert, then the now stale result will be returned (since > C will return it and A or B will return nothing). > A potential solution would be that when we read a tombstone in the index (and > provided we make the index inherit the gcGrace of it's parent CF), instead of > skipping that tombstone, we'd insert in the result a corresponding range > tombstone. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-8272) 2ndary indexes can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008026#comment-16008026 ] Sergio Bossa edited comment on CASSANDRA-8272 at 5/12/17 12:25 PM: --- bq. I just want us to have the discussion around filtering in CASSANDRA-8273 to avoid mixing things up, but If we can agree on moving filtering server-side there quickly, then I'm totally fine doing that and the indexing in a single patch if we prefer. I'm fine with moving filtering server side, and I'm fine with dealing with it on either issues provided we eventually address CASSANDRA-8273 straight away. bq. If you're making a general point [about API], then sure. Yes, I'm making a general point. Also see below. bq. we'd have to rely on the version exchanged through gossip in a way we never have, so with risks associated Alright, I don't have enough knowledge to provide a proper risk assessment on my own, so I'm fine with the previously proposed "split" patch. bq. I think we can and should make most things index agnostic here Totally agree here, and that's what I meant when saying we should make it work with all index implementations and devise APIs to leverage. I agree we should move away from using {{Index#postProcessorFor}}, and I agree we should keep the counting outside the index implementation (again, as it will be useful to all index implementations), but I'll leave the actual implementation details to [~adelapena] and discuss once we've got a first cut. So here's my take away about how we should move forward: 1) Provide a first patch based on 3.11 to fix the coordinator side. 2) Provide a second patch based on trunk to fix both coordinator and replica side (we could address this first, review, and backport the coordinator side once in agreement). 
3) Discuss filtering in CASSANDRA-8273 and eventually address it here (this patch is getting quite meaty already, but OTOH CASSANDRA-8273 will require a split approach too, so we might want to make our life easier and do everything in one place). Thoughts?
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007972#comment-16007972 ] Sergio Bossa commented on CASSANDRA-10130: -- bq. There is still a slight possibility of race though if an index is being built by some other thread and we mark the index as built after the streaming operation but before the other thread build is completed. There's actually a very probable race in the case of parallel repairs, when multiple streaming tasks for the same table are created. We should either find a better (race-free) place to put those calls, or employ a guarded counter as suggested. > Node failure during 2i update after streaming can have incomplete 2i when > restarted > --- > > Key: CASSANDRA-10130 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10130 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Yuki Morishita >Assignee: Andrés de la Peña >Priority: Minor > > Since MV/2i update happens after SSTables are received, node failure during > MV/2i update can leave received SSTables live when restarted while MV/2i are > partially up to date. > We can add some kind of tracking mechanism to automatically rebuild at > startup, or at least warn the user when the node restarts.
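The guarded counter mentioned above could look roughly like this (an illustrative Python sketch with hypothetical names, not the actual SecondaryIndexManager code): the index is only flipped to "built" once no builds remain in flight, so a streaming task finishing early cannot prematurely mark it built while a racing repair stream is still indexing.

```python
import threading

class BuildGuard:
    """Marks an index 'built' only once no builds are in flight."""
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.built = False

    def begin_build(self):
        with self._lock:
            self._in_flight += 1
            self.built = False  # any active build invalidates 'built'

    def finish_build(self):
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self.built = True  # safe: this was the last active build

guard = BuildGuard()
guard.begin_build()   # streaming task 1 (e.g. one repair stream)
guard.begin_build()   # streaming task 2, racing with the first
guard.finish_build()  # task 2 done, but task 1 is still indexing
first = guard.built   # must still be False here
guard.finish_build()  # last build finished
second = guard.built  # now True
```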
[jira] [Commented] (CASSANDRA-10130) Node failure during 2i update after streaming can have incomplete 2i when restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006895#comment-16006895 ] Sergio Bossa commented on CASSANDRA-10130: -- I had a quick look at this one and I think we could avoid adding yet another system table by just reusing the {{BUILT_INDEXES}} table and following the same pattern used by {{SIM#rebuildIndexesBlocking()}}, that is: * Call {{SystemKeyspace#setIndexRemoved()}}. * Build index. * Call {{SystemKeyspace#setIndexBuilt()}}. Thoughts? Am I missing anything actually requiring a new table?
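The three-step pattern above can be sketched as follows (illustrative Python standing in for the Java calls on {{SystemKeyspace}}; function names mirror the real methods, the table stand-in is hypothetical). The point is crash safety by ordering: a node dying between steps 1 and 3 restarts with the index marked unbuilt, signalling a rebuild.

```python
built_indexes = {"ks.idx"}  # stand-in for the BUILT_INDEXES system table

def set_index_removed(name):
    built_indexes.discard(name)

def set_index_built(name):
    built_indexes.add(name)

def receive_sstables_and_index(name, build):
    set_index_removed(name)  # 1. mark unbuilt before touching data
    build()                  # 2. index the newly streamed sstables
    set_index_built(name)    # 3. only now persist the 'built' marker

receive_sstables_and_index("ks.idx", lambda: None)
# If the node dies during build(), "ks.idx" is absent from
# built_indexes at restart, so the index gets rebuilt at startup.
```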
[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006545#comment-16006545 ] Sergio Bossa commented on CASSANDRA-8272: - bq. my main point is that on principle we should be careful to look at the whole solution before comparing it to alternatives and deciding which one we "stick with". I've seen simple solutions get pretty messy once you fix all edge cases to the point that it wasn't the best solution anymore. Of course. And I indeed gave some thought on my own to the tombstones solution (as I'm sure [~adelapena] did as well), and I've found it quite a bit more complex than the current one, with little to no gain in return, and, something I didn't mention before, not really complete for indexes covering multiple columns, or if we'll ever want to support multiple indexes per row: in such cases, mixing tombstones and valid column values for all combinations would easily turn into a mess IMHO, while actually returning the row and post-filtering it later is cleaner and less error prone. To be noted, we could still "skim" the row when we detect it's related to a stale entry and only keep the index-related columns (and easily add a merging step in the future for the multiple-indexes cases): this would buy us the performance optimization you mentioned above, but I see it as slightly error prone and I'd rather go with a functionally complete solution first. bq. It's in particular not true that fixing this bug will be "invalidated when filtering is applied" I disagree here: if filtering is applied on top of index results, you'll still get wrong results, which is confusing to me (as a user). I understand filtering is also orthogonal, so what about fixing filtering (that is, moving to coordinator-side filtering) only when indexes are present? bq. That [fixing other index implementations] I agree is something we should consider. 
Though tbh, I have doubts we can have a solution that is completely index agnostic. Of course. But we can still provide some API (i.e. the {{isSatisfiedBy()}} you mentioned) they can leverage. And if we do this kind of work on the SASI-enabled branches, we'll have two different index implementations to test the goodness of our API. bq. One thing that hasn't been mentioned is that the fix has an impact on upgrades. Namely, in a mixed cluster, some replicas will start to return invalid results and if the coordinator isn't upgraded yet, it won't filter those, which means we'll return invalid entries. Excellent point! And definitely something to avoid. bq. That does mean we should consider starting to filter entries on index queries coordinator-side in 3.0/3.11 (even though we never return them), and only do the replica-side parts in 4.0, with a fat warning that you need to only upgrade to 4.0 from a 3.X version that has the coordinator-side fix. Mmm ... clunky. And error prone, as the 3.X code would probably be untestable. Couldn't the replica detect the coordinator version and return results accordingly? bq. Worth noting that this doesn't entirely fly for queries using custom indexes: we'd need to have them implement the CustomExpression#isSatisfiedBy method in 3.X in that scheme since we need it for the coordinator-side filtering as well, but making that method abstract in 3.X is, as said above, a breaking change. I'm not sure I get why you _have to_ make that abstract: I think it's fine to leave it as it is and warn users they'll have to override it on upgrade if they want consistent results. And for those implementations that can't implement it, we should maybe add an {{isConsistent}} predicate to disable "consistent filtering" altogether. 
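The coordinator-side post-filtering discussed here, built on an {{isSatisfiedBy()}}-style predicate, could be sketched like this (illustrative Python; the real API shape is Java and still under discussion in this thread, so every name below is a stand-in):

```python
def is_satisfied_by(row, column, value):
    """Re-check the indexed expression against the reconciled row."""
    return row.get(column) == value

def post_filter(merged_rows, column, value):
    # Drop rows that matched a stale index entry on some replica but,
    # after reconciliation across replicas, no longer satisfy the
    # original expression.
    return [r for r in merged_rows if is_satisfied_by(r, column, value)]

# k=1 matched a stale index entry (its 'v' was deleted); k=2 is valid.
merged = [{"k": 1}, {"k": 2, "v": "foo"}]
filtered = post_filter(merged, "v", "foo")  # only k=2 survives
```

Keeping the predicate index-agnostic is what lets the same filtering step serve the built-in indexes, SASI, and custom implementations alike.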
[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006297#comment-16006297 ] Sergio Bossa commented on CASSANDRA-8272: - One more thing I'm realizing we should fix is paging, which seems broken in a similar way to limiting: that is, if the page size is less than the number of mismatched rows, we might end up going through "empty" pages until we get to the valid results.
[jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006245#comment-16006245 ] Sergio Bossa commented on CASSANDRA-8272: - [~adelapena], I gave a first review pass and the approach looks sensible, so +1 on that. Unfortunately, the problem is actually quite subtle and there are at least a couple of cases where it doesn't fully work. First of all, when a {{LIMIT}} clause is provided, the query might return no results when there actually are some valid ones: this is because the rows returned as a result of an "index mismatch" are still counted against the limit (by {{CQLCounter}}), which means the coordinator might end up with fewer valid rows than the requested limit, simply because some replicas returned only mismatched rows. Here's a simple scenario with two nodes: 1) Write row {{key=1,index=1}}. 2) Write row {{key=2,index=1}}. 3) Shut down node 2. 4) Delete column {{index}} from row {{key=1}}: the delete will go to node 1, while node 2 will miss it. 5) Restart node 2 (hints need to be disabled). 6) Query for {{index=1}}. 7) Node 1 will return the first row found, i.e. the "mismatched" one {{key=1}}. 8) Node 2 will return the "missed delete" with {{key=1}}. 9) Coordinator will merge/post-process the rows, realize there's a mismatch and return no results, while it should have instead returned {{key=2}}. Second, this patch doesn't fix filtering; while it's true we have a different issue for that ({{CASSANDRA-8273}}), and while we could argue filtering isn't exactly a form of indexing, it is still used in conjunction with indexing, and fixing indexing just to have its results invalidated when filtering is applied seems quite confusing to me. In the end, I'd suggest the following: 1) Stick with the current approach! It's good and I do not think using special tombstones would buy us anything. 2) Fix the first problem above. 
3) Generalize the approach so we can fix filtering and any other indexing implementation (most notably SASI). 4) To ease the burden of porting between versions, and given this is not a trivial bug fix at all, I'd also suggest applying it only to 3.11 onwards. Thoughts?
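The {{LIMIT}} problem in the two-node scenario above can be reproduced with a toy coordinator (a Python sketch; the `stale` flag stands in for a row whose index entry no longer matches after merging, and both counting functions are hypothetical illustrations, not the actual {{CQLCounter}} logic):

```python
def coordinator_count_all(rows, limit):
    """Buggy counting: stale matches are counted against LIMIT,
    then dropped by post-processing, leaving too few results."""
    taken = rows[:limit]
    return [r for r in taken if not r["stale"]]

def coordinator_count_valid(rows, limit):
    """Fixed counting: only rows surviving post-filtering count."""
    out = []
    for r in rows:
        if not r["stale"]:
            out.append(r)
            if len(out) == limit:
                break
    return out

# Merged stream from the scenario: key=1 is the stale index match
# (the delete that node 2 missed), key=2 is the valid result.
rows = [{"key": 1, "stale": True}, {"key": 2, "stale": False}]

buggy = coordinator_count_all(rows, limit=1)    # stale row ate the limit
fixed = coordinator_count_valid(rows, limit=1)  # key=2 is returned
```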
[jira] [Updated] (CASSANDRA-8272) 2ndary indexes can return stale data
[ https://issues.apache.org/jira/browse/CASSANDRA-8272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-8272: Reviewer: Sergio Bossa
[jira] [Commented] (CASSANDRA-13225) Best Consistency Level
[ https://issues.apache.org/jira/browse/CASSANDRA-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870593#comment-15870593 ] Sergio Bossa commented on CASSANDRA-13225: -- bq. Does the back-pressure strategy set at SLOW and a consistency level of Quorum mean that while we will only wait for 3 replicas out of 5 for the write to succeed we will overall be limiting the number of write requests to all replicas based on the slowest responses we are getting from the cluster? Yes. Specifically, the rate limiting is computed based on the ratio between sent messages and received responses. bq. Does this then mean that a large set of individual writes will be rate limited by the slowest write responses and we will in effect be waiting for all replicas even though each individual write is only waiting for Quorum replicas? Kinda. Keep in mind we're still not actually *waiting* for all replicas, but just rate limiting (so pausing) based on the feedback loop produced by messages sent and responses received. > Best Consistency Level > -- > > Key: CASSANDRA-13225 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13225 > Project: Cassandra > Issue Type: New Feature >Reporter: Connor Warrington >Priority: Minor > > When writing data into a cluster there are a few consistency levels to choose > from. When choosing the consistency level to write with you are making a > tradeoff between consistency and failover availability. If you choose > consistency level ALL then all replicas have to be up and when a write > succeeds all replicas have received the write. If you choose consistency level > QUORUM then a quorum number of replicas have to be up and when a write > succeeds a quorum of replicas have received the write. The tradeoff comes > in when there are more than quorum nodes available for the write. We would > like a write to succeed only when all replicas that are up have received the > write. 
Hence the suggestion of best as a consistency level. This would be > available for the existing consistency levels. The main idea behind this > feature request is that we are okay with a replica going down (fault > tolerance) but when the cluster is in a good state we don't mind waiting for > all nodes to get the write. This would enable the writer to operate at the speed > of the slowest node instead of potentially getting into a state where that > slow node gets even further behind. This would also enable back pressure to > be better propagated through the system, as the slowest node likely has back > pressure which it is trying to tell the client about, but if we don't wait for > that node the writer loses that information. > Example scenarios: > If we have replication factor of 3: > ALL consistency means 3 replicas have to be up and 3 replicas have to > successfully get the write. > QUORUM consistency means 2 replicas have to be up and 2 replicas have to > successfully get the write. > BEST_QUORUM consistency means 2 replicas have to be up and all up replicas have > to successfully get the write. > If 3 replicas are up with replication factor of 3: > ALL would succeed as all 3 replicas are up and would return success when all > 3 replicas get the write > QUORUM would succeed as all 3 replicas are up and would return success when 2 > replicas get the write > BEST_QUORUM would succeed as all 3 replicas are up and would return success > when all 3 replicas get the write > If 2 replicas are up with replication factor of 3: > ALL would fail as only 2 replicas are up > QUORUM would succeed as 2 replicas are up and would return success when 2 > replicas get the write > BEST_QUORUM would succeed as 2 replicas are up and would return success when > 2 replicas get the write
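The rate limiting described in the comment above (driven by the ratio of sent messages to received responses) can be sketched as follows. This is an illustrative Python model, not the actual back-pressure implementation, and `base_limit` is a hypothetical per-replica rate cap:

```python
def outgoing_limit(incoming_rate, outgoing_rate, base_limit):
    """Scale the per-replica send rate by the ack ratio.

    incoming_rate: responses received per second from a replica
    outgoing_rate: requests sent per second to that replica
    base_limit:    maximum send rate when the replica keeps up
    """
    if outgoing_rate == 0:
        return base_limit  # no traffic yet: nothing to throttle on
    ratio = min(1.0, incoming_rate / outgoing_rate)
    # A replica acking only a quarter of what we send gets a quarter
    # of the rate: senders pause instead of piling more writes onto
    # the slow node, closing the feedback loop described above.
    return base_limit * ratio

fast = outgoing_limit(1000, 1000, 500)  # healthy replica: full rate
slow = outgoing_limit(250, 1000, 500)   # lagging replica: throttled
```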
[jira] [Commented] (CASSANDRA-13225) Best Consistency Level
[ https://issues.apache.org/jira/browse/CASSANDRA-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869855#comment-15869855 ] Sergio Bossa commented on CASSANDRA-13225: -- bq. If what you are worried about is overloading nodes and want to get all 3 acks back to slow down the writes, then you might want to take a look at using a back_pressure_strategy, this is exactly the case that was implemented for, reducing the number of dropped mutations when all nodes are up and there is a constant stream of writes coming in. [~Connor Warrington], I fully second [~jjordan]'s suggestion here. The main problem with a "variable" CL is that under failure conditions it doesn't really give you any guarantees, as it will succeed if both quorum or "more than quorum" are met. Hence, it's practically equal to QUORUM in terms of guarantees. Configuring the back-pressure strategy at SLOW will probably be a better way to reduce the number of dropped mutations *and* allow hints to readily fix the ones you'll still get. > Best Consistency Level > -- > > Key: CASSANDRA-13225 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13225 > Project: Cassandra > Issue Type: New Feature >Reporter: Connor Warrington >Priority: Minor > > When writing data into a cluster there are a few consistency levels to choose > from. When choosing the consistency level to write with you are making a > tradeoff between consistency and failover availability. If you choose > consistency level ALL then all replicas have to be up and when a write > succeeds all replicas received the write. If you choose consistency level > QUORUM then a quorum number of replicas have to be up and when a write > succeeds at quorum number of replicas received the write. The tradeoff comes > in when there are more then quorum nodes available for the write. We would > like a write to succeed only when all replicas that are up have received the > write. 
Hence the suggestion of best as a consistency level. This would be > available as a variant of the existing consistency levels. The main idea behind this > feature request is that we are okay with a replica going down (fault > tolerance), but when the cluster is in a good state we don't mind waiting for > all nodes to get the write. This would enable the writer to operate at the speed > of the slowest node instead of potentially getting into a state where that > slow node gets even further behind. This would also enable back pressure to > be better propagated through the system, as the slowest node likely has back > pressure which it is trying to tell the client about, but if we don't wait for > that node, the writer loses that information. > Example scenarios: > If we have replication factor of 3: > ALL consistency means 3 replicas have to be up and 3 replicas have to > successfully get the write. > QUORUM consistency means 2 replicas have to be up and 2 replicas have to > successfully get the write. > BEST_QUORUM consistency means 2 replicas have to be up and all up replicas have > to successfully get the write. > If 3 replicas are up with replication factor of 3: > ALL would succeed as all 3 replicas are up and would return success when all > 3 replicas get the write > QUORUM would succeed as all 3 replicas are up and would return success when 2 > replicas get the write > BEST_QUORUM would succeed as all 3 replicas are up and would return success > when all 3 replicas get the write > If 2 replicas are up with replication factor of 3: > ALL would fail as only 2 replicas are up > QUORUM would succeed as 2 replicas are up and would return success when 2 > replicas get the write > BEST_QUORUM would succeed as 2 replicas are up and would return success when > 2 replicas get the write -- This message was sent by Atlassian JIRA (v6.3.15#6346)
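The example scenarios above boil down to a simple "how many acks to block for" rule. A minimal sketch of that rule, with invented names (BEST_QUORUM is only a proposal, and this is not Cassandra's actual write-response-handler logic):

```java
// Hypothetical sketch of the ack rule behind the proposed BEST_QUORUM:
// how many acks a write would block for under each level, given the
// replication factor and the number of live replicas. Invented names,
// not Cassandra's actual code.
public class BlockForSketch {

    enum ConsistencyLevel { ALL, QUORUM, BEST_QUORUM }

    /** Returns the number of acks to wait for, or -1 if the write cannot be attempted. */
    static int blockFor(ConsistencyLevel cl, int replicationFactor, int liveReplicas) {
        int quorum = replicationFactor / 2 + 1;
        switch (cl) {
            case ALL:
                // All replicas must be up, and all must ack.
                return liveReplicas == replicationFactor ? replicationFactor : -1;
            case QUORUM:
                // A quorum must be up, and a quorum must ack.
                return liveReplicas >= quorum ? quorum : -1;
            case BEST_QUORUM:
                // A quorum must be up, but every live replica must ack.
                return liveReplicas >= quorum ? liveReplicas : -1;
            default:
                throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        // The scenarios from the description, RF = 3, all 3 replicas up:
        System.out.println(blockFor(ConsistencyLevel.ALL, 3, 3));         // 3
        System.out.println(blockFor(ConsistencyLevel.QUORUM, 3, 3));      // 2
        System.out.println(blockFor(ConsistencyLevel.BEST_QUORUM, 3, 3)); // 3
        // RF = 3, only 2 replicas up:
        System.out.println(blockFor(ConsistencyLevel.ALL, 3, 2));         // -1 (fails)
        System.out.println(blockFor(ConsistencyLevel.QUORUM, 3, 2));      // 2
        System.out.println(blockFor(ConsistencyLevel.BEST_QUORUM, 3, 2)); // 2
    }
}
```

This makes the earlier point concrete: under failures BEST_QUORUM degrades to exactly QUORUM's guarantee, which is why it offers no stronger contract than QUORUM.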
[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830158#comment-15830158 ] Sergio Bossa commented on CASSANDRA-13075: -- +1 > Indexer is not correctly invoked when building indexes over sstables > > > Key: CASSANDRA-13075 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13075 > Project: Cassandra > Issue Type: Bug >Reporter: Sergio Bossa >Assignee: Alex Petrov >Priority: Critical > Attachments: CustomIndexTest.java > > > Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls > each {{Indexer}} {{begin}} and {{finish}} methods multiple times per > partition (depending on the page size), as > {{PartitionIterators#getOnlyElement()}} returns an empty partition even when > the iterator is exhausted. > This leads to bugs for {{Indexer}} implementations doing actual work in those > methods, but even worse, it provides the {{Indexer}} the same input of an > empty partition containing only a non-live partition deletion, as the > {{Indexer#partitionDelete()}} method is *not* actually called. > My proposed solution: > 1) Stop the iteration before the empty partition is returned and ingested > into the {{Indexer}}. > 2) Actually call the {{Indexer#partitionDelete()}} method inside > {{SecondaryIndexManager#indexPartition()}} (which requires to use a filtered > iterator so it actually contains the deletion info). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828347#comment-15828347 ] Sergio Bossa commented on CASSANDRA-13075: -- Looks better now [~ifesdjeen]! Just a few more points: 1) As you mentioned, it is missing a test with range tombstones crossing pages. 2) [This|https://github.com/ifesdjeen/cassandra/commit/7aa0c7c41960ed11b35545c49ca7e8f869f76e2f#diff-13cb97758bcb11cce8fc6f4cb1990dd6R729] should use {{pageSize}}. 3) [This|https://github.com/ifesdjeen/cassandra/commit/7aa0c7c41960ed11b35545c49ca7e8f869f76e2f#diff-13cb97758bcb11cce8fc6f4cb1990dd6R785] should probably be replaced with {{StubIndex}} (I've used that other one just as a quick hack). > Indexer is not correctly invoked when building indexes over sstables > > > Key: CASSANDRA-13075 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13075 > Project: Cassandra > Issue Type: Bug >Reporter: Sergio Bossa >Assignee: Alex Petrov >Priority: Critical > Attachments: CustomIndexTest.java > > > Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls > each {{Indexer}} {{begin}} and {{finish}} methods multiple times per > partition (depending on the page size), as > {{PartitionIterators#getOnlyElement()}} returns an empty partition even when > the iterator is exhausted. > This leads to bugs for {{Indexer}} implementations doing actual work in those > methods, but even worse, it provides the {{Indexer}} the same input of an > empty partition containing only a non-live partition deletion, as the > {{Indexer#partitionDelete()}} method is *not* actually called. > My proposed solution: > 1) Stop the iteration before the empty partition is returned and ingested > into the {{Indexer}}. > 2) Actually call the {{Indexer#partitionDelete()}} method inside > {{SecondaryIndexManager#indexPartition()}} (which requires to use a filtered > iterator so it actually contains the deletion info). 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826423#comment-15826423 ] Sergio Bossa commented on CASSANDRA-13075: -- [~ifesdjeen], I believe there are a few problems with your patch: * [Passing|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-3f2c8994c4ff8748c3faf7e70958520dR550] the partition columns for the metadata definition is subtly different from passing the columns for the read partition, as the latter could be a subset of the former (at least according to the javadocs and a brief code inspection); I'm not sure how much it matters in practice, but it could potentially lead to unwanted index calls. * [Processing|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-3f2c8994c4ff8748c3faf7e70958520dR567] the collected range tombstones inside the loop, piggybacking on the fact that the loop itself will do one more iteration after exhaustion, doesn't actually work; that is, it seems range tombstones are never processed. This can be easily verified by checking range tombstones [here|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-13cb97758bcb11cce8fc6f4cb1990dd6R754] (as this test _should_ check for range tombstones, unless I'm missing something?). * Even if the last point worked, I believe it would lead to duplicate {{begin}} and {{finish}} calls for range tombstones. * [Unused|https://github.com/ifesdjeen/cassandra/commit/803d3c6ea6e6852103204c7d0bb46001be191ab2#diff-71513a85468b7cbc97f1e820b06a20a8R125]? 
> Indexer is not correctly invoked when building indexes over sstables > > > Key: CASSANDRA-13075 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13075 > Project: Cassandra > Issue Type: Bug >Reporter: Sergio Bossa >Assignee: Alex Petrov >Priority: Critical > Attachments: CustomIndexTest.java > > > Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls > each {{Indexer}} {{begin}} and {{finish}} methods multiple times per > partition (depending on the page size), as > {{PartitionIterators#getOnlyElement()}} returns an empty partition even when > the iterator is exhausted. > This leads to bugs for {{Indexer}} implementations doing actual work in those > methods, but even worse, it provides the {{Indexer}} the same input of an > empty partition containing only a non-live partition deletion, as the > {{Indexer#partitionDelete()}} method is *not* actually called. > My proposed solution: > 1) Stop the iteration before the empty partition is returned and ingested > into the {{Indexer}}. > 2) Actually call the {{Indexer#partitionDelete()}} method inside > {{SecondaryIndexManager#indexPartition()}} (which requires to use a filtered > iterator so it actually contains the deletion info). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
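The second review point above (processing collected range tombstones inside the loop, relying on an extra iteration after exhaustion, never fires) can be illustrated generically with the usual collect-then-process pattern. All names here are invented for illustration; this is not the patch's actual code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Generic illustration of the review point: collect range tombstones while
// iterating, then process them exactly once AFTER the iterator is exhausted.
// A well-behaved iterator simply stops, so code that counts on "one more
// iteration" after exhaustion to flush collected tombstones never runs.
public class CollectThenProcess {

    // Toy model: "RT:x" strings stand in for range tombstones, anything
    // else stands in for a row.
    static List<String> index(Iterator<String> unfiltereds) {
        List<String> indexed = new ArrayList<>();
        List<String> rangeTombstones = new ArrayList<>();
        while (unfiltereds.hasNext()) {
            String u = unfiltereds.next();
            if (u.startsWith("RT:")) {
                rangeTombstones.add(u);        // collect, don't process yet
            } else {
                indexed.add("row:" + u);       // rows are indexed as they come
            }
        }
        // Flush the collected range tombstones once, after exhaustion.
        for (String rt : rangeTombstones) {
            indexed.add("tombstone:" + rt.substring(3));
        }
        return indexed;
    }
}
```

Flushing after the loop also avoids the duplicate {{begin}}/{{finish}} pair the review warns about, since everything still happens within a single pass over the partition.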
[jira] [Updated] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-13075: - Attachment: CustomIndexTest.java [~ifesdjeen], please see the attached {{CustomIndexTest.java}}: it adds a quickly hacked {{partitionIsNotOverIndexed}} test which shows how the {{finish}} method is invoked twice even if only a single partition with a single row is present. > Indexer is not correctly invoked when building indexes over sstables > > > Key: CASSANDRA-13075 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13075 > Project: Cassandra > Issue Type: Bug >Reporter: Sergio Bossa >Assignee: Alex Petrov >Priority: Critical > Attachments: CustomIndexTest.java > > > Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls > each {{Indexer}} {{begin}} and {{finish}} methods multiple times per > partition (depending on the page size), as > {{PartitionIterators#getOnlyElement()}} returns an empty partition even when > the iterator is exhausted. > This leads to bugs for {{Indexer}} implementations doing actual work in those > methods, but even worse, it provides the {{Indexer}} the same input of an > empty partition containing only a non-live partition deletion, as the > {{Indexer#partitionDelete()}} method is *not* actually called. > My proposed solution: > 1) Stop the iteration before the empty partition is returned and ingested > into the {{Indexer}}. > 2) Actually call the {{Indexer#partitionDelete()}} method inside > {{SecondaryIndexManager#indexPartition()}} (which requires to use a filtered > iterator so it actually contains the deletion info). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables
[ https://issues.apache.org/jira/browse/CASSANDRA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15811740#comment-15811740 ] Sergio Bossa commented on CASSANDRA-13075: -- [~ifesdjeen], my description above might have been misleading, as I do not think the actual problem is calling {{start}} and {{finish}} just once: as long as they are called together, it will be fine, even if they're called multiple times for the same partition *but different subsets of rows*. The real problem to me is that they are called an additional time even when the partition is exhausted. The solution you proposed could work on paper in terms of avoiding those methods being called multiple times (at all), even if, practically speaking, it could be troublesome; but it only addresses the first part of the problem: what about calling {{Indexer#partitionDelete()}}? > Indexer is not correctly invoked when building indexes over sstables > > > Key: CASSANDRA-13075 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13075 > Project: Cassandra > Issue Type: Bug >Reporter: Sergio Bossa >Assignee: Alex Petrov >Priority: Critical > > Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls > each {{Indexer}} {{begin}} and {{finish}} methods multiple times per > partition (depending on the page size), as > {{PartitionIterators#getOnlyElement()}} returns an empty partition even when > the iterator is exhausted. > This leads to bugs for {{Indexer}} implementations doing actual work in those > methods, but even worse, it provides the {{Indexer}} the same input of an > empty partition containing only a non-live partition deletion, as the > {{Indexer#partitionDelete()}} method is *not* actually called. > My proposed solution: > 1) Stop the iteration before the empty partition is returned and ingested > into the {{Indexer}}. 
> 2) Actually call the {{Indexer#partitionDelete()}} method inside > {{SecondaryIndexManager#indexPartition()}} (which requires to use a filtered > iterator so it actually contains the deletion info). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
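The begin/finish contract discussed above can be sketched generically: one paired {{begin}}/{{finish}} per non-empty page is fine, and the bug is the extra pair fired for an empty "phantom" page after the partition is exhausted. A minimal sketch with invented names (not the actual {{SecondaryIndexManager}} code), assuming the proposed fix of stopping before the empty trailing page:

```java
import java.util.List;

// Hypothetical illustration of paged index building: begin()/finish() are
// called once per non-empty page, always paired, and never again once the
// partition is exhausted. The reported bug is equivalent to this loop doing
// one more iteration over an empty page. Invented names, not Cassandra code.
public class PagedIndexingSketch {

    interface Indexer {
        void begin();
        void insertRow(String row);
        void finish();
    }

    static void indexPartition(List<String> rows, int pageSize, Indexer indexer) {
        for (int start = 0; start < rows.size(); start += pageSize) {
            int end = Math.min(start + pageSize, rows.size());
            indexer.begin();                         // one begin per non-empty page
            for (String row : rows.subList(start, end)) {
                indexer.insertRow(row);
            }
            indexer.finish();                        // paired finish for the same page
        }
        // No trailing begin()/finish() here: the loop condition stops the
        // iteration before an empty page can be handed to the Indexer.
    }
}
```

With 3 rows and a page size of 2, this calls {{begin}}/{{finish}} twice (once per page) and {{insertRow}} three times, and never a third, empty pairing.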
[jira] [Created] (CASSANDRA-13075) Indexer is not correctly invoked when building indexes over sstables
Sergio Bossa created CASSANDRA-13075: Summary: Indexer is not correctly invoked when building indexes over sstables Key: CASSANDRA-13075 URL: https://issues.apache.org/jira/browse/CASSANDRA-13075 Project: Cassandra Issue Type: Bug Reporter: Sergio Bossa Priority: Critical Following CASSANDRA-12796, {{SecondaryIndexManager#indexPartition()}} calls each {{Indexer}}'s {{begin}} and {{finish}} methods multiple times per partition (depending on the page size), as {{PartitionIterators#getOnlyElement()}} returns an empty partition even when the iterator is exhausted. This leads to bugs for {{Indexer}} implementations doing actual work in those methods, but even worse, it provides the {{Indexer}} with an empty partition containing only a non-live partition deletion as its input, as the {{Indexer#partitionDelete()}} method is *not* actually called. My proposed solution: 1) Stop the iteration before the empty partition is returned and ingested into the {{Indexer}}. 2) Actually call the {{Indexer#partitionDelete()}} method inside {{SecondaryIndexManager#indexPartition()}} (which requires using a filtered iterator so it actually contains the deletion info). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12836) Set JOINING mode before executing the pre-join index callback
[ https://issues.apache.org/jira/browse/CASSANDRA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611989#comment-15611989 ] Sergio Bossa commented on CASSANDRA-12836: -- Thanks! > Set JOINING mode before executing the pre-join index callback > - > > Key: CASSANDRA-12836 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12836 > Project: Cassandra > Issue Type: Improvement >Reporter: Sergio Bossa >Assignee: Sergio Bossa > Fix For: 3.10 > > > The pre-join index callback introduced by CASSANDRA-12039 might perform long > running tasks, and given it's blocking, it would be good to set the node in > JOINING mode, which currently only happens in case of bootstrap, so that: > 1) The mode can be properly read by tools. > 2) Async callback implementations can read the mode themselves to verify at > which point of the startup process they are executing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12836) Set JOINING mode before executing the pre-join index callback
[ https://issues.apache.org/jira/browse/CASSANDRA-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergio Bossa updated CASSANDRA-12836: - Issue Type: Improvement (was: Bug) > Set JOINING mode before executing the pre-join index callback > - > > Key: CASSANDRA-12836 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12836 > Project: Cassandra > Issue Type: Improvement >Reporter: Sergio Bossa >Assignee: Sergio Bossa > > The pre-join index callback introduced by CASSANDRA-12039 might perform long > running tasks, and given it's blocking, it would be good to set the node in > JOINING mode, which currently only happens in case of bootstrap, so that: > 1) The mode can be properly read by tools. > 2) Async callback implementations can read the mode themselves to verify at > which point of the startup process they are executing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)