[ https://issues.apache.org/jira/browse/CASSANDRA-1991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982449#action_12982449 ]

Peter Schuller commented on CASSANDRA-1991:
-------------------------------------------

I tried very hard to construct a sequence of events with v5 that would result 
in unintended log replay, and I think I managed. Sorry if I'm wrong and wasting 
more time here, but:

(1) writes come in to a CF; the memtable is dirty
(2) writes stop coming in, but the very last row mutation causes the memtable 
to exceed its throughput threshold
(3) as a result of the threshold, the memtable is flushed by regular means; no 
sync; maybeSwitchMemtable() is called with a conservative commit log context
(4) drain is called -> the commit log is synced (the context is now good)
(5) drain calls forceBlockingFlush()
(6) forceFlush() sees that the memtable is clean, and just returns without 
calling maybeSwitchMemtable()

The end result should be that on start-up there is unintended log replay. It 
works correctly for a dirty memtable, because the sync() in drain happens prior 
to the final flush. But if the memtable is in fact already clean prior to the 
sync(), and was made clean by a previous flush that recorded a conservative 
commit log context, the discard never happens.
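
To make step (6) concrete, the shape of forceFlush() is essentially the 
following (a paraphrase with details elided, not the exact trunk code; 
binary memtable handling omitted):

    // Paraphrased shape of the current forceFlush(); details elided.
    public Future<?> forceFlush()
    {
        if (memtable.isClean())
            // Early return: maybeSwitchMemtable() never runs, so no
            // segment discard is ever submitted, and the commit log
            // header keeps the conservative context recorded by the
            // earlier threshold-triggered flush.
            return null;

        return maybeSwitchMemtable(memtable, true);
    }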

This would be fixed by moving the segment discard submission into forceFlush(), 
rather than having it be specific to the expiry path. I will attach -v6, which 
does that, except that it always syncs and maintains the "everything up to now 
is flushed and recorded as such in the commit log" property even if writes are 
happening concurrently and the memtable happened to be switched in the 
meantime. Ideally, the commit log sync would be avoided when not actually 
needed, but it is questionable whether that is worth it.
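
Concretely, -v6 makes forceFlush() look roughly like this (a sketch; the 
names, the executor wiring, and the signature of discardCompletedSegments() 
are my approximations here, the attached patch is authoritative):

    // Sketch of the -v6 shape: the discard submission lives in
    // forceFlush() itself rather than on the expiry path.
    public Future<?> forceFlush()
    {
        // Always sync first, so the context grabbed below covers all
        // writes up to now even if the memtable was concurrently
        // switched in the meantime.
        CommitLog.instance.sync();
        final CommitLogSegment.CommitLogContext ctx = CommitLog.instance.getContext();

        if (memtable.isClean())
        {
            // Nothing to flush, but still submit the discard so that
            // start-up will not replay already-flushed entries.
            return flushWriter.submit(new Runnable()
            {
                public void run()
                {
                    CommitLog.instance.discardCompletedSegments(metadata.cfId, ctx);
                }
            });
        }
        return maybeSwitchMemtable(memtable, true);
    }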

An alternative would be to move the same logic but not add the always-sync. 
Under the assumption that no one is ever interested in the property that "log 
entries won't be replayed" except under circumstances where writes are in fact 
not actively coming in, this should be fine. But it does make the semantics of 
the call pretty complex.

Also, independently of the issues here, I just realized: given that the commit 
log context we grab in forceFlush() (forceFlushOrDiscardIfExpire() in -v5) is 
grabbed in undetermined order relative to a concurrent maybeSwitchMemtable(), 
is it not unsafe that we then (at some future point, in undetermined order 
relative to the runnable submitted by maybeSwitchMemtable()) blindly update the 
commit log segment with the commit log context we got? Should not either 
discardCompletedSegmentsInternal() or else CommitLogHeader.turnOn() refuse to 
use a position that would move the pointer *backwards*?
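
For example, the guard could look something like this (pure sketch; only 
turnOn() and the cfId/position pair come from the actual code, the 
bookkeeping map is my stand-in for the header's real state):

    // Sketch of a monotonicity guard: refuse to move the per-CF
    // replay position backwards. lastFlushedPosition is an assumed
    // stand-in for the header's actual bookkeeping.
    private final Map<Integer, Long> lastFlushedPosition = new HashMap<Integer, Long>();

    public synchronized void turnOn(int cfId, long position)
    {
        Long existing = lastFlushedPosition.get(cfId);
        if (existing != null && existing > position)
            return; // stale context from a racing forceFlush(); ignore it
        lastFlushedPosition.put(cfId, position);
    }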


> CFS.maybeSwitchMemtable() calls CommitLog.instance.getContext(), which may 
> block, under flusher lock write lock
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-1991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>         Attachments: 1991-checkpointing-flush.txt, 1991-logchanges.txt, 
> 1991-trunk-v2.txt, 1991-trunk.txt, 1991-v3.txt, 1991-v4.txt, 1991-v5.txt, 
> 1991-v6.txt, trigger.py
>
>
> While investigating CASSANDRA-1955 I realized I was seeing very poor 
> latencies for reasons that had nothing to do with flush_writers, even when 
> using periodic commit log mode (and flush writers set ridiculously high, 500).
> It turns out blocked writes were slow because Table.apply() was spending lots 
> of time (I can easily trigger seconds on a moderate workload) trying to 
> acquire the flusher lock's read lock (the "flush lock millis" log printout in 
> the logging patch I'll attach).
> That in turn is caused by CFS.maybeSwitchMemtable(), which acquires the 
> flusher lock's write lock.
> Bisecting further revealed that the offending line of code that blocked was:
>     final CommitLogSegment.CommitLogContext ctx =
>         writeCommitLog ? CommitLog.instance.getContext() : null;
> Indeed, CommitLog.getContext() simply returns currentSegment().getContext(), 
> but does so by submitting a callable on the service executor. So, 
> independently of flush writers, this can block all writes (globally, for all 
> CFs) very easily, and does.
> I'll attach a file that is an independent Python script that triggers it on 
> my Mac OS X laptop (with an Intel SSD, which is why I was particularly 
> surprised). It assumes CPython and an out-of-the-box-or-almost Cassandra on 
> localhost that isn't in a cluster, and it will drop/recreate a keyspace 
> called '1955'.
> I'm also attaching, just FYI, the patch with log entries that I used while 
> tracking it down.
> Finally, I'll attach a patch with a suggested solution: keeping track of the 
> latest commit log segment with an AtomicReference (as an alternative to 
> synchronizing all access to segments). With that patch applied, latencies are 
> no longer affected by my trigger case like they were before. There are still 
> some suboptimal >100 ms cases on my test machine, but for other reasons; I'm 
> no longer able to trigger the extremes.
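
For reference, the AtomicReference approach described in the last quoted 
paragraph boils down to roughly the following self-contained sketch (the 
types here are simplified stubs of mine; the attached patch is the real 
thing):

    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of the idea: make getContext() a lock-free read of the
    // current segment instead of a callable submitted to the commit
    // log service executor. Types are simplified stubs.
    class CommitLogSegment
    {
        static class CommitLogContext
        {
            final long position;
            CommitLogContext(long position) { this.position = position; }
        }

        private volatile long writtenUpTo = 0;

        CommitLogContext getContext() { return new CommitLogContext(writtenUpTo); }
    }

    class CommitLog
    {
        private final AtomicReference<CommitLogSegment> currentSegment =
                new AtomicReference<CommitLogSegment>(new CommitLogSegment());

        // A plain atomic read: cannot block writers that hold the
        // flusher lock. At worst it returns a slightly stale context,
        // which is safe because a stale context is conservative.
        CommitLogSegment.CommitLogContext getContext()
        {
            return currentSegment.get().getContext();
        }

        // Switching segments is a single atomic reference update.
        void switchSegment()
        {
            currentSegment.set(new CommitLogSegment());
        }
    }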

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
