Hi Paulo,

First of all, thanks for your review!

I had the same concerns as you, but I thought it was being handled
correctly (which it is in some situations). I have now found a case that
creates the inconsistencies you mentioned. It is a kind of split-brain
syndrome that occurs when multiple nodes fail between repairs. See here:
https://cl.ly/3t0X1c0q1L1h. I am not happy about it, but I support your
decision. We should then add another dtest for this scenario, as the
existing dtests don't cover it.

Some issues unfortunately remain:

- 12888 is not resolved.
- MV repairs may still be f**** slow. Imagine an inconsistency of a
  single cell (possibly also due to a validation race condition, see
  CASSANDRA-12991) on a big partition. I had issues with Reaper and a
  30-minute timeout leading to 1000+ (yes!) consecutive repairs of a
  single subrange, because it always timed out and I recognized this
  very late. When I deployed 12888 on my system, this remaining subrange
  was repaired in a snap. (A rough cost sketch follows this list.)
- I guess rebuild works the same as repair and has to go through the
  write path, right? => MV repair may induce so much overhead that it
  could be cheaper to kill and replace an inconsistent node than to
  repair it. But that may introduce inconsistencies again.
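To make the overhead point concrete, here is a minimal toy cost model
(all names are made up; this is not Cassandra code). It only counts
logical operations, but it shows why a single inconsistent cell on a
big partition is so expensive when the repair stream goes through the
MV write path, and why sstable-based streaming (12888) fixes it:

    import java.util.Collections;
    import java.util.List;

    public class StreamApplyCost {

        // sstable-based streaming: the incoming file section is linked
        // in as one unit, regardless of how many rows it holds.
        static long sstableApply(List<String> rows) {
            return 1;
        }

        // write-path streaming: every row costs a read-before-write (to
        // find view rows to invalidate), a view mutation (possibly
        // remote) and a batchlog append.
        static long writePathApply(List<String> rows) {
            long ops = 0;
            for (String row : rows) {
                ops += 3; // read old row + view mutation + batchlog entry
            }
            return ops;
        }

        public static void main(String[] args) {
            // one mismatching cell forces the whole partition to stream
            List<String> partition = Collections.nCopies(1_000_000, "row");
            System.out.println("sstable apply:    " + sstableApply(partition));
            System.out.println("write-path apply: " + writePathApply(partition));
        }
    }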
All in all it is not perfect, and all of this does not really
un-frustrate me 100%. Do you have any more thoughts?

Unfortunately I have very little time these days, as my second child
was born on Monday. So thanks for your support so far. Maybe I will
have some ideas on these issues during the next few days, and I will
probably work on that ticket next week to arrive at a solution that is
at least deployable. I'd also appreciate your opinion on
CASSANDRA-12991.

2016-12-07 2:53 GMT+01:00 Paulo Motta <pauloricard...@gmail.com>:

> Hello Benjamin,
>
> Thanks for your effort on this investigation! For bootstraps and range
> transfers, I think we can indeed simplify and stream base tables and
> MVs as ordinary tables, unless there is some caveat I'm missing (I
> didn't find any special case for bootstrap/range transfers in
> CASSANDRA-6477 or in the MV design doc, please correct me if I'm
> wrong).
>
> Regarding repair of base tables, applying mutations via the write path
> is a matter of correctness, given that a base table update potentially
> needs to remove previously referenced keys in the views; repairing
> only the base table may therefore leave unreferenced keys in the
> views, breaking the MV contract. Furthermore, these unreferenced keys
> may be propagated to other replicas and never removed if you repair
> only the view. If you don't do overwrites in the base table, this is
> probably not a problem, but the DB cannot ensure this (at least not
> before CASSANDRA-9779). Furthermore, as you already noticed, repairing
> only the base table is probably faster, so I don't see a reason to
> repair the base and the MVs separately, since that is potentially more
> costly. I believe your frustration is mostly due to the bug described
> in CASSANDRA-12905, but after that and CASSANDRA-12888 are fixed,
> repair on the base table should work just fine.
>
> Based on this, I propose:
>
> - Fix CASSANDRA-12905 with your original patch that retries acquiring
>   the MV lock instead of throwing a WriteTimeoutException during
>   streaming, since this is blocking 3.10.
> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables
>   while still applying MV updates on the paired replicas.
> - Create a new ticket to use ordinary streaming for non-repair MV
>   stream sessions and keep the current behavior for MV streaming
>   originating from repair.
> - Create a new ticket to include only the base tables, and not the
>   MVs, in keyspace-level repair, since repairing the base already
>   repairs the views; this avoids people shooting themselves in the
>   foot.
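The correctness point above (unreferenced view keys after an overwrite)
can be illustrated with a small toy model; the names are hypothetical
and this is not Cassandra's actual code. The base table maps
user_id -> email, the view maps email -> user_id, i.e. the view PK
differs from the base PK:

    import java.util.HashMap;
    import java.util.Map;

    public class UnreferencedViewKey {
        static Map<Integer, String> base = new HashMap<>();
        static Map<String, Integer> view = new HashMap<>();

        // Write path: reads the old base row first, so the previously
        // referenced view key can be deleted before the new one is
        // added.
        static void writePath(int userId, String email) {
            String old = base.put(userId, email);
            if (old != null) {
                view.remove(old); // tombstone for the old view row
            }
            view.put(email, userId);
        }

        // Repair that copies base (and view) data without the
        // read-before-write: nothing ever deletes the old view key.
        static void blindRepair(int userId, String email) {
            base.put(userId, email);
            view.put(email, userId);
        }

        public static void main(String[] args) {
            writePath(1, "a@example.com");
            blindRepair(1, "b@example.com"); // overwrite arrives via repair
            // Prints {a@example.com=1, b@example.com=1}: a@example.com
            // is now an unreferenced view key that a view-side repair
            // could even propagate to other replicas.
            System.out.println(view);
        }
    }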
> Please let me know what you think. Any suggestions or feedback are
> appreciated.
>
> Cheers,
>
> Paulo
>
> 2016-12-02 8:27 GMT-02:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>
> > As I hadn't received a single reply to that, I went ahead and
> > implemented and tested it on my own with our production cluster. I
> > had real pain bringing up a new node, so I had to move on.
> >
> > Result:
> > Works like a charm. I ran many dtests that relate in any way to
> > storage, streaming, bootstrap, ... with good results. The bootstrap
> > finished in under 5:30h, without a single error log during
> > bootstrap. Also afterwards, repairs run smoothly and the cluster
> > seems to operate quite well.
> >
> > I still need:
> >
> > - Reviews (see 12888, 12905, 12984)
> > - Some opinion on whether I handled the CDC case right. IMHO CDC is
> >   not required on bootstrap, and we don't need to send the mutations
> >   through the write path just to write the commit log; doing so
> >   would also break incremental repairs. Instead, for CDC the
> >   sstables are streamed as normal, but the mutations are
> >   additionally written to the commit log. The worst case I see is
> >   that the node crashes and the commit logs for those repair streams
> >   are replayed, leading to duplicate writes, which is not really
> >   crucial and not a regular case. Any better ideas? (A sketch of
> >   this follows right after this mail.)
> > - Docs have to be updated (12985) if the patch is accepted
> >
> > I really appreciate ANY feedback. IMHO the impact of these fixes is
> > immense, and they may be a huge step towards getting MVs
> > production-ready.
> >
> > Thank you very much,
> > Benjamin
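Here is a minimal sketch of the CDC case from the list above, under the
assumption that streamed sstables are linked in directly and the
contained mutations are appended to the commit log only when CDC is
enabled; the names are illustrative, not Cassandra's real API:

    import java.util.List;

    public class CdcAwareStreamReceiver {

        interface CommitLog {
            void append(String mutation);
        }

        static void receive(List<String> rows, boolean cdcEnabled,
                            CommitLog log) {
            linkSstable(rows); // normal fast path for streamed data
            if (cdcEnabled) {
                // CDC consumers read the commit log, so streamed rows
                // must show up there too. Worst case, after a crash the
                // segment is replayed and the rows are applied twice,
                // which only means duplicate writes.
                for (String row : rows) {
                    log.append(row);
                }
            }
        }

        static void linkSstable(List<String> rows) {
            System.out.println("linked sstable with " + rows.size() + " rows");
        }

        public static void main(String[] args) {
            receive(List.of("row1", "row2"), true,
                    m -> System.out.println("commitlog <- " + m));
        }
    }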
> > ---------- Forwarded message ----------
> > From: Benjamin Roth <benjamin.r...@jaumo.com>
> > Date: 2016-11-29 17:04 GMT+01:00
> > Subject: Streaming and MVs
> > To: dev@cassandra.apache.org
> >
> > I don't know where else to discuss this issue, so I am posting it
> > here.
> >
> > I have been trying to get Cassandra to run stably with MVs since the
> > beginning of July. Normal reads + writes work as expected, but when
> > it comes to repairs or bootstrapping, it still feels far, far away
> > from what I would call fast and stable. The other day I just wanted
> > to bootstrap a new node. I tried it twice.
> > The first time, the bootstrap failed due to WTEs. I fixed this issue
> > by not timing out in streams, but then it turned out that the
> > bootstrap (load roughly 250-300 GB) didn't even finish in 24h. What
> > if I really had a problem and had to get up some nodes fast? No way!
> >
> > I think the root cause of it all is the way streams are handled on
> > tables with MVs. Sending them through the regular write path implies
> > so many bottlenecks and sometimes also redundant writes. Let me
> > explain:
> >
> > 1. Bootstrap
> > During a bootstrap, all ranges from all keyspaces and all CFs that
> > will belong to the new node are streamed. MVs are treated like all
> > other CFs, and all ranges that will move to the new node are also
> > streamed during bootstrap. Sending streams of the base tables
> > through the write path has the following negative impacts:
> >
> > - Writes are sent to the commit log. Not necessary: when a node is
> >   stopped during bootstrap, the bootstrap simply starts over, so
> >   there is no need to recover from commit logs. Non-MV tables won't
> >   have a CL anyway.
> > - MV mutations are not applied instantly but are sent to the
> >   batchlog. This is of course necessary during the range movement
> >   (if the PK of the MV differs from the base table), but what
> >   happens: the batchlog gets completely flooded. This leads to
> >   ridiculously large batchlogs (I observed BLs of 60GB), zillions of
> >   compactions and quadrillions of tombstones. It is a pure resource
> >   killer, especially because the BL uses a CF as a queue.
> > - Applying every mutation separately causes read-before-writes
> >   during the MV mutation. This is of course an order of magnitude
> >   slower than simply streaming down an sstable. The effect becomes
> >   even worse as the bootstrap progresses and creates more and more
> >   (uncompacted) sstables; many of them will never be compacted
> >   because the batchlog eats all the resources available for
> >   compaction.
> > - Streaming down the MV tables AND applying the mutations of the
> >   base tables leads to redundant writes. The redundant writes are
> >   local if the PK of the MV == the PK of the base table and, even
> >   worse, remote if not. Remote MV updates will impact nodes that
> >   aren't even part of the bootstrap.
> > - CDC should also not be necessary during bootstrap, should it? TBD
> >
> > 2. Repair
> > The negative impact is similar to bootstrap, but ...
> >
> > - Sending repairs through the write path does not mark the streamed
> >   tables as repaired. See CASSANDRA-12888. NOT doing so instantly
> >   solves that issue, much more simply than any other solution.
> > - It changes the "repair design" a bit. Repairing a base table will
> >   not automatically repair the MV. But is this bad at all? To be
> >   honest, as a newbie it was very hard for me to understand what I
> >   had to do to be sure that everything was repaired correctly.
> >   Recently I was told NOT to repair MV CFs but only to repair the
> >   base tables. This means one cannot just call "nodetool repair
> >   $keyspace" - this is complicated, not transparent, and it sucks. I
> >   changed the behaviour in my own branch and ran the dtests for MVs.
> >   2 tests failed:
> >   - base_replica_repair_test of course fails due to the design
> >     change
> >   - really_complex_repair_test fails because it intentionally times
> >     out the batchlog. IMHO this is a bearable situation. It is
> >     comparable to resurrected tombstones when running a repair after
> >     GCGS has expired; you also would not expect that to be magically
> >     fixed. The GCGS default is 10 days, and I can expect that
> >     anybody also repairs their MVs during that period, not only the
> >     base table.
> >
> > 3. Rebuild
> > Same as bootstrap, isn't it?
> >
> > Did I forget any cases?
> > What do you think?
> >
> > --
> > Benjamin Roth
> > Prokurist
> >
> > Jaumo GmbH · www.jaumo.com
> > Wehrstraße 46 · 73035 Göppingen · Germany
> > Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> > AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

--
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
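P.S.: As a closing illustration of the keyspace-level repair proposal
in the quoted discussion above (repair only the base tables; the views
are repaired implicitly through the write path), here is a hedged
sketch; the names are hypothetical, not Cassandra's real API:

    import java.util.List;
    import java.util.stream.Collectors;

    public class KeyspaceRepairTables {

        record Table(String name, boolean isView) {}

        // "nodetool repair $keyspace" would repair only base tables;
        // the MVs are kept consistent via the base-table write path.
        static List<Table> tablesToRepair(List<Table> keyspaceTables) {
            return keyspaceTables.stream()
                    .filter(t -> !t.isView())
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Table> ks = List.of(
                    new Table("users", false),
                    new Table("users_by_email", true)); // MV, skipped
            System.out.println(tablesToRepair(ks));
            // -> [Table[name=users, isView=false]]
        }
    }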