Grmpf! 1000+ consecutive repairs must be wrong; I guess I mixed something up. But it did repair over and over again for 1 or 2 days.
2016-12-07 9:01 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> Hi Paulo,

> First of all, thanks for your review!

> I had the same concerns as you, but I thought it was being handled correctly (which it is in some situations); however, I found one case that creates the inconsistencies you mentioned. That is a kind of split-brain syndrome that occurs when multiple nodes fail between repairs. See here: https://cl.ly/3t0X1c0q1L1h.

> I am not happy about it, but I support your decision. We should then add another dtest to cover this scenario, as the existing dtests don't.

> Some issues unfortunately remain:
> - 12888 is not resolved.
> - MV repairs may still be f**** slow. Imagine an inconsistency of a single cell (which may also be due to a validation race condition, see CASSANDRA-12991) on a big partition. I had issues with Reaper and a 30-minute timeout leading to 1000+ (yes!) consecutive repairs of a single subrange, because it always timed out and I only recognized this very late. When I deployed 12888 on my system, this remaining subrange was repaired in a snap.
> - I guess rebuild works the same as repair and has to go through the write path, right?

> => An MV repair may induce so much overhead that it is maybe cheaper to kill and replace an inconsistent node than to repair it. But that may introduce inconsistencies again. All in all it is not perfect, and all of this does not really un-frustrate me 100%.

> Do you have any more thoughts?

> Unfortunately I have very little time these days, as my second child was born on Monday. So thanks for your support so far. Maybe I will have some ideas on these issues during the next few days, and I will probably work on that ticket next week to come to a solution that is at least deployable. I'd also appreciate your opinion on CASSANDRA-12991.

> 2016-12-07 2:53 GMT+01:00 Paulo Motta <pauloricard...@gmail.com>:

>> Hello Benjamin,

>> Thanks for your effort on this investigation! For bootstraps and range transfers, I think we can indeed simplify and stream base tables and MVs as ordinary tables, unless there is some caveat I'm missing (I didn't find any special case for bootstrap/range transfers in CASSANDRA-6477 or in the MV design doc; please correct me if I'm wrong).

>> Regarding repair of base tables, applying mutations via the write path is a matter of correctness: base table updates potentially need to remove previously referenced keys in the views, so repairing only the base table may leave unreferenced keys in the views, breaking the MV contract. Furthermore, these unreferenced keys may be propagated to other replicas and never removed if you repair only the view. If you don't do overwrites in the base table this is probably not a problem, but the DB cannot ensure that (at least not before CASSANDRA-9779). Also, as you already noticed, repairing only the base table is probably faster, so I don't see a reason to repair the base and MVs separately since that is potentially more costly. I believe your frustration is mostly due to the bug described in CASSANDRA-12905; after that and CASSANDRA-12888 are fixed, repair on the base table should work just fine.
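>> To make the unreferenced-key problem concrete, here is a minimal CQL sketch (the table, view and column names are made up for illustration):

>>     CREATE TABLE user_status (
>>         user_id int PRIMARY KEY,
>>         status  text
>>     );

>>     CREATE MATERIALIZED VIEW users_by_status AS
>>         SELECT * FROM user_status
>>         WHERE status IS NOT NULL AND user_id IS NOT NULL
>>         PRIMARY KEY (status, user_id);

>>     INSERT INTO user_status (user_id, status) VALUES (42, 'online');

>>     -- Overwrite in the base table:
>>     UPDATE user_status SET status = 'away' WHERE user_id = 42;

>>     -- To keep the view consistent, the view row ('online', 42) must be
>>     -- deleted and ('away', 42) inserted. Producing that deletion requires
>>     -- reading the existing base row, which is what the write path does.
>>     -- Simply copying base SSTables would leave ('online', 42) behind as an
>>     -- unreferenced key, and repairing only the view would then propagate
>>     -- that stale row to other view replicas instead of removing it.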
>> Based on this, I propose:
>> - Fix CASSANDRA-12905 with your original patch that retries acquiring the MV lock instead of throwing WriteTimeoutException during streaming, since this is blocking 3.10.
>> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables while still applying MV updates on the paired replicas.
>> - Create a new ticket to use ordinary streaming for non-repair MV stream sessions and keep the current behavior for MV streaming originating from repair.
>> - Create a new ticket to include only the base tables and not the MVs in keyspace-level repair, since repairing the base already repairs the views; this avoids people shooting themselves in the foot.

>> Please let me know what you think. Any suggestions or feedback are appreciated.

>> Cheers,

>> Paulo

>> 2016-12-02 8:27 GMT-02:00 Benjamin Roth <benjamin.r...@jaumo.com>:

>> > As I haven't received a single reply on this, I went ahead and implemented and tested it on my own with our production cluster. Bringing up a new node was a real pain, so I had to move on.

>> > Result:
>> > Works like a charm. I ran many dtests that relate in any way to storage, streaming, bootstrap, ... with good results.
>> > The bootstrap finished in under 5:30h, without a single error log during bootstrap. Afterwards, repairs also run smoothly and the cluster seems to operate quite well.

>> > I still need:
>> > - Reviews (see 12888, 12905, 12984).
>> > - An opinion on whether I handled the CDC case right. IMHO CDC is not required on bootstrap, and we don't need to send the mutations through the write path just to write the commit log; that would also break incremental repairs. Instead, for CDC the SSTables are streamed as normal but the mutations are additionally written to the commit log. The worst case I see is that the node crashes and the commit logs for those repair streams are replayed, leading to duplicate writes, which is not really critical and not a regular case. Any better ideas?
>> > - Docs have to be updated (12985) if the patch is accepted.

>> > I really appreciate ANY feedback. IMHO the impact of these fixes is immense and may be a huge step towards getting MVs production ready.

>> > Thank you very much,
>> > Benjamin

>> > ---------- Forwarded message ----------
>> > From: Benjamin Roth <benjamin.r...@jaumo.com>
>> > Date: 2016-11-29 17:04 GMT+01:00
>> > Subject: Streaming and MVs
>> > To: dev@cassandra.apache.org

>> > I don't know where else to discuss this issue, so I post it here.

>> > I have been trying to get Cassandra to run stably with MVs since the beginning of July. Normal reads and writes do work as expected, but when it comes to repairs or bootstrapping it still feels far, far away from what I would call fast and stable. The other day I just wanted to bootstrap a new node. I tried it twice.
>> > The first time, the bootstrap failed due to WTEs (WriteTimeoutExceptions). I fixed this issue by not timing out in streams, but then it turned out that the bootstrap (load roughly 250-300 GB) didn't even finish in 24h. What if I really had a problem and had to bring up some nodes fast? No way!

>> > I think the root cause of it all is the way streams are handled on tables with MVs. Sending them through the regular write path implies many bottlenecks and sometimes also redundant writes. Let me explain:

>> > 1. Bootstrap
>> > During a bootstrap, all ranges from all keyspaces and all CFs that will belong to the new node are streamed. MVs are treated like all other CFs, so all ranges that will move to the new node will also be streamed during bootstrap.
>> > Sending streams of the base tables through the write path has the following negative impacts:

>> > - Writes are sent to the commit log. This is not necessary: if the node is stopped during bootstrap, the bootstrap simply starts over, so there is no need to recover from commit logs. Non-MV tables won't have a CL anyway.
>> > - MV mutations are not applied instantly but are sent to the batchlog. This is of course necessary during the range movement (if the PK of the MV differs from the base table), but what happens is that the batchlog gets completely flooded. This leads to ridiculously large batchlogs (I observed batchlogs of 60 GB), zillions of compactions and quadrillions of tombstones. It is a pure resource killer, especially because the batchlog uses a CF as a queue.
>> > - Applying every mutation separately causes read-before-writes during the MV mutation. This is of course an order of magnitude slower than simply streaming down an SSTable. The effect becomes even worse as the bootstrap progresses and creates more and more (uncompacted) SSTables; many of them won't ever be compacted because the batchlog eats all the resources available for compaction.
>> > - Streaming down the MV tables AND applying the mutations of the base tables leads to redundant writes. Redundant writes are local if the PK of the MV == PK of the base table and, even worse, remote if not. Remote MV updates will impact nodes that aren't even part of the bootstrap (see the sketch at the end of this mail).
>> > - CDC should also not be necessary during bootstrap, should it? TBD

>> > 2. Repair
>> > The negative impact is similar to bootstrap, but:

>> > - Sending repairs through the write path does not mark the streamed tables as repaired (see CASSANDRA-12888). NOT doing so would instantly solve that issue and is much simpler than any other solution.
>> > - It changes the "repair design" a bit: repairing a base table will not automatically repair the MV. But is this bad at all? To be honest, as a newbie it was very hard for me to understand what I had to do to be sure that everything is repaired correctly. Recently I was told NOT to repair MV CFs but only the base tables. This means one cannot just call "nodetool repair $keyspace" - this is complicated, not transparent, and it sucks. I changed the behaviour in my own branch and ran the dtests for MVs. Two tests failed:
>> >   - base_replica_repair_test of course fails due to the design change.
>> >   - really_complex_repair_test fails because it intentionally times out the batchlog. IMHO this is a bearable situation. It is comparable to resurrected tombstones when running a repair after GCGS has expired; you also would not expect this to be magically fixed. The GCGS (gc_grace_seconds) default is 10 days, and I can expect that anybody also repairs their MVs during that period, not only the base table.

>> > 3. Rebuild
>> > Same as bootstrap, isn't it?

>> > Did I forget any cases?
>> > What do you think?
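>> > PS: a minimal sketch of the remote-writes point from the bootstrap section above (the schema is made up for illustration):

>> >     CREATE TABLE users (
>> >         id      int PRIMARY KEY,
>> >         country text
>> >     );

>> >     CREATE MATERIALIZED VIEW users_by_country AS
>> >         SELECT * FROM users
>> >         WHERE country IS NOT NULL AND id IS NOT NULL
>> >         PRIMARY KEY (country, id);

>> >     -- The base row for a given id is placed by the token of id, while the
>> >     -- view row is placed by the token of its country value, so the two
>> >     -- usually live on different replicas. Every streamed base mutation
>> >     -- that goes through the write path therefore produces a view update
>> >     -- that has to be shipped (via the batchlog) to the paired view
>> >     -- replica, a node that may not be part of the bootstrap at all.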
--
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer