Grmpf! 1000+ consecutive repairs must be wrong; I guess I mixed something up. But it did repair over and over again for 1 or 2 days.
2016-12-07 9:01 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> Hi Paulo,

> First of all, thanks for your review!

> I had the same concerns as you, but I thought it was being handled correctly (which it is in some situations); however, I found one case that creates the inconsistencies you mentioned. That is a kind of split-brain syndrome that occurs when multiple nodes fail between repairs. See here: https://cl.ly/3t0X1c0q1L1h.

> I am not happy about it, but I support your decision. We should then add another dtest to cover this scenario, as the existing dtests don't.

> Some issues unfortunately remain:
> - 12888 is not resolved.
> - MV repairs may still be f**** slow. Imagine an inconsistency of a single cell (which may also be due to a validation race condition, see CASSANDRA-12991) on a big partition. I had issues with Reaper and a 30-minute timeout leading to 1000+ (yes!) consecutive repairs of a single subrange, because it always timed out and I only recognized this very late. When I deployed 12888 on my system, this remaining subrange was repaired in a snap.
> - I guess rebuild works the same as repair and has to go through the write path, right?

> => An MV repair may induce so much overhead that it is maybe cheaper to kill and replace an inconsistent node than to repair it. But that may introduce inconsistencies again. All in all it is not perfect, and all of this does not really un-frustrate me 100%.

> Do you have any more thoughts?

> Unfortunately I have very little time these days, as my second child was born on Monday. So thanks for your support so far. Maybe I will have some ideas on these issues during the next few days, and I will probably work on that ticket next week to come to a solution that is at least deployable. I'd also appreciate your opinion on CASSANDRA-12991.

> 2016-12-07 2:53 GMT+01:00 Paulo Motta <pauloricard...@gmail.com>:

>> Hello Benjamin,

>> Thanks for your effort on this investigation! For bootstraps and range transfers, I think we can indeed simplify and stream base tables and MVs as ordinary tables, unless there is some caveat I'm missing (I didn't find any special case for bootstrap/range transfers in CASSANDRA-6477 or in the MV design doc; please correct me if I'm wrong).

>> Regarding repair of base tables, applying mutations via the write path is a matter of correctness: base table updates potentially need to remove previously referenced keys in the views, so repairing only the base table may leave unreferenced keys in the views, breaking the MV contract. Furthermore, these unreferenced keys may be propagated to other replicas and never removed if you repair only the view. If you don't do overwrites in the base table this is probably not a problem, but the DB cannot ensure that (at least not before CASSANDRA-9779). Also, as you already noticed, repairing only the base table is probably faster, so I don't see a reason to repair the base and MVs separately since that is potentially more costly. I believe your frustration is mostly due to the bug described in CASSANDRA-12905; after that and CASSANDRA-12888 are fixed, repair on the base table should work just fine.
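>> To make the unreferenced-key problem concrete, here is a minimal CQL sketch (the table, view and column names are made up for illustration):

>>     CREATE TABLE user_status (
>>         user_id int PRIMARY KEY,
>>         status  text
>>     );

>>     CREATE MATERIALIZED VIEW users_by_status AS
>>         SELECT * FROM user_status
>>         WHERE status IS NOT NULL AND user_id IS NOT NULL
>>         PRIMARY KEY (status, user_id);

>>     INSERT INTO user_status (user_id, status) VALUES (42, 'online');

>>     -- Overwrite in the base table:
>>     UPDATE user_status SET status = 'away' WHERE user_id = 42;

>>     -- To keep the view consistent, the view row ('online', 42) must be
>>     -- deleted and ('away', 42) inserted. Producing that deletion requires
>>     -- reading the existing base row, which is what the write path does.
>>     -- Simply copying base SSTables would leave ('online', 42) behind as an
>>     -- unreferenced key, and repairing only the view would then propagate
>>     -- that stale row to other view replicas instead of removing it.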
>> Based on this, I propose:
>> - Fix CASSANDRA-12905 with your original patch that retries acquiring the MV lock instead of throwing WriteTimeoutException during streaming, since this is blocking 3.10.
>> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables while still applying MV updates on the paired replicas.
>> - Create a new ticket to use ordinary streaming for non-repair MV stream sessions and keep the current behavior for MV streaming originating from repair.
>> - Create a new ticket to include only the base tables and not the MVs in keyspace-level repair, since repairing the base already repairs the views; this avoids people shooting themselves in the foot.

>> Please let me know what you think. Any suggestions or feedback are appreciated.

>> Cheers,

>> Paulo

>> 2016-12-02 8:27 GMT-02:00 Benjamin Roth <benjamin.r...@jaumo.com>:

>> > As I haven't received a single reply on this, I went ahead and implemented and tested it on my own with our production cluster. Bringing up a new node was a real pain, so I had to move on.

>> > Result:
>> > Works like a charm. I ran many dtests that relate in any way to storage, streaming, bootstrap, ... with good results.
>> > The bootstrap finished in under 5:30h, without a single error log during bootstrap. Afterwards, repairs also run smoothly and the cluster seems to operate quite well.

>> > I still need:
>> > - Reviews (see 12888, 12905, 12984).
>> > - An opinion on whether I handled the CDC case right. IMHO CDC is not required on bootstrap, and we don't need to send the mutations through the write path just to write the commit log; that would also break incremental repairs. Instead, for CDC the SSTables are streamed as normal but the mutations are additionally written to the commit log. The worst case I see is that the node crashes and the commit logs for those repair streams are replayed, leading to duplicate writes, which is not really critical and not a regular case. Any better ideas?
>> > - Docs have to be updated (12985) if the patch is accepted.

>> > I really appreciate ANY feedback. IMHO the impact of these fixes is immense and may be a huge step towards getting MVs production ready.

>> > Thank you very much,
>> > Benjamin

>> > ---------- Forwarded message ----------
>> > From: Benjamin Roth <benjamin.r...@jaumo.com>
>> > Date: 2016-11-29 17:04 GMT+01:00
>> > Subject: Streaming and MVs
>> > To: dev@cassandra.apache.org

>> > I don't know where else to discuss this issue, so I post it here.

>> > I have been trying to get Cassandra to run stably with MVs since the beginning of July. Normal reads and writes do work as expected, but when it comes to repairs or bootstrapping it still feels far, far away from what I would call fast and stable. The other day I just wanted to bootstrap a new node. I tried it twice.
>> > The first time, the bootstrap failed due to WTEs (WriteTimeoutExceptions). I fixed this issue by not timing out in streams, but then it turned out that the bootstrap (load roughly 250-300 GB) didn't even finish in 24h. What if I really had a problem and had to bring up some nodes fast? No way!

>> > I think the root cause of it all is the way streams are handled on tables with MVs. Sending them through the regular write path implies many bottlenecks and sometimes also redundant writes. Let me explain:

>> > 1. Bootstrap
>> > During a bootstrap, all ranges from all keyspaces and all CFs that will belong to the new node are streamed. MVs are treated like all other CFs, so all ranges that will move to the new node will also be streamed during bootstrap.
>> > Sending streams of the base tables through the write path has the following negative impacts:

>> > - Writes are sent to the commit log. This is not necessary: if the node is stopped during bootstrap, the bootstrap simply starts over, so there is no need to recover from commit logs. Non-MV tables won't have a CL anyway.
>> > - MV mutations are not applied instantly but are sent to the batchlog. This is of course necessary during the range movement (if the PK of the MV differs from the base table), but what happens is that the batchlog gets completely flooded. This leads to ridiculously large batchlogs (I observed batchlogs of 60 GB), zillions of compactions and quadrillions of tombstones. It is a pure resource killer, especially because the batchlog uses a CF as a queue.
>> > - Applying every mutation separately causes read-before-writes during the MV mutation. This is of course an order of magnitude slower than simply streaming down an SSTable. The effect becomes even worse as the bootstrap progresses and creates more and more (uncompacted) SSTables; many of them won't ever be compacted because the batchlog eats all the resources available for compaction.
>> > - Streaming down the MV tables AND applying the mutations of the base tables leads to redundant writes. Redundant writes are local if the PK of the MV == PK of the base table and, even worse, remote if not. Remote MV updates will impact nodes that aren't even part of the bootstrap (see the sketch at the end of this mail).
>> > - CDC should also not be necessary during bootstrap, should it? TBD

>> > 2. Repair
>> > The negative impact is similar to bootstrap, but:

>> > - Sending repairs through the write path does not mark the streamed tables as repaired (see CASSANDRA-12888). NOT doing so would instantly solve that issue and is much simpler than any other solution.
>> > - It changes the "repair design" a bit: repairing a base table will not automatically repair the MV. But is this bad at all? To be honest, as a newbie it was very hard for me to understand what I had to do to be sure that everything is repaired correctly. Recently I was told NOT to repair MV CFs but only the base tables. This means one cannot just call "nodetool repair $keyspace" - this is complicated, not transparent, and it sucks. I changed the behaviour in my own branch and ran the dtests for MVs. Two tests failed:
>> >   - base_replica_repair_test of course fails due to the design change.
>> >   - really_complex_repair_test fails because it intentionally times out the batchlog. IMHO this is a bearable situation. It is comparable to resurrected tombstones when running a repair after GCGS has expired; you also would not expect this to be magically fixed. The GCGS (gc_grace_seconds) default is 10 days, and I can expect that anybody also repairs their MVs during that period, not only the base table.

>> > 3. Rebuild
>> > Same as bootstrap, isn't it?

>> > Did I forget any cases?
>> > What do you think?
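>> > PS: a minimal sketch of the remote-writes point from the bootstrap section above (the schema is made up for illustration):

>> >     CREATE TABLE users (
>> >         id      int PRIMARY KEY,
>> >         country text
>> >     );

>> >     CREATE MATERIALIZED VIEW users_by_country AS
>> >         SELECT * FROM users
>> >         WHERE country IS NOT NULL AND id IS NOT NULL
>> >         PRIMARY KEY (country, id);

>> >     -- The base row for a given id is placed by the token of id, while the
>> >     -- view row is placed by the token of its country value, so the two
>> >     -- usually live on different replicas. Every streamed base mutation
>> >     -- that goes through the write path therefore produces a view update
>> >     -- that has to be shipped (via the batchlog) to the paired view
>> >     -- replica, a node that may not be part of the bootstrap at all.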
--
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer