Hi Paulo,

First of all, thanks for your review!

I had the same concerns as you, but I thought it was being handled
correctly (which it is in some situations). I have now found a case that
creates the inconsistencies you mentioned. It is a kind of split-brain
syndrome that occurs when multiple nodes fail between repairs. See here:
https://cl.ly/3t0X1c0q1L1h. I am not happy about it, but I support your
decision. We should then add another dtest for this scenario, as the
existing dtests don't cover it.

Some issues unfortunately remain:

- 12888 is not resolved.
- MV repairs may still be f**** slow. Imagine an inconsistency of a
  single cell (possibly also due to a validation race condition, see
  CASSANDRA-12991) on a big partition. I had issues with Reaper and a
  30-minute timeout leading to 1000+ (yes!) consecutive repairs of a
  single subrange, because it always timed out and I recognized this
  very late. When I deployed 12888 on my system, this remaining subrange
  was repaired in a snap. (A rough cost sketch follows this list.)
- I guess rebuild works the same as repair and has to go through the
  write path, right? => MV repair may induce so much overhead that it
  could be cheaper to kill and replace an inconsistent node than to
  repair it. But that may introduce inconsistencies again.
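To make the overhead point concrete, here is a minimal toy cost model
(all names are made up; this is not Cassandra code). It only counts
logical operations, but it shows why a single inconsistent cell on a
big partition is so expensive when the repair stream goes through the
MV write path, and why sstable-based streaming (12888) fixes it:

    import java.util.Collections;
    import java.util.List;

    public class StreamApplyCost {

        // sstable-based streaming: the incoming file section is linked
        // in as one unit, regardless of how many rows it holds.
        static long sstableApply(List<String> rows) {
            return 1;
        }

        // write-path streaming: every row costs a read-before-write (to
        // find view rows to invalidate), a view mutation (possibly
        // remote) and a batchlog append.
        static long writePathApply(List<String> rows) {
            long ops = 0;
            for (String row : rows) {
                ops += 3; // read old row + view mutation + batchlog entry
            }
            return ops;
        }

        public static void main(String[] args) {
            // one mismatching cell forces the whole partition to stream
            List<String> partition = Collections.nCopies(1_000_000, "row");
            System.out.println("sstable apply:    " + sstableApply(partition));
            System.out.println("write-path apply: " + writePathApply(partition));
        }
    }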
All in all it is not perfect, and all of this does not really
un-frustrate me 100%. Do you have any more thoughts?

Unfortunately I have very little time these days, as my second child
was born on Monday. So thanks for your support so far. Maybe I will
have some ideas on these issues during the next few days, and I will
probably work on that ticket next week to arrive at a solution that is
at least deployable. I'd also appreciate your opinion on
CASSANDRA-12991.

2016-12-07 2:53 GMT+01:00 Paulo Motta <pauloricard...@gmail.com>:

> Hello Benjamin,
>
> Thanks for your effort on this investigation! For bootstraps and range
> transfers, I think we can indeed simplify and stream base tables and
> MVs as ordinary tables, unless there is some caveat I'm missing (I
> didn't find any special case for bootstrap/range transfers in
> CASSANDRA-6477 or in the MV design doc, please correct me if I'm
> wrong).
>
> Regarding repair of base tables, applying mutations via the write path
> is a matter of correctness, given that a base table update potentially
> needs to remove previously referenced keys in the views; repairing
> only the base table may therefore leave unreferenced keys in the
> views, breaking the MV contract. Furthermore, these unreferenced keys
> may be propagated to other replicas and never removed if you repair
> only the view. If you don't do overwrites in the base table, this is
> probably not a problem, but the DB cannot ensure this (at least not
> before CASSANDRA-9779). Furthermore, as you already noticed, repairing
> only the base table is probably faster, so I don't see a reason to
> repair the base and the MVs separately, since that is potentially more
> costly. I believe your frustration is mostly due to the bug described
> in CASSANDRA-12905, but after that and CASSANDRA-12888 are fixed,
> repair on the base table should work just fine.
>
> Based on this, I propose:
>
> - Fix CASSANDRA-12905 with your original patch that retries acquiring
>   the MV lock instead of throwing a WriteTimeoutException during
>   streaming, since this is blocking 3.10.
> - Fix CASSANDRA-12888 by doing sstable-based streaming for base tables
>   while still applying MV updates on the paired replicas.
> - Create a new ticket to use ordinary streaming for non-repair MV
>   stream sessions and keep the current behavior for MV streaming
>   originating from repair.
> - Create a new ticket to include only the base tables, and not the
>   MVs, in keyspace-level repair, since repairing the base already
>   repairs the views; this avoids people shooting themselves in the
>   foot.
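The correctness point above (unreferenced view keys after an overwrite)
can be illustrated with a small toy model; the names are hypothetical
and this is not Cassandra's actual code. The base table maps
user_id -> email, the view maps email -> user_id, i.e. the view PK
differs from the base PK:

    import java.util.HashMap;
    import java.util.Map;

    public class UnreferencedViewKey {
        static Map<Integer, String> base = new HashMap<>();
        static Map<String, Integer> view = new HashMap<>();

        // Write path: reads the old base row first, so the previously
        // referenced view key can be deleted before the new one is
        // added.
        static void writePath(int userId, String email) {
            String old = base.put(userId, email);
            if (old != null) {
                view.remove(old); // tombstone for the old view row
            }
            view.put(email, userId);
        }

        // Repair that copies base (and view) data without the
        // read-before-write: nothing ever deletes the old view key.
        static void blindRepair(int userId, String email) {
            base.put(userId, email);
            view.put(email, userId);
        }

        public static void main(String[] args) {
            writePath(1, "a@example.com");
            blindRepair(1, "b@example.com"); // overwrite arrives via repair
            // Prints {a@example.com=1, b@example.com=1}: a@example.com
            // is now an unreferenced view key that a view-side repair
            // could even propagate to other replicas.
            System.out.println(view);
        }
    }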
> Please let me know what you think. Any suggestions or feedback are
> appreciated.
>
> Cheers,
>
> Paulo
>
> 2016-12-02 8:27 GMT-02:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>
> > As I hadn't received a single reply to that, I went ahead and
> > implemented and tested it on my own with our production cluster. I
> > had real pain bringing up a new node, so I had to move on.
> >
> > Result:
> > Works like a charm. I ran many dtests that relate in any way to
> > storage, streaming, bootstrap, ... with good results. The bootstrap
> > finished in under 5:30h, without a single error log during
> > bootstrap. Also afterwards, repairs run smoothly and the cluster
> > seems to operate quite well.
> >
> > I still need:
> >
> > - Reviews (see 12888, 12905, 12984)
> > - Some opinion on whether I handled the CDC case right. IMHO CDC is
> >   not required on bootstrap, and we don't need to send the mutations
> >   through the write path just to write the commit log; doing so
> >   would also break incremental repairs. Instead, for CDC the
> >   sstables are streamed as normal, but the mutations are
> >   additionally written to the commit log. The worst case I see is
> >   that the node crashes and the commit logs for those repair streams
> >   are replayed, leading to duplicate writes, which is not really
> >   crucial and not a regular case. Any better ideas? (A sketch of
> >   this follows right after this mail.)
> > - Docs have to be updated (12985) if the patch is accepted
> >
> > I really appreciate ANY feedback. IMHO the impact of these fixes is
> > immense, and they may be a huge step towards getting MVs
> > production-ready.
> >
> > Thank you very much,
> > Benjamin
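Here is a minimal sketch of the CDC case from the list above, under the
assumption that streamed sstables are linked in directly and the
contained mutations are appended to the commit log only when CDC is
enabled; the names are illustrative, not Cassandra's real API:

    import java.util.List;

    public class CdcAwareStreamReceiver {

        interface CommitLog {
            void append(String mutation);
        }

        static void receive(List<String> rows, boolean cdcEnabled,
                            CommitLog log) {
            linkSstable(rows); // normal fast path for streamed data
            if (cdcEnabled) {
                // CDC consumers read the commit log, so streamed rows
                // must show up there too. Worst case, after a crash the
                // segment is replayed and the rows are applied twice,
                // which only means duplicate writes.
                for (String row : rows) {
                    log.append(row);
                }
            }
        }

        static void linkSstable(List<String> rows) {
            System.out.println("linked sstable with " + rows.size() + " rows");
        }

        public static void main(String[] args) {
            receive(List.of("row1", "row2"), true,
                    m -> System.out.println("commitlog <- " + m));
        }
    }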
> > ---------- Forwarded message ----------
> > From: Benjamin Roth <benjamin.r...@jaumo.com>
> > Date: 2016-11-29 17:04 GMT+01:00
> > Subject: Streaming and MVs
> > To: dev@cassandra.apache.org
> >
> > I don't know where else to discuss this issue, so I am posting it
> > here.
> >
> > I have been trying to get Cassandra to run stably with MVs since the
> > beginning of July. Normal reads + writes work as expected, but when
> > it comes to repairs or bootstrapping, it still feels far, far away
> > from what I would call fast and stable. The other day I just wanted
> > to bootstrap a new node. I tried it twice.
> > The first time, the bootstrap failed due to WTEs. I fixed this issue
> > by not timing out in streams, but then it turned out that the
> > bootstrap (load roughly 250-300 GB) didn't even finish in 24h. What
> > if I really had a problem and had to get up some nodes fast? No way!
> >
> > I think the root cause of it all is the way streams are handled on
> > tables with MVs. Sending them through the regular write path implies
> > so many bottlenecks and sometimes also redundant writes. Let me
> > explain:
> >
> > 1. Bootstrap
> > During a bootstrap, all ranges from all keyspaces and all CFs that
> > will belong to the new node are streamed. MVs are treated like all
> > other CFs, and all ranges that will move to the new node are also
> > streamed during bootstrap. Sending streams of the base tables
> > through the write path has the following negative impacts:
> >
> > - Writes are sent to the commit log. Not necessary: when a node is
> >   stopped during bootstrap, the bootstrap simply starts over, so
> >   there is no need to recover from commit logs. Non-MV tables won't
> >   have a CL anyway.
> > - MV mutations are not applied instantly but are sent to the
> >   batchlog. This is of course necessary during the range movement
> >   (if the PK of the MV differs from the base table), but what
> >   happens: the batchlog gets completely flooded. This leads to
> >   ridiculously large batchlogs (I observed BLs of 60GB), zillions of
> >   compactions and quadrillions of tombstones. It is a pure resource
> >   killer, especially because the BL uses a CF as a queue.
> > - Applying every mutation separately causes read-before-writes
> >   during the MV mutation. This is of course an order of magnitude
> >   slower than simply streaming down an sstable. The effect becomes
> >   even worse as the bootstrap progresses and creates more and more
> >   (uncompacted) sstables; many of them will never be compacted
> >   because the batchlog eats all the resources available for
> >   compaction.
> > - Streaming down the MV tables AND applying the mutations of the
> >   base tables leads to redundant writes. The redundant writes are
> >   local if the PK of the MV == the PK of the base table and, even
> >   worse, remote if not. Remote MV updates will impact nodes that
> >   aren't even part of the bootstrap.
> > - CDC should also not be necessary during bootstrap, should it? TBD
> >
> > 2. Repair
> > The negative impact is similar to bootstrap, but ...
> >
> > - Sending repairs through the write path does not mark the streamed
> >   tables as repaired. See CASSANDRA-12888. NOT doing so instantly
> >   solves that issue, much more simply than any other solution.
> > - It changes the "repair design" a bit. Repairing a base table will
> >   not automatically repair the MV. But is this bad at all? To be
> >   honest, as a newbie it was very hard for me to understand what I
> >   had to do to be sure that everything was repaired correctly.
> >   Recently I was told NOT to repair MV CFs but only to repair the
> >   base tables. This means one cannot just call "nodetool repair
> >   $keyspace" - this is complicated, not transparent, and it sucks. I
> >   changed the behaviour in my own branch and ran the dtests for MVs.
> >   2 tests failed:
> >   - base_replica_repair_test of course fails due to the design
> >     change
> >   - really_complex_repair_test fails because it intentionally times
> >     out the batchlog. IMHO this is a bearable situation. It is
> >     comparable to resurrected tombstones when running a repair after
> >     GCGS has expired; you also would not expect that to be magically
> >     fixed. The GCGS default is 10 days, and I can expect that
> >     anybody also repairs their MVs during that period, not only the
> >     base table.
> >
> > 3. Rebuild
> > Same as bootstrap, isn't it?
> >
> > Did I forget any cases?
> > What do you think?
> >
> > --
> > Benjamin Roth
> > Prokurist
> >
> > Jaumo GmbH · www.jaumo.com
> > Wehrstraße 46 · 73035 Göppingen · Germany
> > Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> > AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

--
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
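P.S.: As a closing illustration of the keyspace-level repair proposal
in the quoted discussion above (repair only the base tables; the views
are repaired implicitly through the write path), here is a hedged
sketch; the names are hypothetical, not Cassandra's real API:

    import java.util.List;
    import java.util.stream.Collectors;

    public class KeyspaceRepairTables {

        record Table(String name, boolean isView) {}

        // "nodetool repair $keyspace" would repair only base tables;
        // the MVs are kept consistent via the base-table write path.
        static List<Table> tablesToRepair(List<Table> keyspaceTables) {
            return keyspaceTables.stream()
                    .filter(t -> !t.isView())
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Table> ks = List.of(
                    new Table("users", false),
                    new Table("users_by_email", true)); // MV, skipped
            System.out.println(tablesToRepair(ks));
            // -> [Table[name=users, isView=false]]
        }
    }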