[
https://issues.apache.org/jira/browse/CASSANDRA-11475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269297#comment-15269297
]
Carl Yeksigian commented on CASSANDRA-11475:
--------------------------------------------
These changes look good; I kicked off CI again, so assuming cassci is happy,
I'm +1.
> MV code refactor
> ----------------
>
> Key: CASSANDRA-11475
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11475
> Project: Cassandra
> Issue Type: Bug
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Fix For: 3.0.x, 3.x
>
>
> While working on CASSANDRA-5546 I run into a problem with TTLs on MVs, which
> looking more closely is a bug of the MV code. But one thing leading to
> another I reviewed a good portion of the MV code and found the following
> correction problems:
> * If a base row is TTLed then even if an update remove that TTL the view
> entry remained TTLed and expires, leading to an inconsistency.
> * Due to calling the wrong ctor for {{LivenessInfo}}, when a TTL was set on
> the base table, the view entry was living twice as long as the TTL. Again
> leading to a temporary inconsistency.
> * When reading existing data to compute view updates, all deletion
> informations are completely ignored (the code uses a {{PartitionIterator}}
> instead of an {{UnfilteredPartitionIterator}}). This is a serious issue since
> it means some deletions could be totally ignored as far as views are
> concerned especially when messages are delivered to a replica out of order.
> I'll note that while the 2 previous points are relatively easy to fix, I
> didn't find an easy and clean way to fix this one on the current code.
> Further, I think the MV code in general has inefficiencies/code complexities
> that should be avoidable:
> * {{TemporalRow.Set}} is buffering both everything read and a pretty much
> complete copy of the updates. That's a potentially high memory requirement.
> We shouldn't have to copy the updates and we shouldn't buffer all reads but
> rather work incrementally.
> * {{TemporalRow}}/{{TemporalRow.Set}}/{{TemporalCell}} classes are somewhat
> re-inventing the wheel. They are really just storing both an update we're
> doing and the corresponding existing data, but we already have
> {{Row}}/{{Partition}}/{{Cell}} for that. In practice, those {{Temporal*}}
> class generates a lot of allocations that we could avoid.
> * The code from CASSANDRA-10060 to avoid multiple reads of the base table
> with multiple views doesn't work when the update has partition/range
> tombstones because the code uses {{TemporalRow.Set.setTombstonedExisting()}}
> to trigger reuse, but the {{TemporalRow.Set.withNewViewPrimaryKey()}} method
> is used between view and it does not preseve the {{hasTombstonedExisting}}
> flag. But that oversight, which is trivial to fix, is kind of a good thing
> since if you fix it, you're left with a correction problem.
> The read done when there is a partition deletion depends on the view itself
> (if there is clustering filters in particular) and so reusing that read for
> other views is wrong. Which makes that whole reuse code really dodgy imo: the
> read for existing data is in {{View.java}}, suggesting that it depends on the
> view (which again, it does at least for partition deletion), but it shouldn't
> if we're going to reuse the result across multiple views.
> * Even ignoring the previous point, we still potentially read the base table
> twice if the update mix both row updates and partition/range deletions,
> potentially re-reading the same values.
> * It's probably more minor but the reading code is using {{QueryPager}},
> which is probably an artifact of the initial version of the code being
> pre-8099, but it's not necessary anymore (the reads are local and locally
> we're completely iterator based), adding, especially when we do page. I'll
> note that despite using paging, the current code still buffers everything in
> {{TemporalRow.Set}} anyway .
> Overall, I suspect trying to fix the problems above (particularly the fact
> that existing deletion infos are ignored) is only going to add complexity
> with the current code and we'd still have to fix the inefficiencies. So I
> propose a refactor of that code which does 2 main things:
> # it removes all of {{TemporalRow}} and related classes. Instead, it directly
> uses the existing {{Row}} (with all its deletion infos) and the update being
> applied to it and compute the updates for the view from that. I submit that
> this is more clear/simple, but this also avoid copying every cell of both the
> existing and update data as a {{TemporalCell}}. We can also reuse codes like
> {{Rows.merge}} and {{Rows.diff}} to make the handling of deletions relatively
> painless.
> # instead of dealing with each view one at a time, re-iterating over all
> updates each time, it iterates over each individual updates once and deal
> with each view for that update. This makes it more clear that the reads has
> to care about every view involved, but more importantly allow to deal with
> the read data incrementally, never buffering it all.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)