[ https://issues.apache.org/jira/browse/CASSANDRA-11475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne updated CASSANDRA-11475: ----------------------------------------- Resolution: Fixed Fix Version/s: (was: 3.0.x) (was: 3.x) 3.0.7 3.7 Status: Resolved (was: Patch Available) Committed, thanks. (For info, the first run of CI wasn't happy because I wrote the {{removeByName}} method using the iterator {{remove()}} method but {{COWArrayList}} doesn't support that. Anyway, that was trivial to fix and after that CI was happy. I also had to rebase on trunk and I re-ran CI to make sure it was happy there too, which it was. Hence the slight delay committing) > MV code refactor > ---------------- > > Key: CASSANDRA-11475 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11475 > Project: Cassandra > Issue Type: Bug > Reporter: Sylvain Lebresne > Assignee: Sylvain Lebresne > Fix For: 3.7, 3.0.7 > > > While working on CASSANDRA-5546 I run into a problem with TTLs on MVs, which > looking more closely is a bug of the MV code. But one thing leading to > another I reviewed a good portion of the MV code and found the following > correction problems: > * If a base row is TTLed then even if an update remove that TTL the view > entry remained TTLed and expires, leading to an inconsistency. > * Due to calling the wrong ctor for {{LivenessInfo}}, when a TTL was set on > the base table, the view entry was living twice as long as the TTL. Again > leading to a temporary inconsistency. > * When reading existing data to compute view updates, all deletion > informations are completely ignored (the code uses a {{PartitionIterator}} > instead of an {{UnfilteredPartitionIterator}}). This is a serious issue since > it means some deletions could be totally ignored as far as views are > concerned especially when messages are delivered to a replica out of order. > I'll note that while the 2 previous points are relatively easy to fix, I > didn't find an easy and clean way to fix this one on the current code. > Further, I think the MV code in general has inefficiencies/code complexities > that should be avoidable: > * {{TemporalRow.Set}} is buffering both everything read and a pretty much > complete copy of the updates. That's a potentially high memory requirement. > We shouldn't have to copy the updates and we shouldn't buffer all reads but > rather work incrementally. > * {{TemporalRow}}/{{TemporalRow.Set}}/{{TemporalCell}} classes are somewhat > re-inventing the wheel. They are really just storing both an update we're > doing and the corresponding existing data, but we already have > {{Row}}/{{Partition}}/{{Cell}} for that. In practice, those {{Temporal*}} > class generates a lot of allocations that we could avoid. > * The code from CASSANDRA-10060 to avoid multiple reads of the base table > with multiple views doesn't work when the update has partition/range > tombstones because the code uses {{TemporalRow.Set.setTombstonedExisting()}} > to trigger reuse, but the {{TemporalRow.Set.withNewViewPrimaryKey()}} method > is used between view and it does not preseve the {{hasTombstonedExisting}} > flag. But that oversight, which is trivial to fix, is kind of a good thing > since if you fix it, you're left with a correction problem. > The read done when there is a partition deletion depends on the view itself > (if there is clustering filters in particular) and so reusing that read for > other views is wrong. Which makes that whole reuse code really dodgy imo: the > read for existing data is in {{View.java}}, suggesting that it depends on the > view (which again, it does at least for partition deletion), but it shouldn't > if we're going to reuse the result across multiple views. > * Even ignoring the previous point, we still potentially read the base table > twice if the update mix both row updates and partition/range deletions, > potentially re-reading the same values. > * It's probably more minor but the reading code is using {{QueryPager}}, > which is probably an artifact of the initial version of the code being > pre-8099, but it's not necessary anymore (the reads are local and locally > we're completely iterator based), adding, especially when we do page. I'll > note that despite using paging, the current code still buffers everything in > {{TemporalRow.Set}} anyway . > Overall, I suspect trying to fix the problems above (particularly the fact > that existing deletion infos are ignored) is only going to add complexity > with the current code and we'd still have to fix the inefficiencies. So I > propose a refactor of that code which does 2 main things: > # it removes all of {{TemporalRow}} and related classes. Instead, it directly > uses the existing {{Row}} (with all its deletion infos) and the update being > applied to it and compute the updates for the view from that. I submit that > this is more clear/simple, but this also avoid copying every cell of both the > existing and update data as a {{TemporalCell}}. We can also reuse codes like > {{Rows.merge}} and {{Rows.diff}} to make the handling of deletions relatively > painless. > # instead of dealing with each view one at a time, re-iterating over all > updates each time, it iterates over each individual updates once and deal > with each view for that update. This makes it more clear that the reads has > to care about every view involved, but more importantly allow to deal with > the read data incrementally, never buffering it all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)