Hi Kurt,

First of all, thanks for this elaborate post.
At this moment, I don't want to come up with a solution for all MV issues, but I would like to point out why I was quite active some time ago and why I pulled back. As you also mentioned in different words, it seems to me that MVs are an orphan in CS. They started out as a shiny and promising feature, but ... . When I came to CS, MVs were one of the reasons why I gave CS in general, and 3.0 in particular, a try. But when I started to work with MVs in production - willing to overcome the "little obstacles" and the fact they are "not quite stable" - I started to realize that there is almost no support from the community. The initial contributors turned their backs on MVs. All that remained is a 95% ready feature, a lot of public documentation, but no disclaimer that says "Please Do Not Use MVs".

And every time a discussion pops up around MVs, the bottom line is:
- All or most of the people involved do not have much experience with MVs
- Original contributors are not involved
- It seems to me that discussions are based more on assumptions or superficial knowledge than on real knowledge/experience/research/proofs
- Bringing in code changes is difficult for the same reasons. Nobody likes to take over the "old heritage" or take over responsibility for it. And it seems that nobody feels confident enough to bring in critical changes - "I don't want to touch this critical part in the code path, I know we have tests but ..."

Initially I was very eager to contribute and to help MVs mature, but over time it turned out to be very cumbersome and frustrating. Additionally, I have very little time left in my daily routine to work on CS. So I decided to work on a solution that solved our specific problems with CS and MVs. I am not really happy with it, but it actually works quite well. To be honest, I also had it in the back of my head to write a posting similar to yours. I would really like to contribute and bring MVs forward, but not at all costs.

I see many problems with MVs, even some that haven't been mentioned yet. But I do not want to come up with half-baked assumptions. What is really lacking for MVs is a reproducible, code-based proof of what works and what does not. One example is the question "Why can I add only a single column to an MV PK?". I have read arguments that I think are not quite right or "somehow incomplete". There are a lot of arguments and discussions that are totally scattered across JIRA, and it seems to me that every contributor knows a little bit of this and a little bit of that and remembers this post or that post. I was already thinking of setting up a super-reduced "storage mock" to prove / find edge cases in MV fail-and-repair scenarios, to answer questions like these with code instead of sentences like "I think that ..." or "I can remember a comment of ...". Unfortunately, dtests are super painful for things like that because a) they are f***** slow and b) it is super complicated to simulate a certain situation. I also did not see a simple way to do this with the CS unit test suite, as I didn't see a way to boot and control multiple storages there.

*What I miss is a central consensus about "MV business rules" + a central set of proofs and/or tests that support these rules and prove or falsify assumptions in a reproducible way.*

The reason why I did not already come up with something like that:
- Time
- Frustration

If I can see that there are more people who feel like that and are willing to work together to find a solid solution, my level of frustration could turn into motivation again.

--

Last but not least, for those who care:

One of the solutions I created was to implement our own version of Tickler (full table scans with CL_ALL to enforce read repair) to get rid of these damned built-in repairs, which simply don't work well (especially) for MVs. To name only a few numbers:
- We could bring down the repair time of a KS with RF=5 from 5 hours to 5 minutes. Really. I could not believe it.
- No more "compaction storms", piling up compaction queues, or compactions falling behind
- No more SSTables piling up. Before, it was normal that the number of SSTables went up from 300-400 to 5000 and more. After: no noticeable change. (Btw, that was the reason for CASSANDRA-12730. This isn't even bound to MVs; they maybe only amplify the impact of the underlying design.)
- We now repair the whole cluster in 16h (10 nodes, 400-450 GB load each, 14 KS). Before, we had single keyspaces that took more than a day to finish. Sometimes they took even 3 days with Reaper because of "Too many compactions"
- It showed us problems in our model. We had data that was not readable at all due to massive tombstones + read timeouts

... if someone is interested in more details, just ping me.

- Benjamin

2017-07-17 6:22 GMT+02:00 kurt greaves <k...@instaclustr.com>:

> wall of text inc.
> *tl;dr:* Aiming to come to some conclusions about what we are doing with
> MV's and how we are going to make them stable in production. But really
> just trying to raise awareness/involvement for MV's.
>
> It seems we've got an excess of MV bugs that pretty much make them
> completely unusable in production, or at least incredibly risky and also
> limited. It also appears that we don't have many people totally across MV's
> either (or at least a lack of people currently looking at them). To avoid
> us "forgetting" about MV's I'd like to raise the current issues and get
> opinions on the direction we should go with MV's. I know historically there
> was a lot of discussion about this, but it seems a lot of the originally
> involved are currently less involved, and thus before making wild changes
> to MV's it might be worth going back to the start and think through the
> original requirements and implementation.
>
> Probably worth summarising the original goals of MV's:
>
> - Maintain eventual consistency between base table and view tables
> - Provide mechanisms to repair consistency between base and views
> - Aim to keep convergence between base and view fast without sacrificing
>   availability (low MTTR)
> Goals that weren't explicitly mentioned but more or less implied:
>
> - Performance must be at least good enough to justify using them over
>   rolling-your-own. (we haven't really tried to measure this yet - only
>   measured in comparison to not-a-MV)
> - Allow a user to redefine their partitioning key
>
> And also a quick summary of *some* of the limitations in our implementation
> (there are more, but majority of our current problems revolve around
> these):
>
> 1. Primary key of the base table must be included in the view,
>    optionally one non-primary key column can be included in the view
>    primary key.
> 2. All columns in the view primary key must be declared NOT NULL.
> 3. Base tables and views are one-to-one. That is, a *primary key* in a
>    base maps to exactly one *primary key* in the view. Therefore you should
>    never expect multiple rows in the view for a partition with multiple
>    rows in the base.
>
>
> I've summarised the bulk of the outstanding bugs below (may have missed
> some), but notably it would be useful to get some decision-making happening
> on them. Fixing these bugs is a bit more involved and there is likely a few
> possible solutions and implications. Also they all pretty much touch the
> same parts of the code, so needs to be some collaboration across the
> patches (part of the reason I'm trying to bring more attention to them).
>
> CASSANDRA-13657 <https://issues.apache.org/jira/browse/CASSANDRA-13657> -
> Using a non-PK column in the view PK means that you can TTL that column in
> the base without TTLing the resulting view row. Potential solution is to
> change the definition of liveness info for view rows.
> This would probably
> work but makes moving away from the NOT NULL requirement on view PK's
> harder. Need to decide if that's what we want to do or if we pursue a
> different solution.
>
> CASSANDRA-13127 <https://issues.apache.org/jira/browse/CASSANDRA-13127> -
> Inserting with key with a TTL then updating the TTL on a column from the
> base that doesn't exist in the view doesn't update the liveness of the row
> in the MV, and thus the MV row expires before the base. The current
> proposed solution should work but will increase the amount of cases where
> we need to read the existing data. Needs some reviewing and wouldn't hurt
> to benchmark the changes.
>
> CASSANDRA-13547 <https://issues.apache.org/jira/browse/CASSANDRA-13547> -
> Being able to leave a column out of your SELECT but including it in the
> view filters causes some serious issues. Proposed fix is to force user to
> select all columns also included in where clause. This will potentially be
> a compatibility issue but *should* be fine as it only is checked on MV
> creation - so people upgrading shouldn't be affected (needs reviewing).
> Also another issue is addressed in the patch regarding timestamps - choice
> of timestamps led to rows not being deleted in the view. This comes back to
> the fact that we allow a non-PK column in the view PK. Needs more
> reviewing.
> Also related somewhat to 11500.
>
> CASSANDRA-13409 <https://issues.apache.org/jira/browse/CASSANDRA-13409> -
> Issues with shadowable tombstones. Has a patch but not sure if resolved
> based on Zhao's last comment. Another case of bringing data back in the
> view and thus making base and view inconsistent. Needs reviewing.
>
> CASSANDRA-11500 <https://issues.apache.org/jira/browse/CASSANDRA-11500>
> CASSANDRA-10965 <https://issues.apache.org/jira/browse/CASSANDRA-10965> -
> Both these appear to be instances of the same issue. Got a couple of
> potential solutions. Back to that problem of shadowable tombstones and
> timestamps.
> Pretty involved and would require an in depth review as
> decisions could greatly impact the complexity/usefulness of MV's.
>
> CASSANDRA-13069 <https://issues.apache.org/jira/browse/CASSANDRA-13069> -
> Node movements can cause inconsistencies. Paulo has written a patch but
> Sylvain has raised some concerns about our use of the local batchlog.
> Haven't confirmed myself but belief is that our eventual consistency
> guarantee is broken... :/ needs reviewing...
>
> CASSANDRA-12888 <https://issues.apache.org/jira/browse/CASSANDRA-12888> -
> Most people are probably aware of this one. Losing the repaired_at status
> for all MV streams as they are replayed through the write path. Has a
> potential solution in place for 4.x, but we need to commit to a work around
> for 3.11.x at least.
>
> CASSANDRA-12730 <https://issues.apache.org/jira/browse/CASSANDRA-12730> -
> This touches on some very common repair issues that we should probably look
> at, but I don't think it directly relates to MV's anymore. Might be worth
> removing the Materialized View component. (but this ticket probably still
> deserves a bit of attention).
>
> If anyone has been working on any of these tickets and no longer is able
> to, either update the ticket or let me know and I'll either take over/find
> some other poor soul to have a stab at it.
> It would also be nice to get some volunteers who are familiar with MV's to
> review the above tickets.
>
> Another thing I'm not sure of is that we are aiming to guarantee eventual
> consistency between base and view, however even with using the batchlog my
> understanding is we can't achieve this without some tool to synchronise the
> base with the view, however I don't think this tool currently exists and it
> seems like CASSANDRA-10346
> <https://issues.apache.org/jira/browse/CASSANDRA-10346> agrees... Can
> anyone clarify if this is actually a requirement for eventual consistency?
>
> My general advice these days is for users to steer clear of MV's for the
> moment, however we have no clear plan for when these will really be stable.
> I think as some of the changes to fix MV's may potentially require a major
> version change, we should at least aim to get all those in for 4.0
> (although still need to figure out what exactly these issues are).
> Interested to hear peoples thoughts.
>
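
PS: To make the "storage mock" idea from my reply a bit more concrete, here is a minimal sketch in Python of what I have in mind. Everything in it is an assumption made up for illustration: the Storage class, the drop_view_updates switch and the view keyed by (value, pk) are hypothetical and are not Cassandra code or its actual MV write path. It models a single storage with whole-row last-write-wins and ignores TTLs, tombstones and multiple replicas. It only shows the shape of a reproducible fail-and-repair check: lose one view update, detect the divergence, rebuild the view from the base, check again.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

ViewKey = Tuple[str, str]  # (value used as view key, base primary key)


@dataclass
class Cell:
    value: Optional[str]  # None means "no value", so no view row is expected
    ts: int               # write timestamp, last write wins


class Storage:
    """Hypothetical in-memory storage holding a base table and one view."""

    def __init__(self) -> None:
        self.base: Dict[str, Cell] = {}
        self.view: Dict[ViewKey, int] = {}
        self.drop_view_updates = False  # failure injection switch

    def write(self, pk: str, value: Optional[str], ts: int) -> None:
        old = self.base.get(pk)
        if old is not None and old.ts >= ts:
            return  # older or equal timestamp loses
        self.base[pk] = Cell(value, ts)
        if self.drop_view_updates:
            return  # simulate a view mutation that never arrived
        if old is not None and old.value is not None:
            self.view.pop((old.value, pk), None)  # remove the old view row
        if value is not None:
            self.view[(value, pk)] = ts

    def expected_view(self) -> Dict[ViewKey, int]:
        """The view as it should look when derived from the base alone."""
        return {(c.value, pk): c.ts
                for pk, c in self.base.items() if c.value is not None}

    def view_diff(self) -> Tuple[Set[ViewKey], Set[ViewKey]]:
        """Return (missing, stale) view keys compared to the base."""
        expected = self.expected_view()
        return set(expected) - set(self.view), set(self.view) - set(expected)

    def rebuild_view(self) -> None:
        """Naive 'repair': throw the view away and derive it from the base."""
        self.view = self.expected_view()


def scenario() -> Tuple[Tuple[Set[ViewKey], Set[ViewKey]],
                        Tuple[Set[ViewKey], Set[ViewKey]]]:
    node = Storage()
    node.write("user1", "a@example.com", ts=1)
    node.drop_view_updates = True
    node.write("user1", "b@example.com", ts=2)  # base updated, view is not
    node.drop_view_updates = False
    before = node.view_diff()
    node.rebuild_view()
    after = node.view_diff()
    return before, after


if __name__ == "__main__":
    print(scenario())
```

The point is not this particular model but that each disputed "MV business rule" could become one small scenario() like this, which either holds or fails in milliseconds instead of in a dtest run.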