[ https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294688#comment-14294688 ]
Benjamin Janssen commented on CASSANDRA-8542: --------------------------------------------- Anyone had a chance to look into this ticket yet? > Make get_range_slices and related succeed most of the time on tombstone heavy > column families > --------------------------------------------------------------------------------------------- > > Key: CASSANDRA-8542 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8542 > Project: Cassandra > Issue Type: New Feature > Components: Core > Reporter: Shawn Walker > Priority: Minor > Attachments: trunk-8542-InvertibleBloomReconciler.txt > > > I've got a ColumnFamily in which some rows end up being used in a queue-like > fashion, and where many tombstones tend to pile up at the left end of the > row. As a result, I run into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) > when trying to list the columns of said rows. > Please don't yell at me too loudly. I know the issue I'm describing will > generally be considered as being due to poor use of Cassandra. I understand > the rationale of the current behavior, and am hesitant to play with fire by > increasing {{tombstone_fail_threshold}} to a high value. Instead, what I > propose is an alternate method of resolving such queries on the read path. > ---- > This is based on the following observation: on the coordinator node, when > {{RangeSliceResponseResolver}} is resolving a range slice query, it needn't > be aware of all tombstones that all responding nodes have within the > specified range. Instead, it would suffice if it could determine only those > tombstones which aren't shared by all responding nodes. E.g. if node #1 > responds with tombstones (A, B, D), node #2 responds with tombstones (A, B, > C), and node #3 responds with tombstones (A, B, C, D), > {{RangeSliceResponseResolver}} need only actually know about the tombstones > (C, D): All of the responders will already have removed any relevant data for > the tombstones (A, B) from their individual responses. > Arranging for {{RangeSliceResponseResolver}} to discover only the non-common > tombstones can be accomplished by using a variant of the "invertible bloom > filter" data structure described in e.g. > http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf. Using > an invertible Bloom filter, each responding node can build a (roughly) fixed > size data structure holding a representation of all the tombstones that node > encounters. The coordinator node can then combine the invertible Bloom > filters. If there aren't too many non-common tombstones, the coordinator > node will be able to enumerate all of them, and so resolve the range slice > query. > I see accomplishing this in a few discrete steps: > 1. Implement an appropriate variant of the "invertible bloom filter". I've > started this already, and will shortly upload a patch including my > {{InvertibleBloomReconciler}} implementation. From a black box perspective, > {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}} demonstrates > how this data structure and algorithm could be used. > 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into > {{TombstoneReconciler}}, a structure for spilling tombstones into. Refactor > large swaths of {{org.apache.cassandra.db}}'s read path to accomodate the > possibility of placing tombstones discovered during a read into a > {{TombstoneReconciler}} instead of returning them within a {{Row}}, > {{List<Row>}}, {{ColumnFamily}}, etc. I've attempted a start on this, though > this means carrying the {{TombstoneReconciler}} around through most of > {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other > places I haven't yet discovered. I'm wondering if I wouldn't be better off > waiting for CASSANDRA-8099 before starting this step -- a fully iterator flow > through {{org.apache.cassandra.db}} could make this easier, cleaner, and have > significantly lower code impact. > 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters > for the IBR (table size, hash count, seed), possibly by making these part of > {{CFMetaData}}. Allow a {{RangeSliceResponse}} to optionally return a > {{TombstoneReconciler}} in addition to its {{List<Row>}}. Refactor > {{RangeSliceResponseResolver}} to be capable of handling > {{TombstoneReconciler}} s. Possibly refactor > {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses > contained a {{TombstoneReconciler}}. > Since the invertible bloom filter is a probabilistic data structure, it is > possible that resolving a query in this manner could fail. As such, I'm > proposing that the current read path not make use of the > {{TombstoneReconciler}} idea unless it would otherwise encounter a > {{TombstoneOverwhelming}} situation. > Also, there is an astronomically minute possibility (on the order if 1 / > 2^120 ) that tombstones reconciled via a {{TombstoneReconciler}} could be bad > data. Possibly less astronomical if someone were attempting to maliciously > abuse this functionality, given the fact that I'm using the Murmur3 hash > instead of a cryptographic hash within {{InvertibleBloomReconciler}}. As > such, I'm suggesting that read repair not be enabled on this path, just in > case. > ---- > Any thoughts? Would there be any interest in an implementation as I've > described? Any suggestions on who I might ask for help navigating > {{org.apache.cassandra.db}} if/when I run into trouble there? -- This message was sent by Atlassian JIRA (v6.3.4#6332)