[
https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294688#comment-14294688
]
Benjamin Janssen commented on CASSANDRA-8542:
---------------------------------------------
Anyone had a chance to look into this ticket yet?
> Make get_range_slices and related succeed most of the time on tombstone heavy
> column families
> ---------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-8542
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8542
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Shawn Walker
> Priority: Minor
> Attachments: trunk-8542-InvertibleBloomReconciler.txt
>
>
> I've got a ColumnFamily in which some rows end up being used in a queue-like
> fashion, and where many tombstones tend to pile up at the left end of the
> row. As a result, I run into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117)
> when trying to list the columns of said rows.
> Please don't yell at me too loudly. I know the issue I'm describing will
> generally be considered as being due to poor use of Cassandra. I understand
> the rationale of the current behavior, and am hesitant to play with fire by
> increasing {{tombstone_fail_threshold}} to a high value. Instead, what I
> propose is an alternate method of resolving such queries on the read path.
> ----
> This is based on the following observation: on the coordinator node, when
> {{RangeSliceResponseResolver}} is resolving a range slice query, it needn't
> be aware of all tombstones that all responding nodes have within the
> specified range. Instead, it would suffice if it could determine only those
> tombstones which aren't shared by all responding nodes. E.g. if node #1
> responds with tombstones (A, B, D), node #2 responds with tombstones (A, B,
> C), and node #3 responds with tombstones (A, B, C, D),
> {{RangeSliceResponseResolver}} need only actually know about the tombstones
> (C, D): All of the responders will already have removed any relevant data for
> the tombstones (A, B) from their individual responses.
> Arranging for {{RangeSliceResponseResolver}} to discover only the non-common
> tombstones can be accomplished by using a variant of the "invertible bloom
> filter" data structure described in e.g.
> http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf. Using
> an invertible Bloom filter, each responding node can build a (roughly) fixed
> size data structure holding a representation of all the tombstones that node
> encounters. The coordinator node can then combine the invertible Bloom
> filters. If there aren't too many non-common tombstones, the coordinator
> node will be able to enumerate all of them, and so resolve the range slice
> query.
> I see accomplishing this in a few discrete steps:
> 1. Implement an appropriate variant of the "invertible bloom filter". I've
> started this already, and will shortly upload a patch including my
> {{InvertibleBloomReconciler}} implementation. From a black box perspective,
> {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}} demonstrates
> how this data structure and algorithm could be used.
> 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into
> {{TombstoneReconciler}}, a structure for spilling tombstones into. Refactor
> large swaths of {{org.apache.cassandra.db}}'s read path to accomodate the
> possibility of placing tombstones discovered during a read into a
> {{TombstoneReconciler}} instead of returning them within a {{Row}},
> {{List<Row>}}, {{ColumnFamily}}, etc. I've attempted a start on this, though
> this means carrying the {{TombstoneReconciler}} around through most of
> {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other
> places I haven't yet discovered. I'm wondering if I wouldn't be better off
> waiting for CASSANDRA-8099 before starting this step -- a fully iterator flow
> through {{org.apache.cassandra.db}} could make this easier, cleaner, and have
> significantly lower code impact.
> 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters
> for the IBR (table size, hash count, seed), possibly by making these part of
> {{CFMetaData}}. Allow a {{RangeSliceResponse}} to optionally return a
> {{TombstoneReconciler}} in addition to its {{List<Row>}}. Refactor
> {{RangeSliceResponseResolver}} to be capable of handling
> {{TombstoneReconciler}} s. Possibly refactor
> {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses
> contained a {{TombstoneReconciler}}.
> Since the invertible bloom filter is a probabilistic data structure, it is
> possible that resolving a query in this manner could fail. As such, I'm
> proposing that the current read path not make use of the
> {{TombstoneReconciler}} idea unless it would otherwise encounter a
> {{TombstoneOverwhelming}} situation.
> Also, there is an astronomically minute possibility (on the order if 1 /
> 2^120 ) that tombstones reconciled via a {{TombstoneReconciler}} could be bad
> data. Possibly less astronomical if someone were attempting to maliciously
> abuse this functionality, given the fact that I'm using the Murmur3 hash
> instead of a cryptographic hash within {{InvertibleBloomReconciler}}. As
> such, I'm suggesting that read repair not be enabled on this path, just in
> case.
> ----
> Any thoughts? Would there be any interest in an implementation as I've
> described? Any suggestions on who I might ask for help navigating
> {{org.apache.cassandra.db}} if/when I run into trouble there?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)