[ 
https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294688#comment-14294688
 ] 

Benjamin Janssen commented on CASSANDRA-8542:
---------------------------------------------

Anyone had a chance to look into this ticket yet?

> Make get_range_slices and related succeed most of the time on tombstone heavy 
> column families
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8542
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Shawn Walker
>            Priority: Minor
>         Attachments: trunk-8542-InvertibleBloomReconciler.txt
>
>
> I've got a ColumnFamily in which some rows end up being used in a queue-like 
> fashion, and where many tombstones tend to pile up at the left end of the 
> row.  As a result, I run into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) 
> when trying to list the columns of said rows.
> Please don't yell at me too loudly. I know the issue I'm describing will 
> generally be considered as being due to poor use of Cassandra.  I understand 
> the rationale of the current behavior, and am hesitant to play with fire by 
> increasing {{tombstone_fail_threshold}} to a high value.  Instead, what I 
> propose is an alternate method of resolving such queries on the read path.
> ----
> This is based on the following observation: on the coordinator node, when 
> {{RangeSliceResponseResolver}} is resolving a range slice query, it needn't 
> be aware of all tombstones that all responding nodes have within the 
> specified range.  Instead, it would suffice if it could determine only those 
> tombstones which aren't shared by all responding nodes. E.g. if node #1 
> responds with tombstones (A, B, D), node #2 responds with tombstones (A, B, 
> C), and node #3 responds with tombstones (A, B, C, D), 
> {{RangeSliceResponseResolver}} need only actually know about the tombstones 
> (C, D): All of the responders will already have removed any relevant data for 
> the tombstones (A, B) from their individual responses.
> Arranging for {{RangeSliceResponseResolver}} to discover only the non-common 
> tombstones can be accomplished by using a variant of the "invertible bloom 
> filter" data structure described in e.g. 
> http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf.  Using 
> an invertible Bloom filter, each responding node can build a (roughly) fixed 
> size data structure holding a representation of all the tombstones that node 
> encounters.  The coordinator node can then combine the invertible Bloom 
> filters.  If there aren't too many non-common tombstones, the coordinator 
> node will be able to enumerate all of them, and so resolve the range slice 
> query.
> I see accomplishing this in a few discrete steps:
> 1. Implement an appropriate variant of the "invertible bloom filter".  I've 
> started this already, and will shortly upload a patch including my 
> {{InvertibleBloomReconciler}} implementation.  From a black box perspective, 
> {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}} demonstrates 
> how this data structure and algorithm could be used.
> 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into 
> {{TombstoneReconciler}}, a structure for spilling tombstones into.  Refactor 
> large swaths of {{org.apache.cassandra.db}}'s read path to accomodate the 
> possibility of placing tombstones discovered during a read into a 
> {{TombstoneReconciler}} instead of returning them within a {{Row}}, 
> {{List<Row>}}, {{ColumnFamily}}, etc.  I've attempted a start on this, though 
> this means carrying the {{TombstoneReconciler}} around through most of 
> {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other 
> places I haven't yet discovered.  I'm wondering if I wouldn't be better off 
> waiting for CASSANDRA-8099 before starting this step -- a fully iterator flow 
> through {{org.apache.cassandra.db}} could make this easier, cleaner, and have 
> significantly lower code impact.
> 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters 
> for the IBR (table size, hash count, seed), possibly by making these part of 
> {{CFMetaData}}.  Allow a {{RangeSliceResponse}} to optionally return a 
> {{TombstoneReconciler}} in addition to its {{List<Row>}}.  Refactor 
> {{RangeSliceResponseResolver}} to be capable of handling 
> {{TombstoneReconciler}} s.  Possibly refactor 
> {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses 
> contained a {{TombstoneReconciler}}.
> Since the invertible bloom filter is a probabilistic data structure, it is 
> possible that resolving a query in this manner could fail.  As such, I'm 
> proposing that the current read path not make use of the 
> {{TombstoneReconciler}} idea unless it would otherwise encounter a 
> {{TombstoneOverwhelming}} situation.
> Also, there is an astronomically minute possibility (on the order if 1 / 
> 2^120 ) that tombstones reconciled via a {{TombstoneReconciler}} could be bad 
> data.  Possibly less astronomical if someone were attempting to maliciously 
> abuse this functionality, given the fact that I'm using the Murmur3 hash 
> instead of a cryptographic hash within {{InvertibleBloomReconciler}}.  As 
> such, I'm suggesting that read repair not be enabled on this path, just in 
> case.
> ----
> Any thoughts?  Would there be any interest in an implementation as I've 
> described?  Any suggestions on who I might ask for help navigating 
> {{org.apache.cassandra.db}} if/when I run into trouble there?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to