[
https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shawn Walker updated CASSANDRA-8542:
------------------------------------
Comment: was deleted
(was: As written, the current {{InvertibleBloomReconciler}} uses finite field
arithmetic over the prime field {{GF[2^30 - 35]}}. This is a trade off:
* (pro) Addition, multiplication, subtraction are fairly fast, and
comprehensible.
* (pro) During reconciliation, we can actually determine what sources produced
each non-common record, rather than just discovering what the differences were.
* (pro) We can reconcile up to 28 sources at once.
* (con) The way I've encoded from bytes into {{GF[2^30 - 35]}} is not space
efficient, creating 4+ bytes of output for each 3 bytes of input.
* (con) Division is a bit slow.
Similar techniques could be used with a different finite field. {{GF[2^16]}}
is a possible choice. For it:
* (pro) Addition and subtraction become just XOR, and so should be blazingly
fast.
* (pro) Encoding is trivial, and has nearly perfect space efficiency: 1-2 bytes
of padding per record should suffice.
* (con) We can only reconcile 16 sources at once.
* (con) Performing multiplication/division in {{GF[2^16]}} can be a bit arcane.
That said, it can be done quickly with a few lookup tables.* (con) During
reconciliation we wouldn't be able to determine what sources produced each
non-common record, only which records are the non-common records
)
> Make get_range_slices and related succeed most of the time on tombstone heavy
> column families
> ---------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-8542
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8542
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Shawn Walker
> Priority: Minor
> Attachments: trunk-8542-InvertibleBloomReconciler.txt
>
>
> I've got a ColumnFamily in which some rows end up being used in a queue-like
> fashion, and where many tombstones tend to pile up at the left end of the
> row. As a result, I run into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117)
> when trying to list the columns of said rows.
> Please don't yell at me too loudly. I know the issue I'm describing will
> generally be considered as being due to poor use of Cassandra. I understand
> the rationale of the current behavior, and am hesitant to play with fire by
> increasing {{tombstone_fail_threshold}} to a high value. Instead, what I
> propose is an alternate method of resolving such queries on the read path.
> ----
> This is based on the following observation: on the coordinator node, when
> {{RangeSliceResponseResolver}} is resolving a range slice query, it needn't
> be aware of all tombstones that all responding nodes have within the
> specified range. Instead, it would suffice if it could determine only those
> tombstones which aren't shared by all responding nodes. E.g. if node #1
> responds with tombstones (A, B, D), node #2 responds with tombstones (A, B,
> C), and node #3 responds with tombstones (A, B, C, D),
> {{RangeSliceResponseResolver}} need only actually know about the tombstones
> (C, D): All of the responders will already have removed any relevant data for
> the tombstones (A, B) from their individual responses.
> Arranging for {{RangeSliceResponseResolver}} to discover only the non-common
> tombstones can be accomplished by using a variant of the "invertible bloom
> filter" data structure described in e.g.
> http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf. Using
> an invertible Bloom filter, each responding node can build a (roughly) fixed
> size data structure holding a representation of all the tombstones that node
> encounters. The coordinator node can then combine the invertible Bloom
> filters. If there aren't too many non-common tombstones, the coordinator
> node will be able to enumerate all of them, and so resolve the range slice
> query.
> I see accomplishing this in a few discrete steps:
> 1. Implement an appropriate variant of the "invertible bloom filter". I've
> started this already, and will shortly upload a patch including my
> {{InvertibleBloomReconciler}} implementation. From a black box perspective,
> {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}} demonstrates
> how this data structure and algorithm could be used.
> 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into
> {{TombstoneReconciler}}, a structure for spilling tombstones into. Refactor
> large swaths of {{org.apache.cassandra.db}}'s read path to accomodate the
> possibility of placing tombstones discovered during a read into a
> {{TombstoneReconciler}} instead of returning them within a {{Row}},
> {{List<Row>}}, {{ColumnFamily}}, etc. I've attempted a start on this, though
> this means carrying the {{TombstoneReconciler}} around through most of
> {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other
> places I haven't yet discovered. I'm wondering if I wouldn't be better off
> waiting for CASSANDRA-8099 before starting this step -- a fully iterator flow
> through {{org.apache.cassandra.db}} might make this easier.
> 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters
> for the IBR (table size, hash count, seed), possibly by making these part of
> {{CFMetaData}}. Allow a {{RangeSliceResponse}} to optionally return a
> {{TombstoneReconciler}} in addition to its {{List<Row>}}. Refactor
> {{RangeSliceResponseResolver}} to be capable of handling
> {{TombstoneReconciler}} s. Possibly refactor
> {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses
> contained a {{TombstoneReconciler}}.
> Since the invertible bloom filter is a probabilistic data structure, it is
> possible that resolving a query in this manner could fail. As such, I'm
> proposing that the current read path not make use of the
> {{TombstoneReconciler}} idea unless it would otherwise encounter a
> {{TombstoneOverwhelming}} situation.
> Also, there is an astronomically minute possibility (on the order if 1 /
> 2^120 ) that tombstones reconciled via a {{TombstoneReconciler}} could be bad
> data. Possibly less astronomical if someone were attempting to maliciously
> abuse this functionality, given the fact that I'm using the Murmur3 hash
> instead of a cryptographic hash within {{InvertibleBloomReconciler}}. As
> such, I'm suggesting that read repair not be enabled on this path, just in
> case.
> ----
> Any thoughts? Would there be any interest in an implementation as I've
> described? Any suggestions on who I might ask for help navigating
> {{org.apache.cassandra.db}} if/when I run into trouble there?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)