[jira] [Commented] (CASSANDRA-8844) Change Data Capture (CDC)

Joshua McKenzie (JIRA) Thu, 24 Dec 2015 04:09:06 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070945#comment-15070945 ]


Joshua McKenzie commented on CASSANDRA-8844:
--------------------------------------------

bq. the user should not have to write to the table in any special or prescribed 
way in order to get the mutations captured. Requiring the client to write at 
CL_QUORUM if CDC is enabled is not okay.
If we're offering an ==RF level guarantee for CDC consistency at the granular, 
per-mutation level, that would mean transparently upgrading all CL_ONE to 
CL_QUORUM and ignoring the client's requested write consistency. We have to 
keep the write level from a CommitLog perspective in sync with the 
CDC-perspective or we run the risk of ending up with the inverse scenario, 
where we have CDC data on a replica that never receives a base mutation (though 
unlikely).

bq. that the consistency/replication of the CDC log now has a completely 
different guarantee than the base table (system, actually
That depends on what the stated goals/features/guarantees of CDC are. This gets 
down to the question of which of the following two things (or both) is a 
requirement of CDC as a feature:
# A one-way data replication mechanism that's fast, lightweight, real-time, and 
integrated, or
# A record of every single mutation to a key that's occurred on a cluster, 
regardless of whether the mutations are on a node that goes down

The original proposal(s) would suggest a relaxation of the guarantee on point 2 
in favor of drastically reducing the complexity and shortening the delivery 
time of the feature.

bq. a situation that Cassandra covers just fine and results in an eventually 
consistent state. However, it will result in CDC loss which is not acceptable.
There are two things here. First, Cassandra covers this situation "just fine" 
through some heavy-weight logical operations and relaxed (i.e. eventual) 
consistency guarantees: Merkle trees, repair, streaming, etc., things that are 
consistently hot spots for code complexity, cluster load, and ops headaches and 
failure. Second, *from the perspective of the replica* the CDC log is an exact 
copy of what's taking place on the replica with regard to mutations to keys.

I say all this not to advocate for the "CDC reflects CL, not history of 
mutations" position, but rather to offer some logical counterpoints and 
consequences to your rather strongly worded statements above.

While it is technically feasible for us to also merkle/sync CDC logs (or use 
some other byte-level synchronization, though I can't think of any other really 
viable approach given potentially small missing mutations in the middle of a 
log), IMO this opens Pandora's box and is better resolved by reflecting the CL 
of a query in the CL of the CDC data. Otherwise, we will have to start tracking 
whether or not CDC-state is synchronized between replicas, reflect that sync in 
the life-cycle of CDC logs on top of whether or not they've yet been consumed 
by a local daemon/consumer, and incorporate a repair mechanism to get CDC 
records in sync between those replicas as well (much like Ariel was alluding to 
above).

At face value, to me it makes a lot more sense to either a) tell users of this 
limitation of CDC (if you write at CL.ONE, you may miss interim CDC-state on 
data if you lose the one node that got it) or b) upgrade mutations on a 
CDC-enabled CF to QUORUM. The ROI of either of these two approaches is far 
greater than the alternative in my opinion.
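
To make the trade-off between (a) and (b) concrete, here is a rough Python 
sketch, not Cassandra code. It assumes a single DC, the usual quorum size of 
floor(RF/2) + 1, and that only the replicas that acked a write are guaranteed 
to hold its CDC record at ack time. All function names are made up for 
illustration.

```python
def quorum(rf):
    """Quorum size for a replication factor: floor(rf / 2) + 1."""
    return rf // 2 + 1


def acks_required(cl, rf):
    """Replicas that must ack before the client sees success (single DC)."""
    return {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[cl]


def survives_single_node_loss(cl, rf):
    """True if at least two replicas are guaranteed to hold the CDC record
    at ack time, so losing any one of them cannot lose the record."""
    return acks_required(cl, rf) >= 2


def effective_cl(requested, cdc_enabled):
    """Option (b): hypothetical upgrade of CL.ONE writes on a CDC-enabled CF."""
    if cdc_enabled and requested == "ONE":
        return "QUORUM"
    return requested


if __name__ == "__main__":
    for cl in ("ONE", "QUORUM", "ALL"):
        print(cl, acks_required(cl, 3), survives_single_node_loss(cl, 3))
```

With RF=3, a CL.ONE write is only guaranteed on one replica when it is acked, 
which is the interim CDC-state that option (a) would document as a limitation. 
Option (b) raises that floor to two replicas at the cost of overriding the 
client's requested CL.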

> Change Data Capture (CDC)
> -------------------------
>
>                 Key: CASSANDRA-8844
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8844
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Coordination, Local Write-Read Paths
>            Reporter: Tupshin Harper
>            Assignee: Joshua McKenzie
>            Priority: Critical
>             Fix For: 3.x
>
>
> "In databases, change data capture (CDC) is a set of software design patterns 
> used to determine (and track) the data that has changed so that action can be 
> taken using the changed data. Also, Change data capture (CDC) is an approach 
> to data integration that is based on the identification, capture and delivery 
> of the changes made to enterprise data sources."
> -Wikipedia
> As Cassandra is increasingly being used as the Source of Record (SoR) for 
> mission critical data in large enterprises, it is increasingly being called 
> upon to act as the central hub of traffic and data flow to other systems. In 
> order to try to address the general need, we (cc [~brianmhess]), propose 
> implementing a simple data logging mechanism to enable per-table CDC patterns.
> h2. The goals:
> # Use CQL as the primary ingestion mechanism, in order to leverage its 
> Consistency Level semantics, and in order to treat it as the single 
> reliable/durable SoR for the data.
> # To provide a mechanism for implementing good and reliable 
> (deliver-at-least-once with possible mechanisms for deliver-exactly-once) 
> continuous semi-realtime feeds of mutations going into a Cassandra cluster.
> # To eliminate the developmental and operational burden of users so that they 
> don't have to do dual writes to other systems.
> # For users that are currently doing batch export from a Cassandra system, 
> give them the opportunity to make that realtime with a minimum of coding.
> h2. The mechanism:
> We propose a durable logging mechanism that functions similar to a commitlog, 
> with the following nuances:
> - Takes place on every node, not just the coordinator, so RF number of copies 
> are logged.
> - Separate log per table.
> - Per-table configuration. Only tables that are specified as CDC_LOG would do 
> any logging.
> - Per DC. We are trying to keep the complexity to a minimum to make this an 
> easy enhancement, but most likely use cases would prefer to only implement 
> CDC logging in one (or a subset) of the DCs that are being replicated to
> - In the critical path of ConsistencyLevel acknowledgment. Just as with the 
> commitlog, failure to write to the CDC log should fail that node's write. If 
> that means the requested consistency level was not met, then clients *should* 
> experience UnavailableExceptions.
> - Be written in a Row-centric manner such that it is easy for consumers to 
> reconstitute rows atomically.
> - Written in a simple format designed to be consumed *directly* by daemons 
> written in non-JVM languages
> h2. Nice-to-haves
> I strongly suspect that the following features will be asked for, but I also 
> believe that they can be deferred to a subsequent release, in order to gauge 
> actual interest.
> - Multiple logs per table. This would make it easy to have multiple 
> "subscribers" to a single table's changes. A workaround would be to create a 
> forking daemon listener, but that's not a great answer.
> - Log filtering. Being able to apply filters, including UDF-based filters, 
> would make Cassandra a much more versatile feeder into other systems, and 
> again, reduce complexity that would otherwise need to be built into the 
> daemons.
> h2. Format and Consumption
> - Cassandra would only write to the CDC log, and never delete from it. 
> - Cleaning up consumed logfiles would be the client daemon's responsibility
> - Logfile size should probably be configurable.
> - Logfiles should be named with a predictable naming schema, making it 
> trivial to process them in order.
> - Daemons should be able to checkpoint their work, and resume from where they 
> left off. This means they would have to leave some file artifact in the CDC 
> log's directory.
> - A sophisticated daemon should be able to be written that could 
> -- Catch up, in written-order, even when it is multiple logfiles behind in 
> processing
> -- Be able to continuously "tail" the most recent logfile and get 
> low-latency(ms?) access to the data as it is written.
> h2. Alternate approach
> In order to make consuming a change log easy and efficient to do with low 
> latency, the following could supplement the approach outlined above
> - Instead of writing to a logfile, by default, Cassandra could expose a 
> socket for a daemon to connect to, and from which it could pull each row.
> - Cassandra would have a limited buffer for storing rows, should the listener 
> become backlogged, but it would immediately spill to disk in that case, never 
> incurring large in-memory costs.
> h2. Additional consumption possibility
> With all of the above still relevant:
> - Instead of (or in addition to) using the other logging mechanisms, use CQL 
> transport itself as a logger.
> - Extend the CQL protocol slightly so that rows of data can be returned to a 
> listener that didn't explicitly make a query, but instead registered itself 
> with Cassandra as a listener for a particular event type, and in this case, 
> the event type would be anything that would otherwise go to a CDC log.
> - If there is no listener for the event type associated with that log, or if 
> that listener gets backlogged, the rows will again spill to the persistent 
> storage.
> h2. Possible Syntax
> {code:sql}
> CREATE TABLE ... WITH CDC LOG
> {code}
> Pros: No syntax extensions
> Cons: Doesn't make it easy to capture the various permutations (I'm happy to 
> be proven wrong) of per-DC logging. Also, the hypothetical multiple logs per 
> table would break this
> {code:sql}
> CREATE CDC_LOG mylog ON mytable WHERE MyUdf(mycol1, mycol2) = 5 with 
> DCs={'dc1','dc3'}
> {code}
> Pros: Expressive and allows for easy DDL management of all aspects of CDC
> Cons: Syntax additions. Added complexity, partly for features that might not 
> be implemented
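
The consumption model in the quoted description (predictably named logfiles 
processed in order, a checkpoint artifact in the CDC log directory, cleanup 
owned by the daemon) could look roughly like the Python sketch below. The 
description does not define an on-disk format, so the `cdc-NNNNNN.log` naming, 
the newline-delimited records, and the `consumer.checkpoint` file are all 
assumptions made purely for illustration.

```python
import os
from pathlib import Path

CHECKPOINT_NAME = "consumer.checkpoint"  # hypothetical checkpoint artifact


def list_logfiles(cdc_dir):
    # Assumes names like cdc-000001.log, so name order equals write order.
    return sorted(cdc_dir.glob("cdc-*.log"))


def load_checkpoint(cdc_dir):
    path = cdc_dir / CHECKPOINT_NAME
    if not path.exists():
        return "", 0
    name, offset = path.read_text().split()
    return name, int(offset)


def save_checkpoint(cdc_dir, name, offset):
    # Write to a temp file and rename, so a crash never leaves a torn checkpoint.
    tmp = cdc_dir / (CHECKPOINT_NAME + ".tmp")
    tmp.write_text(f"{name} {offset}\n")
    os.replace(tmp, cdc_dir / CHECKPOINT_NAME)


def consume(cdc_dir, handle):
    """One catch-up pass: process every complete record after the checkpoint.

    A record may be handled twice if the daemon dies between handle() and
    save_checkpoint(), i.e. this is deliver-at-least-once.
    """
    done_name, done_offset = load_checkpoint(cdc_dir)
    logfiles = list_logfiles(cdc_dir)
    for index, logfile in enumerate(logfiles):
        if logfile.name < done_name:
            continue  # finished in an earlier pass
        start = done_offset if logfile.name == done_name else 0
        with logfile.open("rb") as f:
            f.seek(start)
            while True:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file or a partially written record
                handle(line[:-1])
                save_checkpoint(cdc_dir, logfile.name, f.tell())
        # Assumes only the newest logfile is still being written to; older,
        # fully consumed files are the daemon's to delete.
        if index < len(logfiles) - 1:
            logfile.unlink()
```

Calling `consume(Path("/path/to/cdc"), handler)` in a loop with a short sleep 
would approximate the "tail" behaviour; a real daemon would more likely use 
file-change notifications and a binary record format.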



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
