This is an automated email from the ASF dual-hosted git repository.
zregvart pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/camel-website.git
The following commit(s) were added to refs/heads/master by this push:
new c0f933e CDC blog post
c0f933e is described below
commit c0f933eb1cac65c49b22dc2f83afd47d96b6300b
Author: Federico Valeri <fvaleri@localhost>
AuthorDate: Sun May 3 19:24:08 2020 +0200
CDC blog post
---
.../CdcWithCamelAndDebezium/camel-featured.jpg | Bin 0 -> 122153 bytes
content/blog/CdcWithCamelAndDebezium/index.md | 190 +++++++++++++++++++++
2 files changed, 190 insertions(+)
diff --git a/content/blog/CdcWithCamelAndDebezium/camel-featured.jpg b/content/blog/CdcWithCamelAndDebezium/camel-featured.jpg
new file mode 100644
index 0000000..89bdcff
Binary files /dev/null and b/content/blog/CdcWithCamelAndDebezium/camel-featured.jpg differ
diff --git a/content/blog/CdcWithCamelAndDebezium/index.md b/content/blog/CdcWithCamelAndDebezium/index.md
new file mode 100644
index 0000000..9914bbc
--- /dev/null
+++ b/content/blog/CdcWithCamelAndDebezium/index.md
@@ -0,0 +1,190 @@
+---
+title: "CDC with Camel and Debezium"
+date: 2020-05-04
+draft: false
+authors: [fvaleri]
+categories: ["Usecases"]
+preview: "CDC approaches based on Camel and Debezium."
+---
+
+Change Data Capture (CDC) is a well-established software design pattern for a system that monitors and captures the changes in data, so that other software can respond to those changes.
+
+Using a CDC engine like [Debezium](https://debezium.io) along with the [Camel](https://camel.apache.org) integration framework, we can easily build data pipelines to bridge traditional data stores and new cloud-native event-driven architectures.
+
+The advantages of CDC compared to a simple poll-based or query-based process are:
+
+- *All changes captured*: with polling, intermediary changes (updates, deletes) between two runs of the poll loop may be missed.
+- *Low overhead*: near real-time reaction to data changes avoids increased CPU load due to frequent polling.
+- *No data model impact*: timestamp columns to determine the last update of data are not needed.
+
+There are two main approaches for building a CDC pipeline:
+
+The first approach is *configuration-driven* and runs on top of [KafkaConnect](https://kafka.apache.org/documentation/#connect), the streaming integration platform based on Kafka. The second approach is *code-driven* and is implemented purely with Camel (no Kafka dependency).
+
+While KafkaConnect provides some ready-made *Connectors* for zero-code or low-code integrations, Camel's extensive collection of *Components* (300+) enables you to connect to all kinds of external systems. The great news is that these Components can now be used as Connectors, thanks to a new sub-project called *CamelKafkaConnector* (we will use the SJMS2 component as an example).
+
+## Use case
+
+We want to focus on the technology, so the use case is relatively simple, but includes both routing and transformation logic. The requirement is to stream all new customers from a source table to XML and JSON sink queues.
+```
+                                     |---> (xml-sink-queue)
+(source-table) ---> [cdc-process] ---|
+                                     |---> (json-sink-queue)
+```
+
+## Implementations
+
+No matter what technology you use, the CDC process must run as a single thread to maintain ordering. Since Debezium records the log offset asynchronously, any final consumer of these changes must be idempotent.
+
+Important change event properties:
+
+- `lsn` (offset): the log sequence number that tracks the position in the database WAL (write-ahead log).
+- `txId`: the identifier of the server transaction which caused the event.
+- `ts_ms`: the number of milliseconds since Unix epoch at which the transaction was committed, as measured by the server clock.
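+
+To make these properties concrete, here is a simplified sketch of a change event for an insert. The envelope is trimmed to the fields discussed above, and the column names and values are made up for illustration.
+```json
+{
+  "payload": {
+    "before": null,
+    "after": { "id": 1, "first_name": "John", "last_name": "Doe" },
+    "source": {
+      "connector": "postgresql",
+      "ts_ms": 1588612800000,
+      "txId": 563,
+      "lsn": 24023128
+    },
+    "op": "c"
+  }
+}
+```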
+
+Prerequisites: Postgres 11, OpenJDK 1.8 and Maven 3.5+.
+
+[GET CODE HERE](https://github.com/fvaleri/cdc)
+
+### External systems setup
+
+First of all, you need to start Postgres (the procedure depends on your specific OS).
+
+Then, there is a simple script to create and initialize the database.
+This script can also be used to query the table and produce a stream of changes.
+```sh
+./run.sh --database   # create and initialize the database
+./run.sh --query      # query the customers table
+./run.sh --stream     # produce a continuous stream of changes
+```
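+
+For reference, the source table could look something like the following sketch (the actual DDL is created by the script above; the column names here are assumptions).
+```sql
+CREATE TABLE cdc.customers (
+    id         SERIAL PRIMARY KEY,
+    first_name VARCHAR(255) NOT NULL,
+    last_name  VARCHAR(255) NOT NULL
+);
+```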
+
+Enable Postgres internal transaction log access (required by Debezium).
+```sh
+# postgresql.conf: configure replication slot
+wal_level = logical
+max_wal_senders = 1
+max_replication_slots = 1
+# pg_hba.conf: allow localhost replication for the cdcadmin (Debezium) user
+local replication cdcadmin trust
+host replication cdcadmin 127.0.0.1/32 trust
+host replication cdcadmin ::1/128 trust
+# add replication permission to user and enable previous values
+psql cdcdb
+ALTER ROLE cdcadmin WITH REPLICATION;
+ALTER TABLE cdc.customers REPLICA IDENTITY FULL;
+# restart Postgres
+```
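+
+Later, once a Debezium connector is running, one way to verify the replication setup is to query the replication slots (the slot name depends on the connector configuration).
+```sh
+psql cdcdb -c "SELECT slot_name, plugin, active FROM pg_replication_slots;"
+```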
+
+Start the Artemis broker and open the [web console](http://localhost:8161/console) to check messages.
+```sh
+./run.sh --artemis
+# check status
+ps -ef | grep "[A]rtemis" | wc -l
+```
+
+### KafkaConnect CDC pipeline
+
+This is the KafkaConnect distributed mode architecture that we will configure to fit our use case.
+```
+SourceConnector --> KafkaConnectDM [Worker0JVM(TaskA0, TaskB0, TaskB1),...] --> SinkConnector
+                                        |
+                          Kafka (offsets, config, status)
+```
+
+We will run all components on localhost, but ideally each one should run on a different host (physical, VM or container). Connect workers operate well in containers and in managed environments. Take a look at the [Strimzi](https://strimzi.io) project if you want to know how to easily operate Kafka and KafkaConnect on Kubernetes.
+
+We need a Kafka cluster up and running (3 ZooKeeper + 3 Kafka nodes). This step also downloads and installs all required Connectors (debezium-connector-postgres, camel-sjms2-kafka-connector) and their dependencies.
+```sh
+./run.sh --kafka
+# check status
+ps -ef | grep "[Q]uorumPeerMain" | wc -l
+ps -ef | grep "[K]afka" | wc -l
+```
+
+Now we can start our 3-node KafkaConnect cluster in distributed mode (workers that are configured with matching `group.id` values automatically discover each other and form a cluster).
+```sh
+./run.sh --connect
+# check status
+ps -ef | grep "[C]onnectDistributed" | wc -l
+tail -n100 /tmp/kafka/logs/connect.log
+/tmp/kafka/bin/kafka-topics.sh --zookeeper localhost:2180 --list
+curl localhost:7070/connector-plugins | jq
+```
+
+The infrastructure is ready and we can finally configure our CDC pipeline.
+```sh
+# debezium source task (topic name == serverName.schemaName.tableName)
+curl -sX POST -H "Content-Type: application/json" localhost:7070/connectors -d @connect-cdc/src/main/connectors/dbz-source.json
+
+# jms sink tasks (powered by sjms2 component)
+curl -sX POST -H "Content-Type: application/json" localhost:7070/connectors -d @connect-cdc/src/main/connectors/json-jms-sink.json
+curl -sX POST -H "Content-Type: application/json" localhost:7070/connectors -d @connect-cdc/src/main/connectors/xml-jms-sink.json
+
+# status check
+curl -s localhost:7070/connectors | jq
+curl -s localhost:7070/connectors/dbz-source/status | jq
+curl -s localhost:7070/connectors/json-jms-sink/status | jq
+curl -s localhost:7070/connectors/xml-jms-sink/status | jq
+```
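+
+For orientation, a source connector configuration posted to the REST API has the following general shape. This is a minimal sketch with assumed values; the real configuration files live under `connect-cdc/src/main/connectors` in the linked repository.
+```json
+{
+  "name": "dbz-source",
+  "config": {
+    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
+    "database.hostname": "localhost",
+    "database.port": "5432",
+    "database.user": "cdcadmin",
+    "database.password": "cdcadmin",
+    "database.dbname": "cdcdb",
+    "database.server.name": "localhost",
+    "table.whitelist": "cdc.customers"
+  }
+}
+```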
+
+Produce some more changes and check queues.
+```sh
+./run.sh --stream
+```
+
+### Camel CDC pipeline
+
+This is our Camel CDC pipeline designed using Enterprise Integration Patterns (EIPs).
+```
+                                                                        |--> [format-converter] --> (xml-queue)
+(postgres-db) --> [dbz-endpoint] --> [type-converter] --> [multicast] --|
+                                                                        |--> [format-converter] --> (json-queue)
+```
+
+We use the *Debezium PostgreSQL Component* as the endpoint which creates an event-driven consumer. This is a wrapper around the Debezium embedded engine, which enables CDC without the need to maintain Kafka clusters.
+
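+A rough sketch of such a route in the Java DSL is shown below. The dependencies (camel-main, camel-debezium-postgres, camel-sjms2, Artemis JMS client) and all endpoint options, bean names and queue names are illustrative assumptions; the actual implementation is in the linked repository.
+```java
+import org.apache.activemq.artemis.jms.client.ActiveMQJMSConnectionFactory;
+import org.apache.camel.builder.RouteBuilder;
+import org.apache.camel.main.Main;
+
+public class CdcRoute extends RouteBuilder {
+
+    public static void main(String[] args) throws Exception {
+        Main main = new Main();
+        // expose a JMS ConnectionFactory to the sjms2 component via the registry
+        main.bind("connectionFactory", new ActiveMQJMSConnectionFactory("tcp://localhost:61616"));
+        main.configure().addRoutesBuilder(new CdcRoute());
+        main.run(args);
+    }
+
+    @Override
+    public void configure() {
+        // Debezium embedded engine endpoint; the options here are assumptions for this sketch
+        from("debezium-postgres:cdc"
+                + "?databaseHostname=localhost&databasePort=5432"
+                + "&databaseUser=cdcadmin&databasePassword=cdcadmin"
+                + "&databaseDbname=cdcdb&databaseServerName=localhost"
+                + "&offsetStorageFileName=/tmp/offset.dat")
+            // type-converter step: turn the Debezium record into a plain String
+            .convertBodyTo(String.class)
+            // multicast the same change event to both format branches
+            .multicast().to("direct:xml", "direct:json");
+
+        // format-converter steps: each branch would transform and deliver to its own queue
+        from("direct:xml").to("sjms2:queue:XmlCdcQueue?connectionFactory=#connectionFactory");
+        from("direct:json").to("sjms2:queue:JsonCdcQueue?connectionFactory=#connectionFactory");
+    }
+}
+```
+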
+Compile and run the application.
+```sh
+mvn clean compile exec:java -f ./camel-cdc/pom.xml
+```
+
+Produce some more changes and check queues.
+```sh
+./run.sh --stream
+```
+
+## Considerations
+
+Both CDC solutions are perfectly valid but, depending on your experience, you may find one of them more convenient. If you already have a Kafka cluster, an implicit benefit of using KafkaConnect is that it stores the whole change log in a topic, so you can easily rebuild the application state if needed.
+
+Another benefit of running on top of KafkaConnect in distributed mode is that you get a fault-tolerant CDC process. It is possible to achieve the same by running the Camel process as a [clustered singleton service](https://www.nicolaferraro.me/2017/10/17/creating-clustered-singleton-services-on-kubernetes) on top of Kubernetes.
+
+One thing to be aware of is that Debezium offers better performance because of its access to the internal transaction log, but there is no standard for it, so a change to the database implementation may require a rewrite of the corresponding plugin. This also means that every data source has its own procedure for enabling access to its internal log.
+
+KafkaConnect single message transformations (SMTs) can be chained (like a Unix pipeline) and extended with custom implementations, but they are really designed for simple modifications. Long chains of SMTs are hard to maintain and reason about. Moreover, remember that transformations are synchronous and applied to each message, so heavy processing or external service calls can really slow down the streaming pipeline. An example of a chained SMT configuration is sketched below.
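+
+For illustration, a source connector `config` block could chain two common SMTs like this (a sketch with assumed topic names, not taken from the repository): `ExtractNewRecordState` unwraps the Debezium envelope, then `RegexRouter` renames the target topic.
+```json
+{
+  "transforms": "unwrap,route",
+  "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
+  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
+  "transforms.route.regex": "localhost.cdc.(.*)",
+  "transforms.route.replacement": "$1"
+}
+```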
+
+In cases where you need to do heavy processing, split, enrich or aggregate records, or call external services, you should use a stream processing layer between Connectors, such as Kafka Streams or Camel. Just remember that Kafka Streams creates internal topics and you are forced to put transformed data back into some Kafka topic (data duplication), while this is just an option when using Camel.