[GitHub] [kafka-site] ryannedolan commented on a change in pull request #324: KAFKA-8930: MirrorMaker v2 documentation

GitBox Sat, 23 Jan 2021 10:30:51 -0800


ryannedolan commented on a change in pull request #324:
URL: https://github.com/apache/kafka-site/pull/324#discussion_r563180323




##########
File path: 27/ops.html
##########
@@ -553,7 +539,558 @@ <h3 class="anchor-heading"><a id="datacenters" 
class="anchor-link"></a><a href="
   <p>
   It is generally <i>not</i> advisable to run a <i>single</i> Kafka cluster 
that spans multiple datacenters over a high-latency link. This will incur very 
high replication latency both for Kafka writes and ZooKeeper writes, and 
neither Kafka nor ZooKeeper will remain available in all locations if the 
network between locations is unavailable.
 
-  <h3 class="anchor-heading"><a id="config" class="anchor-link"></a><a 
href="#config">6.3 Kafka Configuration</a></h3>
+  <h3 class="anchor-heading"><a id="georeplication" class="anchor-link"></a><a 
href="#georeplication">6.3 Geo-Replication (Cross-Cluster Data 
Mirroring)</a></h3>
+
+  <h4 class="anchor-heading"><a id="georeplication-overview" 
class="anchor-link"></a><a href="#georeplication-overview">Geo-Replication 
Overview</a></h4>
+
+  <p>
+    Kafka administrators can define data flows that cross the boundaries of 
individual Kafka clusters, data centers, or geo-regions. Such event streaming 
setups are often needed for organizational, technical, or legal requirements. 
Common scenarios include:
+  </p>
+
+  <ul>
+    <li>Geo-replication</li>
+    <li>Disaster recovery</li>
+    <li>Feeding edge clusters into a central, aggregate cluster</li>
+    <li>Physical isolation of clusters (such as production vs. testing)</li>
+    <li>Cloud migration or hybrid cloud deployments</li>
+    <li>Legal and compliance requirements</li>
+  </ul>
+
+  <p>
+    Administrators can set up such inter-cluster data flows with Kafka's 
MirrorMaker (version 2), a tool to replicate data between different Kafka 
environments in a streaming manner. MirrorMaker is built on top of the Kafka 
Connect framework and supports features such as:
+  </p>
+
+  <ul>
+    <li>Replicates topics (data plus configurations)</li>
+    <li>Replicates consumer groups including offsets to migrate applications 
between clusters</li>
+    <li>Replicates ACLs</li>
+    <li>Preserves partitioning</li>
+    <li>Automatically detects new topics and partitions</li>
+    <li>Provides a wide range of metrics, such as end-to-end replication 
latency across multiple data centers/clusters</li>
+    <li>Fault-tolerant and horizontally scalable operations</li>
+  </ul>
+
+  <p>
+  <em>Note: Geo-replication with MirrorMaker replicates data across Kafka 
clusters. This inter-cluster replication is different from Kafka's <a 
href="#replication">intra-cluster replication</a>, which replicates data within 
the same Kafka cluster.</em>
+  </p>
+
+  <h4 class="anchor-heading"><a id="georeplication-flows" 
class="anchor-link"></a><a href="#georeplication-flows">What Are Replication 
Flows</a></h4>
+
+  <p>
+    With MirrorMaker, Kafka administrators can replicate topics, topic 
configurations, consumer groups and their offsets, and ACLs from one or more 
source Kafka clusters to one or more target Kafka clusters, i.e., across 
cluster environments. In a nutshell, MirrorMaker consumes data from the source 
cluster with source connectors, and then replicates the data by producing to 
the target cluster with sink connectors.

Review comment:
       "with sink connectors" is not true at the moment, since I don't think we 
have a sink connector yet. And even when we do, it would usually be sufficient 
to use a source _or_ a sink connector. There are certainly cases where this 
sentence is true, but I think it's misleading as a general statement.
   
   Maybe "In a nutshell, MirrorMaker uses Connectors to consume from source 
clusters and produce to target clusters" or something like that.

##########
File path: 27/ops.html
##########
@@ -553,7 +539,558 @@ <h3 class="anchor-heading"><a id="datacenters" 
class="anchor-link"></a><a href="
   <p>
   It is generally <i>not</i> advisable to run a <i>single</i> Kafka cluster 
that spans multiple datacenters over a high-latency link. This will incur very 
high replication latency both for Kafka writes and ZooKeeper writes, and 
neither Kafka nor ZooKeeper will remain available in all locations if the 
network between locations is unavailable.
 
-  <h3 class="anchor-heading"><a id="config" class="anchor-link"></a><a 
href="#config">6.3 Kafka Configuration</a></h3>
+  <h3 class="anchor-heading"><a id="georeplication" class="anchor-link"></a><a 
href="#georeplication">6.3 Geo-Replication (Cross-Cluster Data 
Mirroring)</a></h3>
+
+  <h4 class="anchor-heading"><a id="georeplication-overview" 
class="anchor-link"></a><a href="#georeplication-overview">Geo-Replication 
Overview</a></h4>
+
+  <p>
+    Kafka administrators can define data flows that cross the boundaries of 
individual Kafka clusters, data centers, or geo-regions. Such event streaming 
setups are often needed for organizational, technical, or legal requirements. 
Common scenarios include:
+  </p>
+
+  <ul>
+    <li>Geo-replication</li>
+    <li>Disaster recovery</li>
+    <li>Feeding edge clusters into a central, aggregate cluster</li>
+    <li>Physical isolation of clusters (such as production vs. testing)</li>
+    <li>Cloud migration or hybrid cloud deployments</li>
+    <li>Legal and compliance requirements</li>
+  </ul>
+
+  <p>
+    Administrators can set up such inter-cluster data flows with Kafka's 
MirrorMaker (version 2), a tool to replicate data between different Kafka 
environments in a streaming manner. MirrorMaker is built on top of the Kafka 
Connect framework and supports features such as:
+  </p>
+
+  <ul>
+    <li>Replicates topics (data plus configurations)</li>
+    <li>Replicates consumer groups including offsets to migrate applications 
between clusters</li>
+    <li>Replicates ACLs</li>
+    <li>Preserves partitioning</li>
+    <li>Automatically detects new topics and partitions</li>
+    <li>Provides a wide range of metrics, such as end-to-end replication 
latency across multiple data centers/clusters</li>
+    <li>Fault-tolerant and horizontally scalable operations</li>
+  </ul>
+
+  <p>
+  <em>Note: Geo-replication with MirrorMaker replicates data across Kafka 
clusters. This inter-cluster replication is different from Kafka's <a 
href="#replication">intra-cluster replication</a>, which replicates data within 
the same Kafka cluster.</em>
+  </p>
+
+  <h4 class="anchor-heading"><a id="georeplication-flows" 
class="anchor-link"></a><a href="#georeplication-flows">What Are Replication 
Flows</a></h4>
+
+  <p>
+    With MirrorMaker, Kafka administrators can replicate topics, topic 
configurations, consumer groups and their offsets, and ACLs from one or more 
source Kafka clusters to one or more target Kafka clusters, i.e., across 
cluster environments. In a nutshell, MirrorMaker consumes data from the source 
cluster with source connectors, and then replicates the data by producing to 
the target cluster with sink connectors.
+  </p>
+
+  <p>
+    These directional flows from source to target clusters are called 
replication flows. They are defined with the format 
<code>{source_cluster}->{target_cluster}</code> in the MirrorMaker 
configuration file as described later. Administrators can create complex 
replication topologies based on these flows.
+  </p>
+
+  <p>
+    Here are some example patterns:
+  </p>
+
+  <ul>
+    <li>Active/Active high availability deployments: <code>A->B, 
B->A</code></li>
+    <li>Active/Passive or Active/Standby high availability deployments: 
<code>A->B</code></li>
+    <li>Aggregation (e.g., from many clusters to one): <code>A->K, B->K, 
C->K</code></li>
+    <li>Fan-out (e.g., from one to many clusters): <code>K->A, K->B, 
K->C</code></li>
+    <li>Forwarding: <code>A->B, B->C, C->D</code></li>
+  </ul>
+
+  <p>
+    By default, a flow replicates all topics and consumer groups. However, 
each replication flow can be configured independently. For instance, you can 
define that only specific topics or consumer groups are replicated from the 
source cluster to the target cluster.
+  </p>
+
+  <p>
+    Here is a first example of how to configure data replication from a 
<code>primary</code> cluster to a <code>secondary</code> cluster (an 
active/passive setup):
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># Basic settings
+clusters = primary, secondary
+primary.bootstrap.servers = broker3-primary:9092
+secondary.bootstrap.servers = broker5-secondary:9092
+
+# Define replication flows
+primary->secondary.enabled = true
+primary->secondary.topics = foobar-topic, quux-.*
+</code></pre>
+
+
+  <h4 class="anchor-heading"><a id="georeplication-mirrormaker" 
class="anchor-link"></a><a href="#georeplication-mirrormaker">Configuring 
Geo-Replication</a></h4>
+
+  <p>
+    The following sections describe how to configure and run a dedicated 
MirrorMaker cluster. If you want to run MirrorMaker within an existing Kafka 
Connect cluster or other supported deployment setups, please refer to <a 
href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0">KIP-382:
 MirrorMaker 2.0</a> and be aware that the names of configuration settings may 
vary between deployment modes.
+  </p>
+
+  <p>
+    Beyond what's covered in the following sections, further examples and 
information on configuration settings are available at:
+  </p>
+
+  <ul>
+         <li><a 
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorMakerConfig.java">MirrorMakerConfig</a>,
 <a 
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorConnectorConfig.java">MirrorConnectorConfig</a></li>
+         <li><a 
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/DefaultTopicFilter.java">DefaultTopicFilter</a>
 for topics, <a 
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/DefaultGroupFilter.java">DefaultGroupFilter</a>
 for consumer groups</li>
+         <li>Example configuration settings in <a 
href="https://github.com/apache/kafka/blob/trunk/config/connect-mirror-maker.properties">connect-mirror-maker.properties</a>,
 <a 
href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0">KIP-382:
 MirrorMaker 2.0</a></li>
+  </ul>
+
+  <h5 class="anchor-heading"><a id="georeplication-config-syntax" 
class="anchor-link"></a><a href="#georeplication-config-syntax">Configuration 
File Syntax</a></h5>
+
+  <p>
+    The MirrorMaker configuration file is typically named 
<code>connect-mirror-maker.properties</code>. You can configure a variety of 
components in this file:
+  </p>
+
+  <ul>
+    <li>MirrorMaker settings: global settings including cluster definitions 
(aliases), plus custom settings per replication flow</li>
+    <li>Kafka Connect and connector settings</li>
+    <li>Kafka producer, consumer, and admin client settings</li>
+  </ul>
+
+  <p>
+    Example: Define MirrorMaker settings (explained in more detail later).
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># Global settings
+# Define cluster aliases
+clusters = us-west, us-east
+us-west.bootstrap.servers = broker3-west:9092
+us-east.bootstrap.servers = broker5-east:9092
+
+# Replicate all topics by default
+topics = .*
+
+# Specific replication flow settings (here: flow from us-west to us-east)
+us-west->us-east.enabled = true
+# Override the default above
+us-west->us-east.topics = foo.*, bar.*
+</code></pre>
+
+  <p>
+    MirrorMaker is based on the Kafka Connect framework. Any Kafka Connect, 
source connector, and sink connector settings as described in the <a 
href="#connectconfigs">documentation chapter on Kafka Connect</a> can be used 
directly in the MirrorMaker configuration, without having to change or prefix 
the name of the configuration setting.
+  </p>
+
+  <p>
+    Example: Define custom Kafka Connect settings to be used by MirrorMaker.
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># Setting Kafka Connect 
defaults for MirrorMaker
+tasks.max = 5
+</code></pre>
+
+  <p>
+  Most of the default Kafka Connect settings work well for MirrorMaker 
out-of-the-box, with the exception of <code>tasks.max</code>. In order to 
evenly distribute the workload across more than one MirrorMaker process, it is 
recommended to set <code>tasks.max</code> to at least <code>2</code> 
(preferably higher) depending on the available hardware resources and the total 
number of topic-partitions to be replicated.
+  </p>
+
+  <p>
+  You can further customize MirrorMaker's Kafka Connect settings <em>per 
source or target cluster</em> (more precisely, you can specify Kafka Connect 
worker-level configuration settings "per connector"). Use the format of 
<code>{cluster}.{config_name}</code> in the MirrorMaker configuration file.
+  </p>
+
+  <p>
+    Example: Define custom connector settings for the <code>us-west</code> 
cluster.
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># us-west custom settings
+us-west.offset.storage.topic = my-mirrormaker-offsets
+</code></pre>
+
+  <p>
+    MirrorMaker internally uses the Kafka producer, consumer, and admin 
clients. Custom settings for these clients are often needed. To override the 
defaults, use the following format in the MirrorMaker configuration file:
+  </p>
+
+  <ul>
+    <li><code>{source}.consumer.{consumer_config_name}</code></li>
+    <li><code>{target}.producer.{producer_config_name}</code></li>
+    <li><code>{source_or_target}.admin.{admin_config_name}</code></li>
+  </ul>
+
+  <p>
+    Example: Define custom producer, consumer, admin client settings.
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># us-west cluster (from 
which to consume)
+us-west.consumer.isolation.level = read_committed
+us-west.admin.bootstrap.servers = broker57-primary:9092
+
+# us-east cluster (to which to produce)
+us-east.producer.compression.type = gzip
+us-east.producer.buffer.memory = 32768
+us-east.admin.bootstrap.servers = broker8-secondary:9092
+</code></pre>
+
+  <h5 class="anchor-heading"><a id="georeplication-flow-create" 
class="anchor-link"></a><a href="#georeplication-flow-create">Creating and 
Enabling Replication Flows</a></h5>
+
+  <p>
+    To define a replication flow, you must first define the respective source 
and target Kafka clusters in the MirrorMaker configuration file.
+  </p>
+
+  <ul>
+    <li><code>clusters</code> (required): comma-separated list of Kafka 
cluster "aliases"</li>
+    <li><code>{clusterAlias}.bootstrap.servers</code> (required): connection 
information for the specific cluster; comma-separated list of "bootstrap" Kafka 
brokers</li>
+  </ul>
+
+  <p>
+    Example: Define two cluster aliases <code>primary</code> and 
<code>secondary</code>, including their connection information.
+  </p>
+
+<pre class="line-numbers"><code class="language-text">clusters = primary, 
secondary
+primary.bootstrap.servers = broker10-primary:9092,broker11-primary:9092
+secondary.bootstrap.servers = broker5-secondary:9092,broker6-secondary:9092
+</code></pre>
+
+  <p>
+    Secondly, you must explicitly enable individual replication flows with 
<code>{source}->{target}.enabled = true</code> as needed. Remember that flows 
are directional: if you need two-way (bidirectional) replication, you must 
enable flows in both directions.
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># Enable replication 
from primary to secondary
+primary->secondary.enabled = true
+</code></pre>
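+
+  <p>
+    Example: A minimal sketch of two-way (bidirectional) replication, assuming 
the <code>primary</code> and <code>secondary</code> cluster aliases defined 
above. Each direction is a separate flow and must be enabled on its own.
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># Illustrative only: 
enable one flow per direction
+primary->secondary.enabled = true
+secondary->primary.enabled = true
+</code></pre>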
+
+  <p>
+    By default, a replication flow will replicate all but a few special topics 
and consumer groups from the source cluster to the target cluster, and 
automatically detect any newly created topics and groups. The names of 
replicated topics in the target cluster will be prefixed with the name of the 
source cluster (see section further below). For example, the topic 
<code>foo</code> in the source cluster <code>us-west</code> would be replicated 
to a topic named <code>us-west.foo</code> in the target cluster 
<code>us-east</code>.
+  </p>
+
+  <p>
+    The subsequent sections explain how to customize this basic setup 
according to your needs.
+  </p>
+
+  <h5 class="anchor-heading"><a id="georeplication-flow-configure" 
class="anchor-link"></a><a href="#georeplication-flow-configure">Configuring 
Replication Flows</a></h5>
+
+  <p>
+The configuration of a replication flow is a combination of top-level default 
settings (e.g., <code>topics</code>), on top of which flow-specific settings, 
if any, are applied (e.g., <code>us-west->us-east.topics</code>). To change the 
top-level defaults, add the respective top-level setting to the MirrorMaker 
configuration file. To override the defaults for a specific replication flow 
only, use the syntax format <code>{source}->{target}.{config.name}</code>.
+  </p>
+
+  <p>
+    The most important settings are:
+  </p>
+
+  <ul>
+    <li><code>topics</code>: list of topics or a regular expression that 
defines which topics in the source cluster to replicate (default: <code>topics 
= .*</code>)
+    <li><code>topics.exclude</code>: list of topics or a regular expression to 
subsequently exclude topics that were matched by the <code>topics</code> 
setting (default: <code>topics.exclude = .*[\-\.]internal, .*\.replica, 
__.*</code>)
+    <li><code>groups</code>: list of consumer groups or a regular expression 
that defines which consumer groups in the source cluster to replicate (default: 
<code>groups = .*</code>)
+    <li><code>groups.exclude</code>: list of consumer groups or a regular 
expression to subsequently exclude consumer groups that were matched by the 
<code>groups</code> setting (default: <code>groups.exclude = 
console-consumer-.*, connect-.*, __.*</code>)
+    <li><code>{source}->{target}.enabled</code>: set to <code>true</code> to 
enable the replication flow (default: <code>false</code>)
+  </ul>
+
+  <p>
+    Example:
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># Custom top-level 
defaults that apply to all replication flows
+topics = .*
+groups = consumer-group1, consumer-group2
+
+# Don't forget to enable a flow!
+us-west->us-east.enabled = true
+
+# Custom settings for specific replication flows
+us-west->us-east.topics = foo.*
+us-west->us-east.groups = bar.*
+us-west->us-east.emit.heartbeats.enabled = false
+</code></pre>
+
+  <p>
+    Additional configuration settings are supported, some of which are listed 
below. In most cases, you can leave these settings at their default values. See 
<a 
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorMakerConfig.java">MirrorMakerConfig</a>
 and <a 
href="https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorConnectorConfig.java">MirrorConnectorConfig</a>
 for further details.
+  </p>
+
+  <ul>
+    <li><code>refresh.topics.enabled</code>: whether to check for new topics 
in the source cluster periodically (default: true)
+    <li><code>refresh.topics.interval.seconds</code>: frequency of checking 
for new topics in the source cluster; lower values than the default may lead to 
performance degradation (default: 600, every ten minutes)
+    <li><code>refresh.groups.enabled</code>: whether to check for new consumer 
groups in the source cluster periodically (default: true)
+    <li><code>refresh.groups.interval.seconds</code>: frequency of checking 
for new consumer groups in the source cluster; lower values than the default 
may lead to performance degradation (default: 600, every ten minutes)
+    <li><code>sync.topic.configs.enabled</code>: whether to replicate topic 
configurations from the source cluster (default: true)
+    <li><code>sync.topic.acls.enabled</code>: whether to sync ACLs from the 
source cluster (default: true)
+    <li><code>emit.heartbeats.enabled</code>: whether to emit heartbeats 
periodically (default: true)
+    <li><code>emit.heartbeats.interval.seconds</code>: frequency at which 
heartbeats are emitted (default: 5, every five seconds)
+    <li><code>heartbeats.topic.replication.factor</code>: replication factor 
of MirrorMaker's internal heartbeat topics (default: 3)
+    <li><code>emit.checkpoints.enabled</code>: whether to emit MirrorMaker's 
consumer offsets periodically (default: true)
+    <li><code>emit.checkpoints.interval.seconds</code>: frequency at which 
checkpoints are emitted (default: 60, every minute)
+    <li><code>checkpoints.topic.replication.factor</code>: replication factor 
of MirrorMaker's internal checkpoints topics (default: 3)
+    <li><code>sync.group.offsets.enabled</code>: whether to periodically write 
the translated offsets of replicated consumer groups (in the source cluster) to 
<code>__consumer_offsets</code> topic in target cluster, as long as no active 
consumers in that group are connected to the target cluster (default: true)
+    <li><code>sync.group.offsets.interval.seconds</code>: frequency at which 
consumer group offsets are synced (default: 60, every minute)
+    <li><code>offset-syncs.topic.replication.factor</code>: replication factor 
of MirrorMaker's internal offset-sync topics (default: 3)
+  </ul>
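+
+  <p>
+    Example: A sketch that overrides a few of the settings listed above. The 
values are arbitrary and only illustrate the syntax; for instance, a replication 
factor of <code>1</code> is assumed here to suit a single-broker test cluster 
and is not a recommendation.
+  </p>
+
+<pre class="line-numbers"><code class="language-text"># Illustrative values 
only
+# Check for new topics every five minutes (applies to all flows)
+refresh.topics.interval.seconds = 300
+
+# Do not sync ACLs for this particular flow
+us-west->us-east.sync.topic.acls.enabled = false
+
+# Internal topics on a single-broker test cluster
+heartbeats.topic.replication.factor = 1
+checkpoints.topic.replication.factor = 1
+offset-syncs.topic.replication.factor = 1
+</code></pre>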
+
+  <h5 class="anchor-heading"><a id="georeplication-flow-secure" 
class="anchor-link"></a><a href="#georeplication-flow-secure">Securing 
Replication Flows</a></h5>
+
+  <p>
+    MirrorMaker supports the same <a href="#connectconfigs">security settings 
as Kafka Connect</a>, so please refer to the linked section for further 
information.
+  </p>
+
+  <p>
+    Example: Encrypt communication between MirrorMaker and the 
<code>us-east</code> cluster.
+  </p>
+
+<pre class="line-numbers"><code 
class="language-text">us-east.security.protocol=SSL
+us-east.ssl.truststore.location=/path/to/truststore.jks
+us-east.ssl.truststore.password=my-secret-password
+us-east.ssl.keystore.location=/path/to/keystore.jks
+us-east.ssl.keystore.password=my-secret-password
+us-east.ssl.key.password=my-secret-password
+</code></pre>
+
+  <h5 class="anchor-heading"><a id="georeplication-topic-naming" 
class="anchor-link"></a><a href="#georeplication-topic-naming">Custom Naming of 
Replicated Topics in Target Clusters</a></h5>
+
+  <p>
+    Replicated topics in a target cluster—sometimes called <em>remote</em> 
topics—are renamed according to a replication policy. MirrorMaker uses this 
policy to ensure that events (aka records, messages) from different clusters 
are not written to the same topic-partition. By default as per <a 
href="https://github.com/apache/kafka/blob/trunk/connect/mirror-client/src/main/java/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.java">DefaultReplicationPolicy</a>,
 the names of replicated topics in the target clusters have the format 
<code>{source}.{source_topic_name}</code>:

Review comment:
       I think "records" is more prevalent in Kafka docs vs "events". Maybe 
verify that and stick with whatever the rest of the docs use. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
