[jira] [Commented] (KAFKA-7500) MirrorMaker 2.0 (KIP-382)
[ https://issues.apache.org/jira/browse/KAFKA-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932616#comment-16932616 ]

Qihong Chen commented on KAFKA-7500:

Hi [~ryannedolan],

Thanks for your quick response (y)(y)(y) Now I know what "primary" means.

According to the blog post [A look inside Kafka MirrorMaker2|https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/] by Renu Tewari,
{quote}In MM2 only one connect cluster is needed for all the cross-cluster replications between a pair of datacenters. Now if we simply take a Kafka source and sink connector and deploy them in tandem to do replication, the data would need to hop through an intermediate Kafka cluster. MM2 avoids this unnecessary data copying by a direct passthrough from source to sink.
{quote}
This is exactly what I want! Do you have a release schedule for *SinkConnector* and the *direct passthrough from source to sink* feature?

> MirrorMaker 2.0 (KIP-382)
> -------------------------
>
> Key: KAFKA-7500
> URL: https://issues.apache.org/jira/browse/KAFKA-7500
> Project: Kafka
> Issue Type: New Feature
> Components: KafkaConnect, mirrormaker
> Affects Versions: 2.4.0
> Reporter: Ryanne Dolan
> Assignee: Manikumar
> Priority: Major
> Labels: pull-request-available, ready-to-commit
> Fix For: 2.4.0
>
> Attachments: Active-Active XDCR setup.png
>
>
> Implement a drop-in replacement for MirrorMaker leveraging the Connect framework.
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0]
> [https://github.com/apache/kafka/pull/6295]

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-7500) MirrorMaker 2.0 (KIP-382)
[ https://issues.apache.org/jira/browse/KAFKA-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931743#comment-16931743 ]

Qihong Chen edited comment on KAFKA-7500 at 9/17/19 7:22 PM:

Hi [~ryannedolan], I just listened to your Kafka Power Chat talk, thanks! I have a follow-up question about the last question (from me) that you answered in the talk.

You said you prefer a dedicated MM2 cluster over running MM2 in a Connect cluster, since fewer clusters are needed to replicate among multiple Kafka clusters. But there is no REST API for a dedicated MM2 cluster that reports the status of the replication streams or updates the replication configuration. Any change to the configuration means updating the config files and restarting all MM2 instances, is that right? Or did I miss that a dedicated MM2 cluster does provide a REST API for administration and monitoring? If so, where is it?

If my understanding is correct, we can achieve the same thing with MM2 in a Connect cluster. Assume there are 3 Kafka clusters: A, B, and C. Set up a Connect cluster against C (meaning all topics for the connectors' data and state go to cluster C), then set up MM2 connectors to replicate data and metadata A -> B and B -> A. If this is correct, we can use Kafka cluster C plus the Connect cluster running against it to replicate data among more Kafka clusters, such as A, B, and D, and even more. Of course, this needs a more complicated configuration, which requires a deeper understanding of how the MM2 connectors work. In this scenario, the Connect cluster provides a REST API to administer and monitor all the connectors. This would be useful for people who can't use Streams Replication Manager from Cloudera or Kafka Replicator from Confluent for some reason. Is this right?
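The A/B/C layout proposed in the comment could be sketched as a set of connector definitions for the Connect cluster that runs against C. This is only a minimal sketch under stated assumptions: the connector class names and the `source.cluster.alias` / `target.cluster.alias` style keys follow the examples in KIP-382, while the broker addresses, the connector names and the helper `mirror_connectors` are hypothetical. It does not settle the question raised in the comment, and whether records end up on B rather than on C depends on worker and producer settings that the sketch leaves out.

```python
import json

# Hypothetical bootstrap addresses for the mirrored clusters A and B.
CLUSTERS = {
    "A": "a-broker1:9092",
    "B": "b-broker1:9092",
}

SOURCE_CLASS = "org.apache.kafka.connect.mirror.MirrorSourceConnector"
CHECKPOINT_CLASS = "org.apache.kafka.connect.mirror.MirrorCheckpointConnector"


def mirror_connectors(source, target, topics=".*", groups=".*"):
    """Build connector definitions for one replication flow source -> target."""
    common = {
        "source.cluster.alias": source,
        "target.cluster.alias": target,
        "source.cluster.bootstrap.servers": CLUSTERS[source],
        "target.cluster.bootstrap.servers": CLUSTERS[target],
    }
    return {
        f"{source}-{target}-source": {
            "connector.class": SOURCE_CLASS, "topics": topics, **common},
        f"{source}-{target}-checkpoint": {
            "connector.class": CHECKPOINT_CLASS, "groups": groups, **common},
    }


# Both directions, A -> B and B -> A, hosted on the Connect cluster backed by C.
definitions = {}
for src, dst in [("A", "B"), ("B", "A")]:
    definitions.update(mirror_connectors(src, dst))

if __name__ == "__main__":
    # Each entry could be sent as PUT /connectors/<name>/config to a Connect worker.
    print(json.dumps(definitions, indent=2))
```

Adding a cluster D would then mean adding more `(source, target)` pairs to the loop, which is where the extra configuration effort mentioned in the comment shows up.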
[jira] [Comment Edited] (KAFKA-7500) MirrorMaker 2.0 (KIP-382)
[ https://issues.apache.org/jira/browse/KAFKA-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927863#comment-16927863 ]

Qihong Chen edited comment on KAFKA-7500 at 9/11/19 6:13 PM:

[~ryannedolan] Thanks for all your efforts on MM2. I would appreciate your input on the following questions.

I'm trying to replicate topics from one cluster to another, including the topic data and the related consumers' offsets. But I only see that the topic data was replicated, not the consumer offsets.

Here's my mm2.properties:
{code:java}
clusters = dr1, dr2
dr1.bootstrap.servers = 10.1.0.4:9092,10.1.0.5:9092,10.1.0.6:9092
dr2.bootstrap.servers = 10.2.0.4:9092,10.2.0.5:9092,10.2.0.6:9092

# only allow replication dr1 -> dr2
dr1->dr2.enabled = true
dr1->dr2.topics = test.*
dr1->dr2.groups = test.*
dr1->dr2.emit.heartbeats.enabled = false

dr2->dr1.enabled = false
dr2->dr1.emit.heartbeats.enabled = false
{code}
Here's how I started the MM2 cluster (with dr2 as the nearby cluster):
{code:java}
nohup bin/connect-mirror-maker.sh mm2.properties --clusters dr2 > mm2.log 2>&1 &
{code}
On dr1, there is a topic *test1* and a consumer group *test1grp* for topic test1. On dr2, I found the following topics:
{code:java}
__consumer_offsets
dr1.checkpoints.internal
dr1.test1
heartbeats
mm2-configs.dr1.internal
mm2-offsets.dr1.internal
mm2-status.dr1.internal
{code}
But I couldn't find any consumer groups on dr2 related to consumer group *test1grp*.

Could you please let me know in detail how to migrate consumer group *test1grp* from dr1 to dr2? That is, what command(s) need to be run to set up the offsets for *test1grp* on dr2 before consuming topic *dr1.test1*?

By the way, how do I set up and run this in a Kafka Connect cluster, i.e. how do I set up MirrorSourceConnector and MirrorCheckpointConnector in a Connect cluster? Is there documentation about this?
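The migration asked about here amounts to turning checkpoint records into starting offsets for the group on the target cluster. What follows is a simplified model of that idea, not the MM2 API: the `Checkpoint` tuple, its field layout, the helper names and the sample numbers are all assumptions made for illustration. On the Java side, KIP-382 describes `RemoteClusterUtils.translateOffsets` for this purpose, and a consumer on dr2 would then seek to the offsets it returns.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    """Simplified stand-in for one record of dr1.checkpoints.internal."""
    group: str
    topic: str               # topic name as known on the source cluster
    partition: int
    upstream_offset: int     # committed offset on the source cluster
    downstream_offset: int   # equivalent offset on the target cluster


def remote_topic(source_alias, topic):
    """Default MM2 naming: a remote topic carries the source alias as prefix."""
    return f"{source_alias}.{topic}"


def translate_offsets(checkpoints, source_alias, group):
    """Return {(remote topic, partition): offset} for one consumer group.

    Later checkpoints for the same partition overwrite earlier ones.
    """
    translated = {}
    for cp in checkpoints:
        if cp.group == group:
            key = (remote_topic(source_alias, cp.topic), cp.partition)
            translated[key] = cp.downstream_offset
    return translated


# Made-up checkpoint data, for illustration only.
checkpoints = [
    Checkpoint("test1grp", "test1", 0, 120, 118),
    Checkpoint("test1grp", "test1", 1, 95, 95),
    Checkpoint("othergrp", "test1", 0, 10, 10),
    Checkpoint("test1grp", "test1", 0, 150, 149),
]

offsets = translate_offsets(checkpoints, "dr1", "test1grp")
print(offsets)
```

The keys use the renamed topic *dr1.test1* because that is the topic a consumer on dr2 would subscribe to.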
[jira] [Created] (GEODE-137) Spark Connector: should connect to local GemFire server if possible
Qihong Chen created GEODE-137:

Summary: Spark Connector: should connect to local GemFire server if possible
Key: GEODE-137
URL: https://issues.apache.org/jira/browse/GEODE-137
Project: Geode
Issue Type: Bug
Reporter: Qihong Chen
Assignee: Qihong Chen

DefaultGemFireConnection uses ClientCacheFactory with locator info to create the ClientCache instance. In this case, the ClientCache doesn't connect to the GemFire/Geode server on the same host, if there is one. This causes more network traffic and is less efficient.

ClientCacheFactory can also create a ClientCache based on GemFire server info. Therefore, we can force the ClientCache to connect to the local GemFire server if possible.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
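The selection logic proposed in this issue can be illustrated independently of the GemFire API. The sketch below is written under assumptions: the function names, the `(host, port)` tuples and the host names are hypothetical, and it only decides which endpoints to use. In the connector itself, that decision would presumably be made where the ClientCacheFactory is configured with either server or locator info.

```python
def pick_local_servers(servers, local_host):
    """Return the server endpoints that run on the given host."""
    return [(host, port) for host, port in servers if host == local_host]


def connection_plan(servers, locators, local_host):
    """Prefer servers on the local host; otherwise fall back to the locators.

    Returns a (kind, endpoints) pair where kind is "servers" or "locators".
    """
    local = pick_local_servers(servers, local_host)
    if local:
        return ("servers", local)
    return ("locators", locators)


# Hypothetical cluster layout, for illustration only.
servers = [("node1", 40404), ("node2", 40404), ("node3", 40404)]
locators = [("locator1", 10334)]

print(connection_plan(servers, locators, "node2"))
print(connection_plan(servers, locators, "edge-host"))
```

Falling back to the locators keeps the current behaviour for Spark executors that have no GemFire server on the same machine.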
[jira] [Resolved] (GEODE-120) RDD.saveToGemfire() can not handle big dataset (1M entries per partition)
[ https://issues.apache.org/jira/browse/GEODE-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qihong Chen resolved GEODE-120.

Resolution: Fixed

RDD.saveToGemfire() can not handle big dataset (1M entries per partition)

Key: GEODE-120
URL: https://issues.apache.org/jira/browse/GEODE-120
Project: Geode
Issue Type: Sub-task
Components: core, extensions
Affects Versions: 1.0.0-incubating
Reporter: Qihong Chen
Assignee: Qihong Chen
Original Estimate: 48h
Remaining Estimate: 48h

The connector uses a single region.putAll() call to save each RDD partition, but putAll() doesn't handle big datasets well (such as 1M records). We need to split the dataset into smaller chunks and invoke putAll() for each chunk.
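The chunking approach described in this issue can be sketched as follows. This is an illustration under assumptions, not the fix that was committed: `FakeRegion` is a hypothetical in-memory stand-in for a region, `put_all` mimics the role of region.putAll(), and the chunk size is an arbitrary choice.

```python
from itertools import islice


def chunks(entries, size):
    """Yield dicts of at most `size` entries from an iterable of (key, value) pairs."""
    iterator = iter(entries)
    while True:
        batch = dict(islice(iterator, size))
        if not batch:
            return
        yield batch


class FakeRegion:
    """Hypothetical in-memory stand-in for a region, used only in this sketch."""

    def __init__(self):
        self.data = {}
        self.calls = 0

    def put_all(self, batch):
        self.data.update(batch)
        self.calls += 1


def save_partition(region, entries, chunk_size=10000):
    """Save one partition with several bounded put_all calls instead of one big call."""
    for batch in chunks(entries, chunk_size):
        region.put_all(batch)


region = FakeRegion()
save_partition(region, ((i, f"value-{i}") for i in range(25)), chunk_size=10)
print(region.calls, len(region.data))
```

Because `chunks` pulls from an iterator, a partition never has to be held in memory as a whole; only one chunk is materialized at a time.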
[jira] [Closed] (GEODE-114) There's race condition in DefaultGemFireConnection.getRegionProxy
[ https://issues.apache.org/jira/browse/GEODE-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qihong Chen closed GEODE-114.

There's race condition in DefaultGemFireConnection.getRegionProxy

Key: GEODE-114
URL: https://issues.apache.org/jira/browse/GEODE-114
Project: Geode
Issue Type: Sub-task
Components: core, extensions
Affects Versions: 1.0.0-incubating
Reporter: Qihong Chen
Assignee: Qihong Chen
Fix For: 1.0.0-incubating
Original Estimate: 24h
Remaining Estimate: 24h

When multiple threads try to call getRegionProxy with the same region at the same time, the following exception was thrown:

com.gemstone.gemfire.cache.RegionExistsException: /debs
    at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:2880)
    at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2835)
    at com.gemstone.gemfire.cache.client.internal.ClientRegionFactoryImpl.create(ClientRegionFactoryImpl.java:223)
    at io.pivotal.gemfire.spark.connector.internal.DefaultGemFireConnection.getRegionProxy(DefaultGemFireConnection.scala:87)
    at io.pivotal.gemfire.spark.connector.internal.rdd.GemFirePairRDDWriter.write(GemFireRDDWriter.scala:47)
    at io.pivotal.gemfire.spark.connector.GemFirePairRDDFunctions$$anonfun$saveToGemfire$2.apply(GemFirePairRDDFunctions.scala:24)
    at io.pivotal.gemfire.spark.connector.GemFirePairRDDFunctions$$anonfun$saveToGemfire$2.apply(GemFirePairRDDFunctions.scala:24)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
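The trace above is what a check-then-create sequence produces when two threads pass the check together and both try to create the region. One common remedy is to guard creation with a lock and re-check inside it. The sketch below shows that pattern under assumptions: the `Connection` class, its method names and the placeholder region object are hypothetical, and nothing here is taken from the change that closed this issue.

```python
import threading


class Connection:
    """Hypothetical connection that hands out one proxy per region path."""

    def __init__(self):
        self._regions = {}
        self._lock = threading.Lock()
        self.created = 0  # counts how often a region was actually created

    def _create_region(self, path):
        # Placeholder for the real creation call, which fails if the region exists.
        self.created += 1
        return object()

    def get_region_proxy(self, path):
        region = self._regions.get(path)
        if region is None:
            with self._lock:
                # Re-check under the lock: another thread may have created it meanwhile.
                region = self._regions.get(path)
                if region is None:
                    region = self._create_region(path)
                    self._regions[path] = region
        return region


conn = Connection()
threads = [
    threading.Thread(target=conn.get_region_proxy, args=("/debs",))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(conn.created)
```

Only the first caller pays for creation; later callers take the lock-free path once the region is cached.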
[jira] [Updated] (GEODE-120) RDD.saveToGemfire() can not handle big dataset (1M entries per partition)
[ https://issues.apache.org/jira/browse/GEODE-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qihong Chen updated GEODE-120:

Summary: RDD.saveToGemfire() can not handle big dataset (1M entries per partition) (was: RDD.saveToGemfire() can not handle big dataset (1M record per partition))

RDD.saveToGemfire() can not handle big dataset (1M entries per partition)

Key: GEODE-120
URL: https://issues.apache.org/jira/browse/GEODE-120
Project: Geode
Issue Type: Sub-task
Components: core, extensions
Affects Versions: 1.0.0-incubating
Reporter: Qihong Chen
Assignee: Qihong Chen
Original Estimate: 48h
Remaining Estimate: 48h

The connector uses a single region.putAll() call to save each RDD partition, but putAll() doesn't handle big datasets well (such as 1M records). We need to split the dataset into smaller chunks and invoke putAll() for each chunk.