[jira] [Commented] (KAFKA-7500) MirrorMaker 2.0 (KIP-382)

2019-09-18 Thread Qihong Chen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932616#comment-16932616
 ] 

Qihong Chen commented on KAFKA-7500:


Hi [~ryannedolan], thanks for your quick response!

Now I know what "primary" means. According to the blog [A look inside Kafka 
MirrorMaker2|https://blog.cloudera.com/a-look-inside-kafka-mirrormaker-2/] by 
Renu Tewari,
{quote}In MM2 only one connect cluster is needed for all the cross-cluster 
replications between a pair of datacenters. Now if we simply take a Kafka 
source and sink connector and deploy them in tandem to do replication, the data 
would need to hop through an intermediate Kafka cluster. MM2 avoids this 
unnecessary data copying by a direct passthrough from source to sink.  
{quote}
 This is exactly what I want! Do you have a release schedule for the 
*SinkConnector* and the *direct passthrough from source to sink* feature?

 

> MirrorMaker 2.0 (KIP-382)
> -
>
> Key: KAFKA-7500
> URL: https://issues.apache.org/jira/browse/KAFKA-7500
> Project: Kafka
>  Issue Type: New Feature
>  Components: KafkaConnect, mirrormaker
>Affects Versions: 2.4.0
>Reporter: Ryanne Dolan
>Assignee: Manikumar
>Priority: Major
>  Labels: pull-request-available, ready-to-commit
> Fix For: 2.4.0
>
> Attachments: Active-Active XDCR setup.png
>
>
> Implement a drop-in replacement for MirrorMaker leveraging the Connect 
> framework.
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0]
> [https://github.com/apache/kafka/pull/6295]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-7500) MirrorMaker 2.0 (KIP-382)

2019-09-17 Thread Qihong Chen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931743#comment-16931743
 ] 

Qihong Chen edited comment on KAFKA-7500 at 9/17/19 7:22 PM:
-

Hi [~ryannedolan], I just listened to your Kafka Power Chat talk, thanks!

I have a follow-up question about the last question (from me) you answered in 
the talk. You said you prefer a dedicated MM2 cluster over running MM2 in a 
Connect cluster, since it takes fewer clusters to replicate among multiple 
Kafka clusters.

But there's no REST API for a dedicated MM2 cluster that provides the status of 
the replication streams or lets you update the replication configuration. Any 
configuration change means updating the config files and restarting all MM2 
instances, is that right? Or did I miss that a dedicated MM2 cluster does 
provide a REST API for administration and monitoring? If so, where is it?

If my understanding is correct, we can achieve the same thing with MM2 in a 
Connect cluster. Assume there are three Kafka clusters: A, B, and C. Set up a 
Connect cluster against C (meaning all topics for the connectors' data and 
state go to cluster C), then set up MM2 connectors to replicate data and 
metadata A -> B and B -> A. If this is correct, we can use Kafka cluster C plus 
the Connect cluster running against it to replicate data among even more Kafka 
clusters, such as A, B, and D. Of course, this needs a more complicated 
configuration, which requires a deeper understanding of how the MM2 connectors 
work. In this scenario, the Connect cluster provides a REST API for 
administering and monitoring all the connectors. This would be useful for 
people who can't use Streams Replication Manager from Cloudera or Replicator 
from Confluent for some reason. Is this right?
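
For illustration, submitting MM2's source connector to an ordinary Connect 
cluster might look like the sketch below. The connector class name is real, but 
the connector name, cluster aliases, bootstrap hosts, and the exact property 
set are assumptions and depend on the MM2 version:
{code:java}
{
  "name": "mm2-source-A-to-B",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "A",
    "target.cluster.alias": "B",
    "source.cluster.bootstrap.servers": "kafka-a:9092",
    "target.cluster.bootstrap.servers": "kafka-b:9092",
    "topics": "test.*"
  }
}
{code}
POSTed to the Connect REST API's {{/connectors}} endpoint, this would run on 
the Connect cluster backed by C while copying A -> B; a second connector with 
the aliases swapped would handle B -> A, and MirrorCheckpointConnector could be 
added the same way.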









[jira] [Comment Edited] (KAFKA-7500) MirrorMaker 2.0 (KIP-382)

2019-09-11 Thread Qihong Chen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927863#comment-16927863
 ] 

Qihong Chen edited comment on KAFKA-7500 at 9/11/19 6:13 PM:
-

[~ryannedolan] Thanks for all your efforts on MM2. I'd appreciate your input on 
the following questions.

I'm trying to replicate topics from one cluster to another, including topic 
data and the related consumers' offsets. But I only see the topic data 
replicated, not the consumer offsets.

Here's my mm2.properties:
{code:java}
clusters = dr1, dr2

dr1.bootstrap.servers = 10.1.0.4:9092,10.1.0.5:9092,10.1.0.6:9092
dr2.bootstrap.servers = 10.2.0.4:9092,10.2.0.5:9092,10.2.0.6:9092

# only allow replication dr1 -> dr2
dr1->dr2.enabled = true
dr1->dr2.topics = test.*
dr1->dr2.groups = test.*
dr1->dr2.emit.heartbeats.enabled = false

dr2->dr1.enabled = false
dr2->dr1.emit.heartbeats.enabled = false
{code}
Here's how I started the MM2 cluster (with dr2 as the nearby cluster):
{code:java}
nohup bin/connect-mirror-maker.sh mm2.properties --clusters dr2 > mm2.log 2>&1 &
{code}
On dr1, there's a topic *test1* and a consumer group *test1grp* for topic test1.

On dr2, I found the following topics:
{code:java}
__consumer_offsets
dr1.checkpoints.internal
dr1.test1
heartbeats
mm2-configs.dr1.internal
mm2-offsets.dr1.internal
mm2-status.dr1.internal
 {code}
But I couldn't find any consumer group on dr2 related to consumer group 
*test1grp*.

Could you please let me know in detail how to migrate consumer group *test1grp* 
from dr1 to dr2, i.e., what command(s) I need to run to set up the offsets for 
*test1grp* on dr2 before consuming topic *dr1.test1*?

 

By the way, how do I set this up and run it in a Kafka Connect cluster, i.e., 
how do I set up MirrorSourceConnector and MirrorCheckpointConnector in a 
Connect cluster? Is there documentation about this?








[jira] [Created] (GEODE-137) Spark Connector: should connect to local GemFire server if possible

2015-07-17 Thread Qihong Chen (JIRA)
Qihong Chen created GEODE-137:
-

 Summary: Spark Connector: should connect to local GemFire server 
if possible
 Key: GEODE-137
 URL: https://issues.apache.org/jira/browse/GEODE-137
 Project: Geode
  Issue Type: Bug
Reporter: Qihong Chen
Assignee: Qihong Chen


DefaultGemFireConnection uses ClientCacheFactory with locator info to create 
the ClientCache instance. In this case, the ClientCache doesn't connect to the 
GemFire/Geode server on the same host even when there is one. This causes more 
network traffic and is less efficient.

ClientCacheFactory can also create a ClientCache from GemFire server(s) info 
directly. Therefore, we can force the ClientCache to connect to the local 
GemFire server when possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (GEODE-120) RDD.saveToGemfire() can not handle big dataset (1M entries per partition)

2015-07-17 Thread Qihong Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qihong Chen resolved GEODE-120.
---
Resolution: Fixed

 RDD.saveToGemfire() can not handle big dataset (1M entries per partition)
 -

 Key: GEODE-120
 URL: https://issues.apache.org/jira/browse/GEODE-120
 Project: Geode
  Issue Type: Sub-task
  Components: core, extensions
Affects Versions: 1.0.0-incubating
Reporter: Qihong Chen
Assignee: Qihong Chen
   Original Estimate: 48h
  Remaining Estimate: 48h

 the connector uses a single region.putAll() call to save each RDD partition. 
 But putAll() doesn't handle big datasets well (such as 1M records). We need to 
 split the dataset into smaller chunks and invoke putAll() for each chunk.
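
The chunking fix can be sketched independently of GemFire. This is a minimal 
illustration of the approach, not the connector's actual code; the method names 
and the 1000-entry chunk size are illustrative, with the real `region.putAll` 
abstracted as a callback:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class ChunkedPutAll {

    // Split an iterator of key/value pairs into fixed-size chunks and invoke
    // the putAll-style callback once per chunk, instead of issuing one huge
    // putAll for the entire RDD partition.
    static <K, V> int saveInChunks(Iterator<Map.Entry<K, V>> entries,
                                   int chunkSize,
                                   Consumer<Map<K, V>> putAll) {
        int chunks = 0;
        Map<K, V> chunk = new HashMap<>();
        while (entries.hasNext()) {
            Map.Entry<K, V> e = entries.next();
            chunk.put(e.getKey(), e.getValue());
            if (chunk.size() >= chunkSize) {
                putAll.accept(chunk);   // region.putAll(chunk) in a real writer
                chunk = new HashMap<>();
                chunks++;
            }
        }
        if (!chunk.isEmpty()) {         // flush the final partial chunk
            putAll.accept(chunk);
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Simulate a partition of 2500 entries, saved in chunks of 1000.
        Map<Integer, String> data = new HashMap<>();
        for (int i = 0; i < 2500; i++) data.put(i, "v" + i);

        List<Integer> sizes = new ArrayList<>();
        int chunks = saveInChunks(data.entrySet().iterator(), 1000,
                                  m -> sizes.add(m.size()));
        System.out.println(chunks);  // 3 chunks: 1000 + 1000 + 500
        System.out.println(sizes.stream().mapToInt(Integer::intValue).sum());
    }
}
```

Each putAll call then carries a bounded payload, so one oversized batch can no 
longer overwhelm the server.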





[jira] [Closed] (GEODE-114) There's race condition in DefaultGemFireConnection.getRegionProxy

2015-07-17 Thread Qihong Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qihong Chen closed GEODE-114.
-

 There's race condition in DefaultGemFireConnection.getRegionProxy
 -

 Key: GEODE-114
 URL: https://issues.apache.org/jira/browse/GEODE-114
 Project: Geode
  Issue Type: Sub-task
  Components: core, extensions
Affects Versions: 1.0.0-incubating
Reporter: Qihong Chen
Assignee: Qihong Chen
 Fix For: 1.0.0-incubating

   Original Estimate: 24h
  Remaining Estimate: 24h

 when multiple threads try to call getRegionProxy with the same region at the 
 same time, the following exception is thrown:
 com.gemstone.gemfire.cache.RegionExistsException: /debs
 at 
 com.gemstone.gemfire.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:2880)
 at 
 com.gemstone.gemfire.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2835)
 at 
 com.gemstone.gemfire.cache.client.internal.ClientRegionFactoryImpl.create(ClientRegionFactoryImpl.java:223)
 at 
 io.pivotal.gemfire.spark.connector.internal.DefaultGemFireConnection.getRegionProxy(DefaultGemFireConnection.scala:87)
 at 
 io.pivotal.gemfire.spark.connector.internal.rdd.GemFirePairRDDWriter.write(GemFireRDDWriter.scala:47)
 at 
 io.pivotal.gemfire.spark.connector.GemFirePairRDDFunctions$$anonfun$saveToGemfire$2.apply(GemFirePairRDDFunctions.scala:24)
 at 
 io.pivotal.gemfire.spark.connector.GemFirePairRDDFunctions$$anonfun$saveToGemfire$2.apply(GemFirePairRDDFunctions.scala:24)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)





[jira] [Updated] (GEODE-120) RDD.saveToGemfire() can not handle big dataset (1M entries per partition)

2015-07-16 Thread Qihong Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEODE-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qihong Chen updated GEODE-120:
--
Summary: RDD.saveToGemfire() can not handle big dataset (1M entries per 
partition)  (was: RDD.saveToGemfire() can not handle big dataset (1M record per 
partition))



