Cody, you're right, that was an example. The target architecture would be
3 DCs :)
Good point on ZK, I'll have to check that.

About Spark, both instances will run at the same time but on different
topics. It would be quite useless to have the 2 DCs working on the same set
of data.
I just want the healthy Spark cluster, in case of a crash, to work on all
topics (i.e. take over the dead cluster's load).
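To give an idea, here is a rough sketch of what I have in mind on the Spark
side (Spark 1.6 Scala API with the Kafka direct stream; the topic names and
the isOtherDcDown() health check are made up, not existing APIs):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Dc1StreamingJob {

  // Hypothetical health check: e.g. a flag kept in ZooKeeper or Cassandra,
  // flipped by an operator or a watchdog when DC2 is declared dead.
  def isOtherDcDown(): Boolean = false

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dc1-streaming")
    val ssc  = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    val ownTopics   = Set("events_dc1")   // nominal load of DC1
    val otherTopics = Set("events_dc2")   // nominal load of DC2
    val topics = if (isOtherDcDown()) ownTopics ++ otherTopics else ownTopics

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // idempotent processing / write goes here
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The job (or at least the stream) would have to be restarted when the flag
changes, since the topic set is fixed at startup.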

Does it seem like an awkward design?

On Tue, Apr 19, 2016 at 5:38 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Maybe I'm missing something, but I don't see how you get a quorum in only
> 2 datacenters (without split-brain problems, etc.).  I also don't know how
> well ZK will work cross-datacenter.
>
> As far as the Spark side of things goes, if it's idempotent, why not just
> run both instances all the time?
>
>
>
> On Tue, Apr 19, 2016 at 10:21 AM, Erwan ALLAIN <eallain.po...@gmail.com>
> wrote:
>
>> I'm describing a disaster recovery scenario, but it can also be used to
>> take one datacenter offline for an upgrade, for instance.
>>
>> From my point of view, when DC2 crashes:
>>
>> *On the Kafka side:*
>> - the Kafka cluster will lose one or more brokers (partition leaders and
>> replicas)
>> - lost partition leaders will be re-elected in the remaining healthy DC
>>
>> => if the number of in-sync replicas is above the minimum threshold,
>> Kafka should remain operational
>>
>> *On the downstream datastore side (say Cassandra, for instance):*
>> - deployed across the 2 DCs with QUORUM reads / QUORUM writes
>> - idempotent writes
>>
>> => it should be OK (depends on the replication factor)
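>>
>> To be concrete, the idempotent write I have in mind looks roughly like
>> this (spark-cassandra-connector syntax; keyspace, table and column names
>> are made up). A Cassandra write is an upsert keyed by the primary key, so
>> replaying the same Kafka records after a restart just overwrites the same
>> rows:
>>
>> import com.datastax.spark.connector._
>> import org.apache.spark.streaming.dstream.DStream
>>
>> // event_id is a natural key taken from the message itself, so a replayed
>> // message lands on the same row (idempotent).
>> case class Event(event_id: String, payload: String)
>>
>> def writeIdempotent(events: DStream[Event]): Unit =
>>   events.foreachRDD { rdd =>
>>     // QUORUM writes via spark.cassandra.output.consistency.level=QUORUM
>>     // in the Spark conf (connector setting).
>>     rdd.saveToCassandra("my_ks", "events", SomeColumns("event_id", "payload"))
>>   }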
>>
>> *On the Spark side:*
>> - processing should be idempotent, which will allow us to restart from
>> the last committed offset
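>>
>> For the offsets, the pattern would be the usual direct-stream one: read
>> the offset ranges from each RDD, store them alongside the data, and pass
>> them back as fromOffsets when the (surviving) job starts. This is a rough
>> sketch only; loadOffsets()/saveOffsets() are hypothetical helpers backed
>> by Cassandra or ZK, and ssc/kafkaParams are the same as in the nominal job:
>>
>> import kafka.common.TopicAndPartition
>> import kafka.message.MessageAndMetadata
>> import kafka.serializer.StringDecoder
>> import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
>>
>> val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()   // hypothetical
>>
>> val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
>>   StringDecoder, (String, String)](
>>   ssc, kafkaParams, fromOffsets,
>>   (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
>>
>> stream.foreachRDD { rdd =>
>>   val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>>   // ... idempotent write of the batch ...
>>   saveOffsets(offsets)   // hypothetical, persisted together with the data
>> }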
>>
>> I understand that starting up a post-crash job would work.
>>
>> The question is: how can we detect that DC2 has crashed in order to
>> start a new job?
>>
>> Could dynamic topic-partition discovery (at each KafkaRDD creation, for
>> instance) + topic subscription be the answer?
>>
>> I appreciate your effort.
>>
>> On Tue, Apr 19, 2016 at 4:25 PM, Jason Nerothin <jasonnerot...@gmail.com>
>> wrote:
>>
>>> Is the main concern uptime or disaster recovery?
>>>
>>> On Apr 19, 2016, at 9:12 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> I think the bigger question is what happens to Kafka and your downstream
>>> data store when DC2 crashes.
>>>
>>> From a Spark point of view, starting up a post-crash job in a new data
>>> center isn't really different from starting up a post-crash job in the
>>> original data center.
>>>
>>> On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Jason and Cody. I'll try to explain the multi-DC case a bit
>>>> better.
>>>>
>>>> As I mentioned before, I'm planning to use one Kafka cluster and 2 or
>>>> more distinct Spark clusters.
>>>>
>>>> Let's say we have the following DC configuration in the nominal case:
>>>> Kafka partitions are consumed uniformly by the 2 datacenters.
>>>>
>>>> DataCenter | Spark      | Kafka Consumer Group | Kafka partition (P1 to P4)
>>>> DC 1       | Master 1.1 |                      |
>>>>            | Worker 1.1 | my_group             | P1
>>>>            | Worker 1.2 | my_group             | P2
>>>> DC 2       | Master 2.1 |                      |
>>>>            | Worker 2.1 | my_group             | P3
>>>>            | Worker 2.2 | my_group             | P4
>>>> In case of a DC crash, I would like the partitions to be rebalanced onto
>>>> the healthy DC, something as follows:
>>>>
>>>> DataCenter | Spark      | Kafka Consumer Group | Kafka partition (P1 to P4)
>>>> DC 1       | Master 1.1 |                      |
>>>>            | Worker 1.1 | my_group             | P1*, P3*
>>>>            | Worker 1.2 | my_group             | P2*, P4*
>>>> DC 2       | Master 2.1 |                      |
>>>>            | Worker 2.1 | my_group             | P3
>>>>            | Worker 2.2 | my_group             | P4
>>>>
>>>> I would like to know if this is possible:
>>>> - using a consumer group?
>>>> - using the direct approach? I prefer this one as I don't want to
>>>> activate the WAL.
>>>>
>>>> Hope the explanation is better!
>>>>
>>>>
>>>> On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> The current direct stream only handles exactly the partitions
>>>>> specified at startup.  You'd have to restart the job if you changed
>>>>> partitions.
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
>>>>> towards using the Kafka 0.10 consumer, which would allow for dynamic
>>>>> topic partitions.
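>>>>>
>>>>> For reference, this is the kind of dynamic subscription the new consumer
>>>>> allows (plain Kafka 0.9/0.10 consumer API, independent of Spark; the
>>>>> topic pattern and group name are just examples):
>>>>>
>>>>> import java.util.Properties
>>>>> import java.util.regex.Pattern
>>>>> import java.util.{Collection => JCollection}
>>>>> import org.apache.kafka.clients.consumer.{ConsumerRebalanceListener, KafkaConsumer}
>>>>> import org.apache.kafka.common.TopicPartition
>>>>>
>>>>> val props = new Properties()
>>>>> props.put("bootstrap.servers", "broker1:9092")
>>>>> props.put("group.id", "my_group")
>>>>> props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
>>>>> props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
>>>>>
>>>>> val consumer = new KafkaConsumer[String, String](props)
>>>>> // Pattern subscription: partitions of any matching topic, including
>>>>> // topics created later, are assigned/revoked via rebalance callbacks.
>>>>> consumer.subscribe(Pattern.compile("events_dc.*"), new ConsumerRebalanceListener {
>>>>>   override def onPartitionsAssigned(partitions: JCollection[TopicPartition]): Unit = ()
>>>>>   override def onPartitionsRevoked(partitions: JCollection[TopicPartition]): Unit = ()
>>>>> })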
>>>>>
>>>>> Regarding your multi-DC questions, I'm not really clear on what you're
>>>>> saying.
>>>>>
>>>>> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <
>>>>> eallain.po...@gmail.com> wrote:
>>>>> > Hello,
>>>>> >
>>>>> > I'm currently designing a solution where 2 distinct Spark clusters (2
>>>>> > datacenters) share the same Kafka cluster (Kafka rack-aware or manual
>>>>> > broker placement).
>>>>> > The aims are:
>>>>> > - surviving a DC crash: using Kafka resiliency and the consumer group
>>>>> > mechanism (or something else?)
>>>>> > - keeping offsets consistent among replicas (vs. MirrorMaker, which
>>>>> > does not preserve offsets)
>>>>> >
>>>>> > I have several questions
>>>>> >
>>>>> > 1) Dynamic repartitioning (one or 2 DCs)
>>>>> >
>>>>> > I'm using KafkaDirectStream, which maps one Kafka partition to one
>>>>> > Spark partition. Is it possible to handle new or removed partitions?
>>>>> > In the compute method, it looks like we always use the currentOffsets
>>>>> > map to query the next batch, so it's always the same number of
>>>>> > partitions? Can we request metadata at each batch?
>>>>> >
>>>>> > 2) Multi DC Spark
>>>>> >
>>>>> > Using the direct approach, a way to achieve this would be:
>>>>> > - "assign" (Kafka 0.9 term) all topics to the 2 Spark clusters
>>>>> > - only one of them reads the partitions (check every x interval, with
>>>>> > a "lock" stored in Cassandra for instance)
>>>>> >
>>>>> > => not sure if it works, just an idea (rough sketch below)
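>>>>> >
>>>>> > A very rough sketch of that "lock in Cassandra" idea, using a
>>>>> > lightweight transaction with a TTL so the lock expires by itself if
>>>>> > the holding DC dies (table and column names are made up):
>>>>> >
>>>>> > import com.datastax.driver.core.Cluster
>>>>> >
>>>>> > val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
>>>>> > val session = cluster.connect("my_ks")
>>>>> >
>>>>> > // True if the lock was free and this DC grabbed it; renewing it would
>>>>> > // need an UPDATE ... IF owner = ? (not shown).
>>>>> > def tryAcquireLock(dc: String): Boolean = {
>>>>> >   val rs = session.execute(
>>>>> >     "INSERT INTO stream_lock (name, owner) VALUES ('kafka-consumer', ?) " +
>>>>> >       "IF NOT EXISTS USING TTL 60", dc)
>>>>> >   rs.wasApplied()
>>>>> > }
>>>>> >
>>>>> > Every x interval, the DC that holds (and keeps renewing) the lock reads
>>>>> > the partitions; the other one stays idle.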
>>>>> >
>>>>> > Using a consumer group:
>>>>> > - commit offsets manually at the end of the batch
>>>>> >
>>>>> > => does Spark handle partition rebalancing?
>>>>> >
>>>>> > I'd appreciate any ideas! Let me know if it's not clear.
>>>>> >
>>>>> > Erwan
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
