Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Let me be more detailed in my response: Kafka works on “at least once” semantics. Therefore, given your assumption that Kafka "will be operational", we can assume that at least once semantics will hold. At this point, it comes down to designing for consumer (really Spark Executor) resilience.
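At-least-once delivery implies the executor can see the same record more than once (e.g. after a failure and offset replay), so the processing step itself has to tolerate duplicates. A minimal sketch of offset-based deduplication, with a plain dict standing in for the downstream store (all names here are hypothetical, not from this thread):

```python
# Hypothetical sketch: deduplicate by (partition, offset) so records
# redelivered under at-least-once semantics are applied only once.

class IdempotentSink:
    def __init__(self):
        self.applied = set()   # (partition, offset) pairs already processed
        self.store = {}        # stand-in for the real downstream store

    def process(self, partition, offset, key, value):
        """Apply the record once; return False on a duplicate delivery."""
        record_id = (partition, offset)
        if record_id in self.applied:
            return False       # duplicate: skip the side effect
        self.store[key] = value
        self.applied.add(record_id)
        return True
```

Redelivering the same (partition, offset) is then a no-op, which is what makes a crash-and-replay of the executor safe.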

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
At that scale, it’s best not to do coordination at the application layer. How much of your data is transactional in nature {all, some, none}? By which I mean ACID-compliant.

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Cody, you're right, that was an example. Target architecture would be 3 DCs :) Good point on ZK, I'll have to check that. About Spark, both instances will run at the same time but on different topics. It would be quite useless to have 2 DCs working on the same set of data. I just want, in case

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
Maybe I'm missing something, but I don't see how you get a quorum in only 2 datacenters (without a split-brain problem, etc). I also don't know how well ZK will work cross-datacenter. As far as the Spark side of things goes, if it's idempotent, why not just run both instances all the time.
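Cody's quorum point can be made concrete: a quorum is a strict majority of the whole ensemble, and with nodes split evenly across two DCs, neither side holds a majority once the other is unreachable. A small sketch (node counts illustrative, not from the thread):

```python
# Sketch: a quorum is a strict majority of the total ensemble size.

def quorum(ensemble_size):
    return ensemble_size // 2 + 1

def has_quorum(alive_votes, ensemble_size):
    return alive_votes >= quorum(ensemble_size)
```

With 3 ZK nodes per DC across 2 DCs (ensemble of 6, quorum 4), losing either DC leaves only 3 votes; even a single tiebreaker node in a third DC (ensemble of 7, quorum 4) keeps 4 votes available after the same failure, which is why the usual answer is three sites rather than two.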

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
I'm describing a disaster recovery scenario, but it can also be used to take one datacenter offline, for an upgrade for instance. From my point of view, when DC2 crashes: *On Kafka side:* - the kafka cluster will lose one or more brokers (partition leader and replica) - lost partition leaders will be re-elected in the
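Whether the re-election described above loses data depends on replication settings; a sketch of the relevant Kafka knobs (values illustrative, and topic, broker, and producer settings are mixed here for brevity):

```properties
# Topic-level: keep copies of every partition in more than one DC
# (together with rack-aware or manual replica placement).
replication.factor=3
min.insync.replicas=2

# Broker-level: on failover, refuse to elect an out-of-sync replica
# as leader rather than silently lose acknowledged messages.
unclean.leader.election.enable=false

# Producer-level: wait for all in-sync replicas before acknowledging.
acks=all
```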

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Is the main concern uptime or disaster recovery?

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
I think the bigger question is what happens to Kafka and your downstream data store when DC2 crashes. From a Spark point of view, starting up a post-crash job in a new data center isn't really different from starting up a post-crash job in the original data center.

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Thanks Jason and Cody. I'll try to explain the multi-DC case a bit better. As I mentioned before, I'm planning to use one Kafka cluster and 2 or more distinct Spark clusters. Let's say we have the following DC configuration in a nominal case. Kafka partitions are consumed uniformly by the 2

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-18 Thread Cody Koeninger
The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions. https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work towards using the Kafka 0.10 consumer, which would allow for dynamic topic partitions
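A shape-only sketch of what "specified at startup" means for the 0.8-style direct stream, here via the Python API (broker addresses and topic name are made up; actually running the stream needs a Spark 1.x cluster, so the builder below is defined but never invoked):

```python
# Shape-only sketch: the direct stream resolves the topic/partition set
# once, when the stream is created.

def build_stream(ssc, brokers):
    from pyspark.streaming.kafka import KafkaUtils  # Spark 1.3+ Python API

    # The partition set for "events" is captured here, at startup.
    # Partitions added to the topic later are ignored until the job is
    # restarted (SPARK-12177 tracks the Kafka 0.10 consumer work that
    # would lift this limitation).
    return KafkaUtils.createDirectStream(
        ssc,
        topics=["events"],
        kafkaParams={"metadata.broker.list": brokers})
```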

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan, You might consider InsightEdge: http://insightedge.io. It has the capability of doing WAN between data grids and would save you the work of having to re-invent the wheel. Additionally, RDDs can be shared between developers in the same DC. Thanks, Jason

Spark Streaming / Kafka Direct Approach: Dynamic Partitioning / Multi DC Spark

2016-04-18 Thread Erwan ALLAIN
Hello, I'm currently designing a solution where 2 distinct Spark clusters (2 datacenters) share the same Kafka cluster (Kafka rack-aware or manual broker repartition). The aims are - surviving a DC crash: using Kafka resiliency and the consumer group mechanism (or else?) - keeping consistent offsets among
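For the "consistent offset" goal, one common pattern (not necessarily what Erwan ends up with) is to commit results and the next offset in the same transaction in the downstream store, so whichever DC's consumer restarts can resume exactly where the written results end. A minimal sketch with an in-memory stand-in for that store (all names hypothetical):

```python
# Sketch: results and the next offset are written in one atomic step, so
# a restarted consumer (in either DC) resumes where the results end.
import threading

class TransactionalStore:
    def __init__(self):
        self._lock = threading.Lock()
        self.results = {}
        self.resume_offset = 0

    def commit(self, updates, next_offset):
        """Stand-in for a single DB transaction covering data + offset."""
        with self._lock:
            self.results.update(updates)
            self.resume_offset = next_offset
```

Because data and offset can never diverge, a replayed batch either fully re-applies (same result) or was already committed, which pairs naturally with the idempotent-processing requirement discussed earlier in the thread.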