Re: Failure handling

2017-01-25 Thread Erwan ALLAIN
xit ? It'll get > retried on another executor, but as long as that one fails the same > way... > > If you can identify the error case at the time you're doing database > interaction and just prevent data being written then, that's what I > typically do. > > On

Failure handling

2017-01-24 Thread Erwan ALLAIN
Hello guys, I have a question regarding how Spark handles failure. I'm using a Kafka direct stream with Spark 2.0.2 and Kafka 0.10.0.1. Here is a snippet of code: val stream = createDirectStream(….) stream .map(…) .foreachRDD( doSomething) stream .map(…) .foreachRDD( doSomethingElse) The execution is in
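A minimal sketch of the pattern described above, assuming Spark 2.0.2 with the spark-streaming-kafka-0-10 connector; the broker address, topic name, and doSomething / doSomethingElse are placeholders, not from the original post.

```scala
// One Kafka direct stream feeding two independent output actions. Each batch
// is scheduled as two separate jobs, so failed tasks in one action are retried
// (per spark.task.maxFailures) without stopping the other action's job.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TwoOutputsSketch {
  def doSomething(rdd: RDD[String]): Unit = ()      // placeholder side effect
  def doSomethingElse(rdd: RDD[Int]): Unit = ()     // placeholder side effect

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("two-outputs"), Seconds(1))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer"  -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("topic"), kafkaParams))

    // Two output operations on the same stream, as in the question above.
    stream.map(_.value).foreachRDD(rdd => doSomething(rdd))
    stream.map(_.value.length).foreachRDD(rdd => doSomethingElse(rdd))

    ssc.start()
    ssc.awaitTermination()
  }
}
```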

How to use logback

2016-11-28 Thread Erwan ALLAIN
Hello, In my project I would like to use logback as the logging framework (faster, smaller memory footprint, etc.). I have managed to make it work; however, I had to modify the Spark jars folder - remove slf4j-log4jxx.jar - add logback-classic / logback-core.jar And add logback.xml in the conf folder. Is it th
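One alternative to editing Spark's jars folder, sketched below under the assumption that the standard classpath-precedence flags are acceptable in your deployment; all paths are placeholders, not from the original post.

```shell
# Ship logback on the user classpath and ask Spark to prefer it over the
# bundled slf4j-log4j binding, instead of mutating the jars folder.
spark-submit \
  --conf spark.driver.extraClassPath=/path/to/logback-classic.jar:/path/to/logback-core.jar \
  --conf spark.executor.extraClassPath=/path/to/logback-classic.jar:/path/to/logback-core.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --files /path/to/logback.xml \
  --class com.example.Main app.jar
```

userClassPathFirst can cause classloading conflicts with other user jars, so the jars-folder approach from the message above may still be the pragmatic choice on a cluster you control.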

Application config management

2016-11-09 Thread Erwan ALLAIN
Hi everyone, I'd like to know what kind of configuration mechanism is used in general. Below is what I'm going to implement, but I'd like to know if there is any "standard way": 1) put the configuration in HDFS 2) specify extraJavaOptions (driver and worker) with the HDFS url ( hdfs://ip:port/config)
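A sketch of step 2 above: passing the HDFS location of the config file to the driver and executors via extraJavaOptions, to be read back as a system property. The property name (config.url) and the paths are illustrative assumptions.

```shell
# Driver and executors each receive -Dconfig.url, which application code can
# read with sys.props("config.url") and fetch from HDFS at startup.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dconfig.url=hdfs://namenode:8020/app/config.properties" \
  --conf "spark.executor.extraJavaOptions=-Dconfig.url=hdfs://namenode:8020/app/config.properties" \
  --class com.example.Main app.jar
```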

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
ems from the map) > > > > On Fri, Oct 21, 2016 at 3:32 AM, Erwan ALLAIN > wrote: > > Hi, > > > > I'm currently implementing an exactly once mechanism based on the > following > > example: > > > > https://github.com/koeninger/kafka-exactly

Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
Hi, I'm currently implementing an exactly-once mechanism based on the following example: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala The pseudo code is as follows: dstream.transform (store offsets in a variable on the driver side ) ds
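A sketch of the transactional-per-batch pattern the linked example describes: capture the batch's offset ranges on the driver inside transform(), then commit results and offsets together so a replayed batch either fully repeats or is fully skipped. saveResultsAndOffsets is a hypothetical DAO method, and collect() assumes small batches; neither is from the original post.

```scala
// Assumes the spark-streaming-kafka-0-10 connector, where the RDDs produced
// by a direct stream implement HasOffsetRanges.
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

val withOffsets = stream.transform { rdd =>
  // transform() runs on the driver for every batch, before any action,
  // which is why the offsets can be stashed in a driver-side variable.
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}

withOffsets.map(record => record.value).foreachRDD { rdd =>
  val results = rdd.collect()
  // One database transaction: write the results AND the ending offsets
  // atomically. On restart, read the stored offsets back and seed
  // createDirectStream with them instead of relying on Kafka's commits.
  saveResultsAndOffsets(results, offsetRanges)
}
```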

Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Erwan ALLAIN
Hi, I'm working with - Kafka 0.8.2 - Spark Streaming (2.0) direct input stream - Cassandra 3.0. My batch interval is 1s. When I use some map, filter, even saveToCassandra functions, the processing time is around 50ms on empty batches => This is fine. As soon as I use some reduceByKey, the proces
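One common workaround for the symptom above, sketched under the assumption that skipping empty batches is acceptable: reduceByKey schedules shuffle tasks even for zero records, so checking for emptiness on the driver avoids the shuffle entirely. writeToCassandra and the key extraction are illustrative placeholders.

```scala
stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {            // cheap driver-side check; skips empty batches
    val reduced = rdd
      .map(record => (record.key, 1L))
      .reduceByKey(_ + _)          // shuffle now only runs on non-empty batches
    writeToCassandra(reduced)      // hypothetical sink
  }
}
```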

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
ut splitbrain problem, etc). I also don't know how > well ZK will work cross-datacenter. > > As far as the spark side of things goes, if it's idempotent, why not just > run both instances all the time. > > > > On Tue, Apr 19, 2016 at 10:21 AM, Erwan ALLAIN > wr

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
hat happens to Kafka and your downstream > data store when DC2 crashes. > > From a Spark point of view, starting up a post-crash job in a new data > center isn't really different from starting up a post-crash job in the > original data center. > > On Tue, Apr 19, 2016 at 3:3

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
you changed > partitions. > > https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work > towards using the kafka 0.10 consumer, which would allow for dynamic > topic partitions > > Regarding your multi-DC questions, I'm not really clear on what you're &g

Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Erwan ALLAIN
Hello, I'm currently designing a solution where two distinct Spark clusters (2 datacenters) share the same Kafka cluster (Kafka rack aware or manual broker repartition). The aims are - preventing DC crash: using kafka resiliency and consumer group mechanism (or else ?) - keeping consistent offset among repl

Re: Join and HashPartitioner question

2015-11-16 Thread Erwan ALLAIN
You may need to persist r1 after the partitionBy call; the second join will then be more efficient. On Mon, Nov 16, 2015 at 2:48 PM, Rishi Mishra wrote: > AFAIK and can see in the code both of them should behave same. > > On Sat, Nov 14, 2015 at 2:10 AM, Alexander Pivovarov > wrote: > >> Hi Everyone >> >> I
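A sketch of the advice above: pre-partition r1 with a HashPartitioner and persist it, so repeated joins against it reuse the partitioning instead of reshuffling r1 for each join. The RDD names and partition count are illustrative; r1, r2, r3 are assumed to be pair RDDs.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val partitioner = new HashPartitioner(100)

// Without persist, the partitionBy shuffle would be recomputed per join.
val r1Partitioned = r1.partitionBy(partitioner).persist(StorageLevel.MEMORY_ONLY)

// Both joins see r1 already hash-partitioned and cached, so only r2 and r3
// need to be shuffled to r1's partitioning.
val j1 = r1Partitioned.join(r2)
val j2 = r1Partitioned.join(r3)
```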

Re: Saving offset while reading from kafka

2015-10-23 Thread Erwan ALLAIN
Have a look at this: https://github.com/koeninger/kafka-exactly-once especially: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartiti

Re: Best practices to handle corrupted records

2015-10-16 Thread Erwan ALLAIN
ady looked into it and also at 'Try' from which I > got inspired. Thanks for pointing it out anyway! > > #A.M. > > On 15 Oct 2015, at 16:19, Erwan ALLAIN < > eallain.po...@gmail.com> wrote: > > What about http://www.scala-lang.org/api/2

Re: Best practices to handle corrupted records

2015-10-15 Thread Erwan ALLAIN
What about http://www.scala-lang.org/api/2.9.3/scala/Either.html ? On Thu, Oct 15, 2015 at 2:57 PM, Roberto Congiu wrote: > I came to a similar solution to a similar problem. I deal with a lot of > CSV files from many different sources and they are often malformed. > However, I just have succes
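A minimal sketch of the Either-based approach suggested above: parse each record into Right(value) on success or Left(error) on failure, then split the results so corrupted records can be inspected instead of crashing the job. The record format here is invented for illustration.

```scala
// Parse one record; Left carries a diagnostic, Right the parsed value.
def parseRecord(line: String): Either[String, Int] =
  try Right(line.trim.toInt)
  catch { case _: NumberFormatException => Left(s"malformed record: $line") }

val lines = Seq("42", "7", "oops", "13")
val parsed = lines.map(parseRecord)

// Split good values from errors; in Spark the same pattern works with
// rdd.map(parseRecord) followed by two filtered branches.
val good = parsed.collect { case Right(v) => v }   // Seq(42, 7, 13)
val bad  = parsed.collect { case Left(e)  => e }   // the malformed record
```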

Re: does KafkaCluster can be public ?

2015-10-07 Thread Erwan ALLAIN
ackwards >>> > compatibility, but if enough people ask for it... ? >>> > >>> > On Tue, Oct 6, 2015 at 6:51 AM, Jonathan Coveney >>> wrote: >>> >> >>> >> You can put a class in the org.apache.spark namespace to access >>

does KafkaCluster can be public ?

2015-10-06 Thread Erwan ALLAIN
have to do is almost the same as the KafkaCluster which is private. Is it possible to: - add another signature in KafkaUtils ? - make KafkaCluster public ? Or do you have any other smart solution where I don't need to copy/paste KafkaCluster ? Thanks. Regards, Erwan ALLAIN
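One workaround raised in the replies ("You can put a class in the org.apache.spark namespace to access"), sketched for the 0.8 connector where KafkaCluster was private[spark]; the package path and constructor shape are assumptions about that connector, not guaranteed across Spark versions.

```scala
// Declaring a helper inside Spark's own package makes the package-private
// KafkaCluster reachable without copy/pasting its source.
package org.apache.spark.streaming.kafka

object KafkaClusterAccessor {
  // Re-expose construction of the otherwise-private KafkaCluster.
  def apply(kafkaParams: Map[String, String]): KafkaCluster =
    new KafkaCluster(kafkaParams)
}
```

This is brittle by design: it depends on Spark internals and can break on any upgrade, which is exactly why the thread asks for a public API instead.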