So it worked quite well with a coalesce, and I was able to find a solution to my problem: although it doesn't directly control which executor handles the work, a good workaround was to assign the desired partition to a specific stream through the assign strategy and coalesce to a single partition, then repeat the same process for the remaining topics on different streams, and at the end do a union of these streams.
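A minimal sketch of the workaround described above (not the exact code from this thread): one stream pinned to a single partition of each topic via the "assign" option, coalesce(1) on each, then a union. The bootstrap servers, topic names (t0, t1), partition indices, and the console sink are placeholders, and the assign JSON uses the {"topic":[partitions]} form from the Spark Kafka integration docs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AssignCoalesceUnion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("assign-coalesce-union")
      .getOrCreate()

    // One narrow stream pinned to a single partition index of each topic.
    def streamFor(assignJson: String): DataFrame =
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
        .option("assign", assignJson)
        .load()
        .coalesce(1) // narrow dependency, so no shuffle

    val s0 = streamFor("""{"t0":[0],"t1":[0]}""")
    val s1 = streamFor("""{"t0":[1],"t1":[1]}""")

    // union is also a narrow operation: each input keeps its single partition.
    s0.union(s1)
      .writeStream
      .format("console") // stand-in for the Cassandra writer in the thread
      .start()
      .awaitTermination()
  }
}
```

Since coalesce and union are both narrow transformations, the whole pipeline stays shuffle-free, which is the point of the workaround.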
PS: No shuffle occurred during the whole process, since the RDD partitions were collapsed to a single one.

On March 17, 2017 at 8:04 PM, "Michael Armbrust" <mich...@databricks.com> wrote:

> Another option that would avoid a shuffle would be to use assign and
> coalesce, running two separate streams.
>
> spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "...")
>   .option("assign", """{t0: {"0": xxxx}, t1: {"0": xxxxx}}""")
>   .load()
>   .coalesce(1)
>   .writeStream
>   .foreach(... code to write to cassandra ...)
>
> spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "...")
>   .option("assign", """{t0: {"1": xxxx}, t1: {"1": xxxxx}}""")
>   .load()
>   .coalesce(1)
>   .writeStream
>   .foreach(... code to write to cassandra ...)
>
> On Fri, Mar 17, 2017 at 7:35 AM, OUASSAIDI, Sami <sami.ouassa...@mind7.fr> wrote:
>
>> @Cody: Duly noted.
>> @Michael Armbrust: A repartition is out of the question for our project,
>> as it would be a fairly expensive operation. We tried looking into
>> targeting a specific executor so as to avoid this extra cost and directly
>> have well-partitioned data after consuming the Kafka topics. Also, we are
>> using Spark Streaming to save to the Cassandra DB and try to keep shuffle
>> operations to a strict minimum (at best none). As of now we are not
>> entirely pleased with our current performance, which is why I'm doing a
>> Kafka topic sharding POC, and getting the executor to handle the specified
>> partitions is central.
>>
>> 2017-03-17 9:14 GMT+01:00 Michael Armbrust <mich...@databricks.com>:
>>
>>> Sorry, typo. Should be a repartition, not a groupBy.
>>>
>>>> spark.readStream
>>>>   .format("kafka")
>>>>   .option("kafka.bootstrap.servers", "...")
>>>>   .option("subscribe", "t0,t1")
>>>>   .load()
>>>>   .repartition($"partition")
>>>>   .writeStream
>>>>   .foreach(... code to write to cassandra ...)
>>
>> --
>> *Mind7 Consulting*
>>
>> Sami Ouassaid | Consultant Big Data | sami.ouassa...@mind7.com
>> __
>>
>> 64 Rue Taitbout, 75009 Paris
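For contrast, a minimal sketch of the repartition alternative quoted in the thread: subscribe to both topics at once and repartition by the Kafka "partition" column so each output task sees one partition's records. Servers, topics, and the console sink are placeholders; unlike assign + coalesce, this path does incur a full shuffle.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RepartitionByKafkaPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-by-partition")
      .getOrCreate()

    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
      .option("subscribe", "t0,t1")
      .load()
      .repartition(col("partition")) // wide dependency: triggers a shuffle
      .writeStream
      .format("console") // stand-in for the Cassandra writer
      .start()
      .awaitTermination()
  }
}
```

This is the simpler pipeline, but the shuffle cost is exactly what the original poster was trying to avoid.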