Re: Kafka - streaming from multiple topics

2015-12-20 Thread Chris Fregly
separating out your code into separate streaming jobs - especially when there 
are no dependencies between the jobs - is almost always the best route.  it's 
easier to combine atoms (fusion) than to split them (fission).

I recommend splitting out jobs along batch window, stream window, and 
state-tracking characteristics.

for example, imagine 3 separate jobs for the following:

1) light storing of raw data into Cassandra (500ms batch interval)
2) medium aggregations/window roll ups (2000ms batch interval)
3) heavy ML model training (1ms batch interval). 

and a reminder that you can control (isolate or combine) the spark resources used 
by these separate, single-purpose streaming jobs using scheduler pools, just 
like your batch spark jobs.
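
a minimal sketch of that wiring, assuming the standard fair-scheduler setup - 
the pool name, allocation-file path, and the socket source/println sink are all 
placeholders:

  // submit with:
  //   --conf spark.scheduler.mode=FAIR \
  //   --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml
  // where fairscheduler.xml defines pools such as "ingest" with their own
  // weight and minShare settings
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Milliseconds, StreamingContext}

  object RawIngest {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("raw-ingest")
      val ssc  = new StreamingContext(conf, Milliseconds(500))

      val lines = ssc.socketTextStream("localhost", 9999) // stand-in source

      lines.foreachRDD { rdd =>
        // setting the local property on the thread that submits this batch's
        // jobs routes them into the named pool
        rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "ingest")
        rdd.foreachPartition(_.foreach(println)) // stand-in sink
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }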

@cody:  curious about neelesh's question, as well.  does the Kafka Direct 
Stream API treat each Kafka Topic Partition separately in terms of parallel 
retrieval?

more context:  within a Kafka Topic partition, Kafka guarantees order, but not 
total ordering across partitions.  this is normal and expected.

so I assume the Kafka Direct Streaming connector can retrieve (and
recover/retry) from separate partitions in parallel and still maintain the
ordering guarantees offered by Kafka.

if this is true, then I'd suggest @neelesh create more partitions within the 
Kafka Topic to improve parallelism - same as any distributed, partitioned data 
processing engine including spark.

if this is not true, is there a technical limitation to prevent this 
parallelism within the connector?  



Re: Kafka - streaming from multiple topics

2015-12-20 Thread Neelesh
@Chris,
There is a 1:1 mapping between spark partitions and kafka partitions out of
the box. One can of course break it by repartitioning to add more parallelism,
but that has its own issues around consumer offset management - when do I
commit the offsets, for example. While it's trivial to increase the number of
Kafka partitions to achieve higher parallelism, that is inherently static and
cannot solve the problem of traffic bursts. If it were possible to repartition
data from a single kafka partition and still handle consumer offset management
once all the child partitions that make up the kafka partition succeeded or
failed, it would solve the problem.

@Cody, would love to hear your thoughts around this.
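
For reference, a rough sketch of the coarse-grained version that is possible
today, against the Spark 1.x direct API - the broker list and topic are
examples, and saveOffsets is a hypothetical helper:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  // assumes an existing StreamingContext `ssc`
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  directStream.foreachRDD { rdd =>
    // capture offsets from the direct stream's own RDD *before* repartitioning;
    // the shuffle destroys the 1:1 spark-partition-to-kafka-partition mapping
    val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    val widened = rdd.repartition(32) // extra parallelism; 32 is arbitrary
    widened.foreachPartition(_.foreach(println)) // stand-in for real processing

    // coarse-grained commit: only after the entire batch has succeeded -
    // exactly the limitation described above
    saveOffsets(offsets) // hypothetical helper writing to ZK/Redis/a DB
  }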


Re: Kafka - streaming from multiple topics

2015-12-19 Thread Neelesh
A related issue - when I put multiple topics in a single stream, the
processing delay is as bad as the slowest of the tasks created. Even though
the topics are unrelated to each other, the RDD at time "t1" has to wait until
the RDD at "t0" is fully executed, even if most cores are idling and just one
task is still running while the rest have completed. Effectively, a lightly
loaded topic gets the worst deal because of a heavily loaded topic.

Is my understanding correct?





Re: Kafka - streaming from multiple topics

2015-12-17 Thread Cody Koeninger
Using spark.streaming.concurrentJobs for this probably isn't a good idea,
as it allows the next batch to start processing before the current one is
finished, which may have unintended consequences.

Why can't you use a single stream with all the topics you care about, or
multiple streams if you're e.g. joining them?





Re: Kafka - streaming from multiple topics

2015-12-17 Thread Jean-Pierre OCALAN
Hi Cody,

First of all, thanks for the note about spark.streaming.concurrentJobs. I
guess this is why it's not mentioned in the actual spark streaming doc.
Since those 3 topics contain completely different data on which I need to
apply different kinds of transformations, I am not sure joining them would
be really efficient, unless you know something that I don't.

As I really don't need any interaction between those streams, I think I
might end up running 3 different streaming apps instead of one.

Thanks again!



-- 
jean-pierre ocalan
jpoca...@gmail.com


Re: Kafka - streaming from multiple topics

2015-12-17 Thread Cody Koeninger
You could stick them all in a single stream, and do mapPartitions, then
switch on the topic for that partition.  It's probably cleaner to do
separate jobs, just depends on how you want to organize your code.
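
A minimal sketch of that pattern against the Spark 1.x direct API (assumes an
existing StreamingContext `ssc`; the per-topic handlers are placeholders). It
relies on partition i of the direct stream's RDD lining up with offsetRanges(i):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val topics = Set("retarget", "datapair", "logs")

  // placeholder per-topic handlers
  def processRetarget(kv: (String, String)): String = s"retarget:${kv._2}"
  def processDatapair(kv: (String, String)): String = s"datapair:${kv._2}"
  def processOther(kv: (String, String)): String = s"other:${kv._2}"

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  val processed = stream.transform { rdd =>
    // offsetRanges(i) carries the topic name for partition i of this RDD
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.mapPartitionsWithIndex { (i, iter) =>
      ranges(i).topic match {
        case "retarget" => iter.map(processRetarget)
        case "datapair" => iter.map(processDatapair)
        case _          => iter.map(processOther)
      }
    }
  }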



Re: Kafka - streaming from multiple topics

2015-12-16 Thread jpocalan
Nevermind, I found the answer to my questions.
The following spark configuration property will allow you to process
multiple KafkaDirectStream in parallel:
--conf spark.streaming.concurrentJobs=








Re: Kafka - streaming from multiple topics

2015-12-03 Thread Cody Koeninger
Yeah, that general plan should work, but might be a little awkward for
adding topicPartitions after the fact (i.e. when you have stored offsets
for some, but not all, of your topicPartitions).

Personally I just query kafka for the starting offsets if they don't exist
in the DB, using the methods in KafkaCluster.scala.

Yes, if you don't include a starting offset for a particular partition, it
will be ignored.
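
Roughly like the following, assuming the KafkaCluster API as it appears in
that file (the class is private[spark] in some releases, so you may have to
copy it into your project or put this helper in the same package; error
handling is elided):

  import kafka.common.TopicAndPartition
  import org.apache.spark.streaming.kafka.KafkaCluster

  // stored offsets from the DB win; any partition missing from the DB falls
  // back to the latest offset currently in Kafka
  def startingOffsets(brokers: String,
                      topics: Set[String],
                      fromDb: Map[TopicAndPartition, Long]): Map[TopicAndPartition, Long] = {
    val kc = new KafkaCluster(Map("metadata.broker.list" -> brokers))
    val partitions = kc.getPartitions(topics).right.get          // all current partitions
    val latest = kc.getLatestLeaderOffsets(partitions).right.get // leader offset per partition
    latest.map { case (tp, leaderOffset) =>
      tp -> fromDb.getOrElse(tp, leaderOffset.offset)
    }
  }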



Re: Kafka - streaming from multiple topics

2015-12-03 Thread Dan Dutrow
Hey Cody, I'm convinced that I'm not going to get the functionality I want
without using the Direct Stream API.

I'm now looking through
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md#exactly-once-using-transactional-writes
where you say "For the very first time the job is run, the table can be
pre-loaded with appropriate starting offsets."

Could you provide some guidance on how to determine valid starting offsets
the very first time, particularly in my case where I have 10+ topics in
multiple different deployment environments with an unknown and potentially
dynamic number of partitions per topic per environment?

I'd be happy if I could initialize all consumers to *auto.offset.reset =
"largest"*, record the partitions and offsets as they flow through spark,
and then use those discovered offsets from thereon out.

I'm thinking I can probably just do some if/else logic: use the basic
createDirectStream, or the more advanced
createDirectStream(...fromOffsets...) if offsets for my topic name exist in
the database. Any reason that wouldn't work? If I don't include an offset
range for a particular partition, will that partition be ignored?
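
A sketch of that if/else against the Spark 1.x API - loadOffsetsFromDb, ssc,
kafkaParams and topics are assumed to exist; as a bonus, the messageHandler
variant puts the topic name into the tuple, which sidesteps the null-key
problem mentioned earlier in this thread:

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val saved: Map[TopicAndPartition, Long] = loadOffsetsFromDb() // hypothetical helper

  val stream =
    if (saved.isEmpty) {
      // first run: no stored offsets, let auto.offset.reset pick the start
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams + ("auto.offset.reset" -> "largest"), topics)
    } else {
      // later runs: resume from stored offsets; partitions absent from
      // `saved` are ignored (see Cody's reply above)
      val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
        ssc, kafkaParams, saved, handler)
    }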




--
Dan ✆


Re: Kafka - streaming from multiple topics

2015-12-02 Thread Cody Koeninger
Use the direct stream.  You can put multiple topics in a single stream, and
differentiate them on a per-partition basis using the offset range.



Re: Kafka - streaming from multiple topics

2015-12-02 Thread dutrow
My need is similar; I have 10+ topics and don't want to dedicate 10 cores to
processing all of them. Like yourself and others, the (String, String) pairs
that come out of the DStream have (null, StringData...) values instead of
(topic name, StringData...).

Did anyone ever find a way around this issue?






Re: Kafka - streaming from multiple topics

2015-12-02 Thread Dan Dutrow
Sigh... I want to use the direct stream and have recently brought in Redis
to persist the offsets, but I really like and need to have real-time metrics
on the GUI, so I'm hoping to have the Direct and Receiver streams both working.
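
A rough sketch of the Redis side, assuming the Jedis client and an existing
direct stream; the key layout is made up:

  import org.apache.spark.streaming.kafka.HasOffsetRanges
  import redis.clients.jedis.Jedis

  directStream.foreachRDD { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process the batch first; persist offsets only after it succeeds ...
    val jedis = new Jedis("localhost")
    try {
      ranges.foreach { r =>
        // one hash per topic: field = partition, value = next offset to read
        jedis.hset(s"kafka:offsets:${r.topic}", r.partition.toString, r.untilOffset.toString)
      }
    } finally {
      jedis.close()
    }
  }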

On Wed, Dec 2, 2015 at 3:17 PM Cody Koeninger <c...@koeninger.org> wrote:

> Use the direct stream.  You can put multiple topics in a single stream,
> and differentiate them on a per-partition basis using the offset range.
>
> On Wed, Dec 2, 2015 at 2:13 PM, dutrow <dan.dut...@gmail.com> wrote:
>
>> I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388
>>
>> It was marked as invalid.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25550.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>> --
Dan ✆


Re: Kafka - streaming from multiple topics

2015-12-02 Thread dutrow
I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388

It was marked as invalid.






Re: Kafka - streaming from multiple topics

2014-08-13 Thread maddenpj
Can you link to the JIRA issue? I'm having to work around this bug and it
would be nice to monitor the JIRA so I can change my code when it's fixed. 






Re: Kafka - streaming from multiple topics

2014-07-07 Thread Sergey Malov
I opened a JIRA issue with Spark, as an improvement though, not as a bug.
Hopefully someone there will notice it.




Re: Kafka - streaming from multiple topics

2014-07-03 Thread Sergey Malov
That's an obvious workaround, yes, thank you Tobias.
However, I'm prototyping a substitution for a real batch process, where I'd
have to create six streams (and possibly more). It could be a bit messy.
On the other hand, under the hood the KafkaInputDStream created by this
KafkaUtils call invokes ConsumerConnector.createMessageStreams, which returns
a Map[String, List[KafkaStream]] keyed by topic. It is, however, not exposed.
So Kafka does provide the capability of creating multiple streams based on
topic, but Spark doesn't use it, which is unfortunate.

Sergey

From: Tobias Pfeiffer <t...@preferred.jp>
Date: Wednesday, July 2, 2014 at 9:54 PM
Subject: Re: Kafka - streaming from multiple topics

Sergey,

you might actually consider using two streams, like

  val stream1 = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("retarget" -> 2))
  val stream2 = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("datapair" -> 2))

to achieve what you want. This has the additional advantage that there are
actually two connections to Kafka and data is possibly received on different
cluster nodes, already increasing parallelism in an early stage of processing.

Tobias







Re: Kafka - streaming from multiple topics

2014-07-03 Thread Tobias Pfeiffer
Sergey,


On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov <sma...@collective.com> wrote:

> On the other hand, under the hood the KafkaInputDStream created by this
> KafkaUtils call invokes ConsumerConnector.createMessageStreams, which
> returns a Map[String, List[KafkaStream]] keyed by topic. It is, however,
> not exposed.

I wonder if this is a bug. After all, KafkaUtils.createStream() returns a
DStream[(String, String)], which pretty much looks like it should be a
(topic -> message) mapping. However, for me, the key is always null. Maybe
you could consider filing a bug/wishlist report?

Tobias


Kafka - streaming from multiple topics

2014-07-02 Thread Sergey Malov
Hi,
I would like to set up streaming from a Kafka cluster, reading multiple topics
and then processing each of them differently.
So, I'd create a stream

  val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("retarget" -> 2, "datapair" -> 2))

And then based on whether it's the "retarget" topic or "datapair", set up a
different filter function, map function, reduce function, etc. Is it possible?
I'd assume it should be, since ConsumerConnector can create a map of
KafkaStreams keyed on topic, but I can't find that it would be visible to Spark.

Thank you,

Sergey Malov