Re: Kafka - streaming from multiple topics

2014-07-07 Thread Sergey Malov
I opened a JIRA issue with Spark, as an improvement though, not as a bug.
Hopefully, someone there will notice it.

From: Tobias Pfeiffer <t...@preferred.jp>
Reply-To: user@spark.apache.org
Date: Thursday, July 3, 2014 at 9:41 PM
To: user@spark.apache.org
Subject: Re: Kafka - streaming from multiple topics

Sergey,


On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov <sma...@collective.com> wrote:
On the other hand, under the hood the KafkaInputDStream which is created by
this KafkaUtils call calls ConsumerConnector.createMessageStreams, which
returns a Map[String, List[KafkaStream]] keyed by topic. It is, however, not
exposed.

I wonder if this is a bug. After all, KafkaUtils.createStream() returns a
DStream[(String, String)], which pretty much looks like it should be a
(topic -> message) mapping. However, for me, the key is always null. Maybe you
could consider filing a bug/wishlist report?
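If the key really did carry the topic name (which, per the above, it currently
does not), splitting records per topic would be a one-liner. A minimal
hypothetical sketch in plain Scala, where an ordinary Seq stands in for one
micro-batch of a DStream[(String, String)]; all names here are illustrative:

```scala
// Hypothetical sketch: group (topic, message) pairs by topic.
// A plain Seq stands in for one batch of a DStream[(String, String)]
// whose key is assumed to carry the topic name.
object TopicSplit {
  def splitByTopic(batch: Seq[(String, String)]): Map[String, Seq[String]] =
    batch.groupBy { case (topic, _) => topic }
         .map { case (topic, pairs) => topic -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val batch = Seq(("retarget", "click-1"), ("datapair", "k=v"),
                    ("retarget", "click-2"))
    println(TopicSplit.splitByTopic(batch))
  }
}
```

With a populated key, the same grouping (or a per-topic `filter`) would let a
single stream feed several distinct pipelines.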

Tobias



Re: Kafka - streaming from multiple topics

2014-07-03 Thread Sergey Malov
That’s an obvious workaround, yes, thank you Tobias.
However, I’m prototyping a replacement for a real batch process, where I’d have
to create six streams (and possibly more). It could get a bit messy.
On the other hand, under the hood the KafkaInputDStream which is created by
this KafkaUtils call calls ConsumerConnector.createMessageStreams, which
returns a Map[String, List[KafkaStream]] keyed by topic. It is, however, not
exposed.
So Kafka does provide the capability of creating multiple streams based on
topic, but Spark doesn’t use it, which is unfortunate.
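Until the per-topic map is exposed, one way to keep six-plus topics manageable
is to generate the streams programmatically instead of writing each call out by
hand. A hedged, untested sketch (it assumes a running StreamingContext `ssc`,
the spark-streaming-kafka artifact on the classpath, and reuses the
"localhost:2181" / "logs" values from the examples in this thread):

```scala
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// One createStream call per topic, collected into a Map keyed by topic,
// so each topic can get its own filter/map/reduce pipeline.
val topics = Seq("retarget", "datapair") // extend to all six topics
val streams: Map[String, DStream[(String, String)]] =
  topics.map { t =>
    t -> KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map(t -> 2))
  }.toMap

// e.g. attach a topic-specific pipeline:
// streams("retarget").map(_._2).filter(...)
```

This keeps the code size constant as topics are added, at the cost of one
receiver per topic.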

Sergey

From: Tobias Pfeiffer <t...@preferred.jp>
Reply-To: user@spark.apache.org
Date: Wednesday, July 2, 2014 at 9:54 PM
To: user@spark.apache.org
Subject: Re: Kafka - streaming from multiple topics

Sergey,

you might actually consider using two streams, like

  val stream1 = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
    Map("retarget" -> 2))
  val stream2 = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
    Map("datapair" -> 2))

to achieve what you want. This has the additional advantage that there are
actually two connections to Kafka and data is possibly received on different
cluster nodes, already increasing parallelism in an early stage of processing.
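Since each stream is an independent DStream, each can then get its own
processing chain; a brief sketch (the transformations are made-up placeholders,
not taken from the thread):

```scala
// Different pipelines per topic, one per stream:
val retargetCounts = stream1.map(_._2).count()         // e.g. events per batch
val datapairFields = stream2.map(_._2.split("=", 2))   // e.g. parse key=value
```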

Tobias



On Thu, Jul 3, 2014 at 6:47 AM, Sergey Malov <sma...@collective.com> wrote:
Hi,
I would like to set up streaming from a Kafka cluster, reading multiple topics
and then processing each of them differently.
So, I’d create a stream

  val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
    Map("retarget" -> 2, "datapair" -> 2))

And then, based on whether it’s the “retarget” topic or “datapair”, set up a
different filter function, map function, reduce function, etc. Is it possible?
I’d assume it should be, since ConsumerConnector can return a map of
KafkaStreams keyed on topic, but I can’t find that it is visible to Spark.

Thank you,

Sergey Malov




Kafka - streaming from multiple topics

2014-07-02 Thread Sergey Malov
Hi,
I would like to set up streaming from a Kafka cluster, reading multiple topics
and then processing each of them differently.
So, I’d create a stream

  val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
    Map("retarget" -> 2, "datapair" -> 2))

And then, based on whether it’s the “retarget” topic or “datapair”, set up a
different filter function, map function, reduce function, etc. Is it possible?
I’d assume it should be, since ConsumerConnector can return a map of
KafkaStreams keyed on topic, but I can’t find that it is visible to Spark.

Thank you,

Sergey Malov