Hi all,

Today I was contributing my (limited) Scala knowledge to some local folks
that are working on setting up a Spark Streaming cluster with Kafka. Bear
with me, I'm fairly new to Spark.

One of the issues they are facing is that they observe all Kafka traffic
going to only one node of the cluster. The reason may lie in this note
[0].

Further on, I was looking into KafkaInputDStream and I stumbled on the way
the dataBlocks are being populated: there are n threads running n message
handlers, each listening to a specific Kafka topic stream. They all appear
to share the same blockGenerator reference and contribute data through the
BlockGenerator.+= method [1], which performs no synchronization.

Unless I missed something, it looks like we have a concurrency issue when
multiple Kafka message handlers add data to the underlying ArrayBuffer in
the BlockGenerator.
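To make the concern concrete, here is a small standalone sketch (not Spark
code; AppendDemo, runAppenders, lockedAppend and unlockedAppend are names I
made up) contrasting unsynchronized concurrent += on a shared ArrayBuffer,
which is what BlockGenerator.+= looks like to me, with the same workload
guarded by a monitor:

```scala
import scala.collection.mutable.ArrayBuffer

object AppendDemo {
  // Run `n` threads, each invoking `append` `per` times, and wait for all.
  def runAppenders(n: Int, per: Int)(append: Int => Unit): Unit = {
    val ts = (1 to n).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = (1 to per).foreach(append)
      })
    }
    ts.foreach(_.start())
    ts.foreach(_.join())
  }

  // Locked variant: every append goes through the buffer's monitor,
  // so no updates are lost and the final size is deterministic.
  def lockedAppend(threads: Int, perThread: Int): Int = {
    val buf = new ArrayBuffer[Int]()
    runAppenders(threads, perThread)(i => buf.synchronized { buf += i })
    buf.size
  }

  // Unlocked variant, analogous to how I read BlockGenerator.+=:
  // concurrent += on a shared ArrayBuffer may drop elements or throw
  // (ArrayBuffer is not thread-safe), so the result is nondeterministic.
  def unlockedAppend(threads: Int, perThread: Int): Int = {
    val buf = new ArrayBuffer[Int]()
    runAppenders(threads, perThread) { i =>
      try buf += i catch { case _: Throwable => () } // swallow races for the demo
    }
    buf.size
  }
}
```

Running lockedAppend(8, 10000) always yields 80000, while unlockedAppend
with the same arguments typically comes up short on a multicore box.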

I continued my journey through the system and ended at NetworkInputTracker
[2]. It mixes two concurrency models: the Actor-based
NetworkInputStreamActor and the 'raw' thread-based ReceiverExecutor, where
the NetworkInputTracker safely shares a queue with the actor at the expense
of some locking.

From a newbie perspective, it looks like using a single concurrency model
(actors?) would provide a uniform programming paradigm and simplify some of
the code. But it's evident that the author, with knowledge of both models,
has made specific choices for a reason.
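For what I mean by a single model: the essence of the actor approach is
thread confinement, where only one thread ever touches the mutable state and
everyone else sends it messages. A minimal sketch of that idea (ConfinedBuffer
is a hypothetical name, and I'm using a single-threaded executor as a stand-in
for an actor mailbox rather than any Spark or Akka API):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Instead of n handler threads mutating a shared buffer under a lock,
// confine the buffer to one thread and send it append "messages". A
// single-threaded executor serializes them, which is essentially what
// an actor mailbox does.
object ConfinedBuffer {
  private val mailbox = Executors.newSingleThreadExecutor()
  private val buf = new ArrayBuffer[Int]() // only touched by the mailbox thread

  def append(i: Int): Unit = mailbox.execute(new Runnable {
    def run(): Unit = buf += i
  })

  def drain(): List[Int] = {
    // Let all pending messages run, then read the result. awaitTermination
    // gives the happens-before edge that makes this read safe.
    mailbox.shutdown()
    mailbox.awaitTermination(10, TimeUnit.SECONDS)
    buf.toList
  }
}
```

No locks appear in user code; correctness comes from the fact that a single
thread owns the buffer, which is the property I'd hope a uniform actor-based
design would give the block-generation path too.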

Care to comment about the rationale behind mixing these concurrency models?
Any suggestions on what should stay and what might be candidate for
refactoring?

Thanks!

-kr, Gerard.

[0]
https://github.com/maasg/spark/blob/master/streaming/src/main/scala/spark/streaming/NetworkInputTracker.scala#L161
[1]
https://github.com/maasg/spark/blob/master/streaming/src/main/scala/spark/streaming/dstream/KafkaInputDStream.scala#L119
[2]
https://github.com/maasg/spark/blob/master/streaming/src/main/scala/spark/streaming/NetworkInputTracker.scala
