Hi all,

Today I was contributing my (limited) Scala knowledge to some local folks who are setting up a Spark Streaming cluster with Kafka. Bear with me, I'm fairly new to Spark.
One of the issues they are facing is that they observe all Kafka traffic going to only one node of the cluster. The reason may lie behind this note [0].

Further on, I was looking into KafkaInputDStream and I stumbled on the way the dataBlocks are being populated: there are n threads running n message handlers, each listening to a specific Kafka topic stream. They all seem to share the same reference to the blockGenerator and contribute data using the BlockGenerator.+= method [1], which does not offer any synchronization. Unless I missed some part, it looks like we might have a concurrency issue when multiple Kafka message handlers add data to the underlying ArrayBuffer in the BlockGenerator (see the toy repro in the PS below).

I continued my journey through the system and ended at NetworkInputTracker [2]. It mixes two concurrency models: the actor-based NetworkInputStreamActor and the 'raw' thread-based ReceiverExecutor. The NetworkInputTracker does safely share a queue with the actor, at the expense of some locking.

From a newbie perspective, it looks like using a single concurrency model (actors?) would provide a uniform programming paradigm and simplify some of the code (rough sketch of what I mean in the PPS). But it's evident that the author, with knowledge of both models, made specific choices for a reason. Care to comment on the rationale behind mixing these concurrency models? Any suggestions on what should stay and what might be a candidate for refactoring?

Thanks!
-kr, Gerard.

[0] https://github.com/maasg/spark/blob/master/streaming/src/main/scala/spark/streaming/NetworkInputTracker.scala#L161
[1] https://github.com/maasg/spark/blob/master/streaming/src/main/scala/spark/streaming/dstream/KafkaInputDStream.scala#L119
[2] https://github.com/maasg/spark/blob/master/streaming/src/main/scala/spark/streaming/NetworkInputTracker.scala
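PS: Here is a toy repro of the kind of race I mean, outside of Spark. This is just my own illustration (the names and counts are made up), assuming n "handler" threads appending to one unsynchronized ArrayBuffer, like the handlers sharing the blockGenerator in [1]:

    import scala.collection.mutable.ArrayBuffer

    object BufferRace {
      def main(args: Array[String]) {
        val buffer = new ArrayBuffer[Int]()
        // Four "handler" threads, all appending to the same buffer
        // without any synchronization, as in BlockGenerator.+=.
        val threads = (1 to 4).map { _ =>
          new Thread(new Runnable {
            def run() {
              for (i <- 1 to 10000) buffer += i
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
        // If += were thread-safe the count would always be 40000;
        // with an unsynchronized ArrayBuffer you can see lost updates,
        // or even an ArrayIndexOutOfBoundsException during an internal
        // array resize.
        println("expected 40000, got " + buffer.size)
      }
    }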
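PPS: And a rough sketch of what I mean by a single concurrency model, assuming Akka (which Spark already depends on). The names here (BlockBuilder, AddData) are mine, not Spark's; the point is that only the actor ever touches the buffer, one message at a time, so no locking is needed:

    import akka.actor._
    import scala.collection.mutable.ArrayBuffer

    case class AddData(data: Any)

    // Owns the buffer; processes one message at a time, so the
    // unsynchronized += is safe here.
    class BlockBuilder extends Actor {
      private val currentBuffer = new ArrayBuffer[Any]()
      def receive = {
        case AddData(data) => currentBuffer += data
      }
    }

    object Demo extends App {
      val system = ActorSystem("demo")
      val blockBuilder = system.actorOf(Props[BlockBuilder], name = "blockBuilder")
      // Each Kafka message handler would then do something like
      //   blockBuilder ! AddData(msgAndMetadata.message)
      // instead of calling blockGenerator += ... from its own thread.
      blockBuilder ! AddData("hello")
      system.shutdown()
    }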