How do I repartition a Spark DStream Kafka ConsumerRecord RDD? I am getting unevenly sized Kafka topics, and we want to repartition the input RDD based on some custom logic, roughly along the lines of the sketch below.
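For context, this is the kind of call I have in mind (the helper name and numPartitions are just placeholders, not our actual code). As I understand it, repartition() forces a shuffle, so every element of the RDD has to be serializable:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.rdd.RDD

    // Placeholder sketch: repartition an RDD of Kafka ConsumerRecords.
    // numPartitions stands in for whatever count our logic would pick.
    def repartitionRecords(rdd: RDD[ConsumerRecord[String, String]],
                           numPartitions: Int): RDD[ConsumerRecord[String, String]] = {
      // The shuffle behind repartition() is what tries to serialize each record.
      rdd.repartition(numPartitions)
    }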
But when I try to apply the repartition I get an "object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord)" error. I found the following workaround:
 
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
Call rdd.forEachPartition and create the NotSerializable object in there like this:

    rdd.forEachPartition(iter -> {
      NotSerializable notSerializable = new NotSerializable();
      // ...Now process iter
    });
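In Scala, which is what my job uses, I read that workaround as something like this (sketch only; NotSerializable stands in for any non-serializable helper that gets created inside the partition instead of being shipped from the driver):

    // Build the non-serializable helper inside foreachPartition, so it is
    // constructed on the executor and never needs to be serialized.
    rdd.foreachPartition { iter =>
      val notSerializable = new NotSerializable()
      iter.foreach { record =>
        // ...process each record with notSerializable here
      }
    }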
I applied it like this:
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParam)
    )

    stream.foreachRDD { rdd =>
      val repartitionRDD = flow.repartitionRDD(rdd, 1)
      println("&&&&&&&&&&&&&& repartitionRDD " + repartitionRDD.count())

      // Copy the ConsumerRecords into a plain collection per partition,
      // following the workaround above, then repartition the result.
      val modifiedRDD = rdd.mapPartitions { iter =>
        val customerRecords = scala.collection.mutable.ListBuffer[ConsumerRecord[String, String]]()
        while (iter.hasNext) {
          val consumerRecord: ConsumerRecord[String, String] = iter.next()
          customerRecords += consumerRecord
        }
        customerRecords.iterator
      }

      val r = modifiedRDD.repartition(1)
      println("************* after repartition " + r.count())
    }
But I still get the same object not serializable error. Any help is greatly appreciated.
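In case it clarifies what I am after, here is the direction I was considering instead (just a sketch, and it assumes the downstream logic only needs the record key and value, which may not hold for us): extract a serializable payload from each ConsumerRecord before the shuffle, so repartition never has to serialize the records themselves.

    stream.foreachRDD { rdd =>
      // Keep only the key and value (plain Strings, hence serializable)
      // before any shuffle, so repartition never touches ConsumerRecord.
      val keyValueRDD = rdd.mapPartitions { iter =>
        iter.map(record => (record.key(), record.value()))
      }
      val repartitioned = keyValueRDD.repartition(1)
      println("after repartition " + repartitioned.count())
    }

Is that the right way to go, or is there a way to repartition the ConsumerRecord RDD directly?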
