Please post a minimal complete code example of what you are talking about On Thu, Dec 15, 2016 at 6:00 PM, Michael Nguyen <nguyen.m.mich...@gmail.com> wrote: > I have the following sequence of Spark Java API calls (Spark 2.0.2): > > Kafka stream that is processed via a map function, which returns the string > value from tuple2._2() for JavaDStream as in > > return tuple2._2(); > > The returned JavaDStream is then processed by foreachPartition, which is > wrapped by foreachRDD. > > foreachPartition's call function does Iterator on the RDD as in > inputRDD.next (); > > When data is received, step 1 is executed, which is correct. However, > inputRDD.next () in step 3 makes a duplicate call to the map function in > step 1. So that map function is called twice for every message: > > - the first time when the message is received from the Kafka stream, and > > - the second time when Iterator inputParams.next () is invoked from > foreachPartition's call function. > > I also tried transforming the data in the map function as in > > public TestTransformedClass call(Tuple2<String, String> tuple2) for step 1 > > public void call(Iterator<TestTransformedClass> inputParams) for step 3 > > and the same issue occurs. So this issue occurs, no matter whether this > sequence of Spark API calls involves data transformation or not. > > Questions: > > Since the message was already processed in step 1, why does inputRDD.next () > in step 3 makes a duplicate call to the map function in step 1 ? > > How do I fix it to avoid duplicate invocation for every message ? > > Thanks.
--------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org