The API signature would of course be more general (sorry!): Given an RDD of elements of type T, an initial state of type S, and a map function (S, T) -> (S, U), return an RDD of Us obtained by applying the map function in sequence, updating the state as elements are mapped.
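To make the proposed semantics concrete, here is a minimal sketch on a plain Scala collection (the name `mapWithState` and its signature follow the proposal above; this is not an existing Spark API):

```scala
// Hypothetical semantics of the proposed mapWithState, illustrated on a
// plain Seq rather than an RDD. The state is threaded through the map
// from left to right, and each element produces one output.
def mapWithState[T, S, U](elems: Seq[T], initial: S)(f: (S, T) => (S, U)): Seq[U] = {
  var state = initial
  elems.map { t =>
    val (newState, u) = f(state, t)
    state = newState
    u
  }
}

// zipWithIndex as a special case: the state is the next index to assign.
val indexed = mapWithState(Seq("a", "b", "c"), 0L) { (i, t) => (i + 1, (t, i)) }
// indexed == Seq(("a", 0L), ("b", 1L), ("c", 2L))
```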
With this formulation, zipWithIndex would be a special case of mapWithState (so it could be refactored to be expressed as such).

Antonin

On 24/05/2020 10:58, Antonin Delpeuch (lists) wrote:
> Hi,
>
> Spark Streaming has a `mapWithState` API to run a map on a stream while
> maintaining a state as elements are read.
>
> The core RDD API does not seem to have anything similar. Given a RDD of
> elements of type T, an initial state of type S and a map function (S,T)
> -> (S,T), return an RDD of Ts obtained by applying the map function in
> sequence, updating the state as elements are mapped.
>
> There seems to be some interest on the user mailing list for this:
> http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-td10968.html
> The solution suggested there is to use mapPartitions, but that does not
> make it possible to share the state from one partition to another.
>
> I am thinking a proper mapWithState could be implemented with a
> dedicated RDD along the lines of what zipWithIndex does. When the RDD is
> created, run a job to compute the state at partition boundaries. Then,
> store those states in the partitions returned, which lets you iterate
> these partitions independently, starting from the stored state.
>
> Obviously I can do this in my own project with a custom RDD. Would there
> be appetite to have this in Spark itself?
>
> Cheers,
> Antonin
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
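The partition-boundary idea quoted above (run a job to compute the state at each partition boundary, then iterate partitions independently) can be sketched on plain Scala collections, with each inner Seq standing in for one RDD partition. The function name and the simulation of partitions here are illustrative assumptions, not Spark code:

```scala
// Simulated zipWithIndex-style implementation: first pass folds each
// "partition" to find the state at every partition boundary (this is the
// analogue of the extra job zipWithIndex runs); second pass maps each
// partition independently, starting from its stored boundary state.
def mapWithStateByPartition[T, S, U](partitions: Seq[Seq[T]], initial: S)(
    f: (S, T) => (S, U)): Seq[Seq[U]] = {
  // State at the start of each partition; scanLeft also yields the final
  // state, which zip below discards.
  val startStates: Seq[S] =
    partitions.scanLeft(initial)((s, part) => part.foldLeft(s)((acc, t) => f(acc, t)._1))
  partitions.zip(startStates).map { case (part, s0) =>
    var state = s0
    part.map { t =>
      val (newState, u) = f(state, t)
      state = newState
      u
    }
  }
}

// Running sum across three "partitions": the state (the sum so far)
// flows across partition boundaries.
val result = mapWithStateByPartition(Seq(Seq(1, 2), Seq(3), Seq(4, 5)), 0) {
  (sum, x) => (sum + x, sum + x)
}
// result == Seq(Seq(1, 3), Seq(6), Seq(10, 15))
```

Note the trade-off this makes visible: the first pass applies f to every element once just to obtain the boundary states, so f is evaluated twice per element overall, exactly as the dedicated-RDD approach would pay one extra job before the real computation.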