The API signature should of course be more general than the one I gave
below (sorry!):

Given an RDD of elements of type T, an initial state of type S and a map
function (S,T) -> (S,U), return an RDD of Us obtained by applying the map
function to the elements in sequence, updating the state as they are mapped.
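
In Scala terms, the signature I have in mind would be roughly the
following (a sketch; none of the names are final):

    def mapWithState[S, U](initialState: S)(f: (S, T) => (S, U)): RDD[U]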

With this formulation, zipWithIndex would be a special case of
mapWithState (so it could be refactored to be expressed as such).
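
For instance (hypothetical code, in terms of the signature above):

    // zipWithIndex expressed as a stateful map: the state is the number
    // of elements seen so far, and each element gets paired with it.
    rdd.mapWithState(0L) { (count, elem) => (count + 1, (elem, count)) }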

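For reference, here is a rough sketch of that construction (all names
hypothetical, modeled on ZippedWithIndexRDD; see my message below for the
idea). One caveat: unlike zipWithIndex, where the per-partition counts can
be computed in a single parallel job, the general (S, T) => (S, U)
signature forces the boundary states to be computed sequentially, one job
per partition.

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Sketch only: an RDD that applies f in element order, threading the
    // state across partition boundaries.
    class MapWithStateRDD[T: ClassTag, S: ClassTag, U: ClassTag](
        prev: RDD[T],
        initialState: S,
        f: (S, T) => (S, U))
      extends RDD[U](prev) {

      // Computed eagerly when the RDD is created: the state at the start
      // of each partition, obtained by folding f over the partitions in
      // order (the end state of the last partition is never needed).
      private val startStates: Array[S] = {
        val g = f  // local copy, to avoid capturing `this` in the closure
        val numParts = prev.partitions.length
        val states = Array.ofDim[S](numParts)
        var state = initialState
        for (i <- 0 until numParts) {
          states(i) = state
          if (i < numParts - 1) {
            val start = state
            state = prev.context.runJob(
              prev,
              (it: Iterator[T]) => it.foldLeft(start)((s, t) => g(s, t)._1),
              Seq(i)).head
          }
        }
        states
      }

      override def getPartitions: Array[Partition] = prev.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[U] = {
        // Resume from the precomputed state for this partition and thread
        // it through the elements, emitting the mapped values.
        var state = startStates(split.index)
        prev.iterator(split, context).map { t =>
          val (next, u) = f(state, t)
          state = next
          u
        }
      }
    }
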
Antonin

On 24/05/2020 10:58, Antonin Delpeuch (lists) wrote:
> Hi,
> 
> Spark Streaming has a `mapWithState` API to run a map on a stream while
> maintaining a state as elements are read.
> 
> The core RDD API does not seem to have anything similar. Given an RDD of
> elements of type T, an initial state of type S and a map function (S,T)
> -> (S,T), return an RDD of Ts obtained by applying the map function in
> sequence, updating the state as elements are mapped.
> 
> There seems to be some interest on the user mailing list for this:
> http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-td10968.html
> The solution suggested there is to use mapPartitions, but that does not
> make it possible to share the state from one partition to another.
> 
> I am thinking a proper mapWithState could be implemented with a
> dedicated RDD along the lines of what zipWithIndex does. When the RDD is
> created, run a job to compute the states at partition boundaries. Then,
> store those states in the partitions returned, which lets you iterate
> over these partitions independently, starting from the stored state.
> 
> Obviously I can do this in my own project with a custom RDD. Would there
> be appetite to have this in Spark itself?
> 
> Cheers,
> Antonin