Hi all. I've been working on a similar problem. One straightforward (if suboptimal) solution is the following:
A.zipWithIndex().filter { case (_, i) => i >= range_start && i < range_end }

Then just put that in a for loop. I've found that this approach scales very well. As Matei said, another option is to define a custom partitioner and then use mapPartitions. Hope that helps!

On Thu, Dec 11, 2014 at 6:16 PM Imran Rashid <im...@therashids.com> wrote:

> Minor correction: I think you want iterator.grouped(10) for
> non-overlapping mini batches
>
> On Dec 11, 2014 1:37 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
>
>> You can just do mapPartitions on the whole RDD, and then call sliding()
>> on the iterator in each one to get a sliding window. One problem is that
>> you will not be able to slide "forward" into the next partition at
>> partition boundaries. If this matters to you, you need to do something
>> more complicated to get those, such as the repartitioning you mentioned
>> (where you map each record to the partition it should be in).
>>
>> Matei
>>
>>> On Dec 11, 2014, at 10:16 AM, ll <duy.huynh....@gmail.com> wrote:
>>>
>>> any advice/comment on this would be much appreciated.
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-best-way-to-implement-mini-batches-tp20264p20635.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
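
P.S. For concreteness, here is a minimal sketch of the zipWithIndex + filter loop described at the top. The RDD `A` and the helper name `miniBatches` are assumptions for illustration; I haven't tuned this against a live cluster:

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Sketch: pair each record with a stable index, then slice the index
// range into consecutive, non-overlapping mini batches.
def miniBatches[T: ClassTag](A: RDD[T], batchSize: Long): Iterator[RDD[T]] = {
  val indexed = A.zipWithIndex().cache() // cache so each batch's filter reuses it
  val total = indexed.count()
  (0L until total by batchSize).iterator.map { start =>
    indexed
      .filter { case (_, i) => i >= start && i < start + batchSize }
      .map(_._1) // drop the index, keeping only the original records
  }
}
```

Each pass still scans the whole indexed RDD, which is why this is suboptimal compared to a custom partitioner, but caching the indexed RDD keeps it workable in practice.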