Hi all. I've been working on a similar problem. One straightforward (if
suboptimal) solution is the following:

A.zipWithIndex().filter { case (_, i) => i >= rangeStart && i < rangeEnd }

(The shorthand filter(_._2 >= rangeStart && _._2 < rangeEnd) won't compile,
since each underscore introduces a separate parameter; hence the
pattern-matching form.) Then just put that in a loop, advancing the range on
each iteration. I've found that this approach scales very well.
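Concretely, here's a minimal sketch of that loop. The method name
miniBatches, the batchSize parameter, and the placeholder comment are mine,
not anything from the thread:

  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  def miniBatches[T: ClassTag](data: RDD[T], batchSize: Int): Unit = {
    // Pair each element with a stable global index; cache so the indexed
    // RDD isn't recomputed from scratch on every pass of the loop.
    val indexed = data.zipWithIndex().cache()
    val n = indexed.count()

    for (start <- 0L until n by batchSize) {
      val end = start + batchSize
      // Keep only the elements whose index falls in [start, end).
      val batch = indexed
        .filter { case (_, i) => i >= start && i < end }
        .map(_._1)
      // placeholder: run an action here (e.g. batch.count()) or hand the
      // batch to your training step
    }
  }

Each iteration rescans the cached RDD, which is the suboptimal part, but the
batches stay exact across partition boundaries.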

As Matei said, another option is to define a custom partitioner and then use
mapPartitions; see the sketch below.
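To make that concrete, here's a minimal sketch along the lines of what Matei
and Imran describe below: iterator.grouped for non-overlapping mini-batches,
iterator.sliding for overlapping windows. The method names are mine:

  import org.apache.spark.rdd.RDD

  // Non-overlapping mini-batches: chunk each partition's iterator into
  // groups of batchSize. Batches never cross partition boundaries.
  def partitionBatches[T](data: RDD[T], batchSize: Int): RDD[Seq[T]] =
    data.mapPartitions(_.grouped(batchSize))

  // Overlapping windows within each partition, advancing by step elements.
  def partitionWindows[T](data: RDD[T], size: Int, step: Int): RDD[Seq[T]] =
    data.mapPartitions(_.sliding(size, step))

If batches must line up exactly across partition boundaries, that's where
the custom partitioner comes in: repartition first so every record lands in
the partition that owns its batch, then apply the same grouped call. Hope
that helps!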


On Thu, Dec 11, 2014 at 6:16 PM Imran Rashid <im...@therashids.com> wrote:

> Minor correction: I think you want iterator.grouped(10) for
> non-overlapping mini-batches
> On Dec 11, 2014 1:37 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
>
>> You can just do mapPartitions on the whole RDD, and then call sliding()
>> on the iterator in each one to get a sliding window. One problem is that
>> you will not be able to slide "forward" into the next partition at
>> partition boundaries. If this matters to you, you need to do something
>> more complicated to get those, such as the repartitioning you mentioned
>> (where you map each record to the partition it should be in).
>>
>> Matei
>>
>> > On Dec 11, 2014, at 10:16 AM, ll <duy.huynh....@gmail.com> wrote:
>> >
>> > any advice/comment on this would be much appreciated.
>> >
>> >
>> >
