Hi,

Maybe you could use zipWithIndex and filter to skip the first elements. For example, starting from

scala> sc.parallelize(100 to 120, 4).zipWithIndex.collect
res12: Array[(Int, Long)] = Array((100,0), (101,1), (102,2), (103,3), (104,4), (105,5), (106,6), (107,7), (108,8), (109,9), (110,10), (111,11), (112,12), (113,13), (114,14), (115,15), (116,16), (117,17), (118,18), (119,19), (120,20))

we can get the first 3 elements starting from index 4 (counting from 0) as

scala> sc.parallelize(100 to 120, 4).zipWithIndex.filter(_._2 >= 4).take(3)
res14: Array[(Int, Long)] = Array((104,4), (105,5), (106,6))

Hope that helps

2015-09-02 8:52 GMT+02:00 Hemant Bhanawat <hemant9...@gmail.com>:

> I think rdd.toLocalIterator is what you want. But it will keep one
> partition's data in memory.
>
> On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera <niranda.per...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have a large set of data which would not fit into memory, so I want
>> to take n rows from the RDD starting from a particular index. For
>> example, take 1000 rows starting from index 1001.
>>
>> I see that there is a take(num: Int): Array[T] method in the RDD, but it
>> only returns the first n rows.
>>
>> The simplest use case for this requirement is, say, that I write a custom
>> relation provider with a custom relation extending InsertableRelation.
>>
>> Say I submit this query:
>> "insert into table abc select * from xyz sort by x asc"
>>
>> In my custom relation, I have implemented the
>> def insert(data: DataFrame, overwrite: Boolean): Unit
>> method. Here, since the data is large, I cannot call methods such as
>> DataFrame.collect(). Instead, I could do DataFrame.foreachPartition(...).
>> As you can see, the resultant DF from "select * from xyz sort by x asc"
>> is sorted, and if I run foreachPartition on that DF and implement the
>> insert method, this sorted order would be lost, since the insert
>> operation would be done in parallel in each partition.
>>
>> In order to handle this, my initial idea was to take rows from the RDD in
>> batches and do the insert operation, and for that I was looking for a
>> method to take n rows starting from a given index.
>>
>> Is there a better way to handle this with RDDs?
>>
>> Your assistance in this regard is highly appreciated.
>>
>> Cheers
>>
>> --
>> Niranda
>> @n1r44 <https://twitter.com/N1R44>
>> https://pythagoreanscript.wordpress.com/
>>
>
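For what it's worth, the batched-insert idea above can be sketched by combining zipWithIndex with index-range filters. This is only a rough outline, not a tested implementation: `insertBatch` is a hypothetical stand-in for whatever write logic the custom relation actually performs, and `sc` is an existing SparkContext.

```scala
// Sketch only: assumes a SparkContext `sc` and a hypothetical
// insertBatch(rows: Array[Int]) that performs the actual write.
val rdd = sc.parallelize(100 to 120, 4)

// Cache the indexed RDD so each batch doesn't recompute the indices
val indexed = rdd.zipWithIndex.cache()
val total = indexed.count()
val batchSize = 5L

var start = 0L
while (start < total) {
  val end = start + batchSize
  // Pull one index range to the driver; within the range the
  // original RDD order is preserved by the index
  val batch = indexed
    .filter { case (_, i) => i >= start && i < end }
    .map(_._1)
    .collect()
  insertBatch(batch) // hypothetical sink; replace with the real insert
  start = end
}
```

Note that each filter pass scans the whole RDD, so this trades repeated scans for bounded driver memory; rdd.toLocalIterator, as mentioned above, avoids the rescans at the cost of keeping one partition's data in memory.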