If it’s a flat binary file and each record is the same length (in bytes), you 
can use Spark’s binaryRecords method (defined on the SparkContext), which loads 
records from one or more large flat binary files into an RDD. Here’s an example 
in Python to show how it works:

> # write data from an array as raw float64 bytes
> from numpy import random
> dat = random.randn(100, 5)
> f = open('test.bin', 'wb')
> dat.tofile(f)
> f.close()

> # load the data back in, one record per row
> from numpy import frombuffer
> nvalues = 5                       # values per record (columns of dat)
> bytesize = 8                      # float64 is 8 bytes
> recordsize = nvalues * bytesize
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(v, 'float64'))

> # these should be equal
> parsed.first()
> dat[0,:]
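
If you want to sanity check the whole round trip (not just the first record), one option is to collect the parsed records back to the driver and compare them against the original array. This is just a sketch for small test data like the 100x5 array above, since collect() pulls everything onto the driver:

> # sanity check on the driver (small data only)
> from numpy import vstack, allclose
> roundtrip = vstack(parsed.collect())
> print(allclose(roundtrip, dat))   # should print True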


Does that help?

-------------------------
jeremyfreeman.net
@thefreemanlab

> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
> 
> What are some efficient ways to read a large file into RDDs?
> 
> For example, have several executors read a specific/unique portion of the 
> file and construct RDDs. Is this possible to do in Spark?
> 
> Currently, I am doing a line-by-line read of the file at the driver and 
> constructing the RDD.
