Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here?
My current method has a large overhead (much more than the actual computation time). Also, the driver runs out of memory when it has to read the entire file. (One possible approach for the mixed-width records is sketched after the quoted message below.)

On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:

> If it’s a flat binary file and each record is the same length (in bytes),
> you can use Spark’s binaryRecords method (defined on the SparkContext),
> which loads records from one or more large flat binary files into an RDD.
> Here’s an example in python to show how it works:
>
> # write data from an array
> from numpy import random
> dat = random.randn(100,5)
> f = open('test.bin', 'w')
> f.write(dat)
> f.close()
>
> # load the data back in
> from numpy import frombuffer
>
> nrecords = 5
> bytesize = 8
> recordsize = nrecords * bytesize
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
>
> # these should be equal
> parsed.first()
> dat[0,:]
>
> Does that help?
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> What are some efficient ways to read a large file into RDDs?
>
> For example, have several executors read a specific/unique portion of the
> file and construct RDDs. Is this possible to do in Spark?
>
> Currently, I am doing a line-by-line read of the file at the driver and
> constructing the RDD.
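
Regarding the mixed short/long integer case: if every record still has the same total byte length, binaryRecords should still apply; only the parsing step changes, for example by using a NumPy structured dtype to split each record into its fields. Below is a minimal sketch. The file name 'mixed.bin' and the two-field layout (one little-endian 16-bit integer followed by one 64-bit integer per record) are assumptions for illustration; substitute the actual field order and widths.

import numpy as np

# assumed layout: one little-endian int16 followed by one int64 per record
record_dtype = np.dtype([('a', '<i2'), ('b', '<i8')])
recordsize = record_dtype.itemsize  # 10 bytes for this assumed layout

# each element of `data` is the raw bytes of one record; the file is split
# into partitions read by the executors
data = sc.binaryRecords('mixed.bin', recordsize)

# parse each fixed-length record into a (short, long) tuple of Python ints
parsed = data.map(lambda v: np.frombuffer(v, dtype=record_dtype)[0])
as_tuples = parsed.map(lambda r: (int(r['a']), int(r['b'])))

as_tuples.first()

Since binaryRecords reads the file in partitions on the executors, the driver never has to hold the whole file, which should also help with the memory problem. If the records are genuinely variable-length, this does not apply and a different input format would be needed.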