Thanks for the reply. Unfortunately, in my case, the binary file is a mix
of short and long integers. Is there any other way that could be of use here?

My current method has a large overhead (much more than the actual
computation time), and I also run out of memory at the driver when it has
to read the entire file.
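
To be concrete, here is roughly what I have in mind if I adapted your
binaryRecords example with a NumPy structured dtype. The field names and
the int16/int64 layout below are only an illustration, not my real format:

from numpy import dtype, frombuffer

# illustrative layout only: one int16 + one int64 per record = 10 bytes
record_dtype = dtype([('a', '<i2'), ('b', '<i8')])
record_size = record_dtype.itemsize          # fixed record length in bytes

# 'mixed.bin' is a placeholder path
records = sc.binaryRecords('mixed.bin', record_size)
parsed = records.map(lambda raw: frombuffer(raw, dtype=record_dtype)[0])

Would something along those lines still be reasonable when the record
layout mixes integer widths, or is there a better option?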

On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jer...@gmail.com>
wrote:

> If it’s a flat binary file and each record is the same length (in bytes),
> you can use Spark’s binaryRecords method (defined on the SparkContext),
> which loads records from one or more large flat binary files into an RDD.
> Here’s an example in python to show how it works:
>
> # write test data: 100 records of 5 float64 values each
> from numpy import random
> dat = random.randn(100, 5)
> f = open('test.bin', 'wb')   # binary mode, since we are writing raw bytes
> f.write(dat.tobytes())
> f.close()
>
>
> # load the data back in
>
> from numpy import frombuffer
>
> values_per_record = 5        # each record is one row of 5 float64 values
> bytes_per_value = 8          # float64 is 8 bytes
> recordsize = values_per_record * bytes_per_value
> data = sc.binaryRecords('test.bin', recordsize)
> parsed = data.map(lambda v: frombuffer(v, dtype='float64'))
>
>
> # these should be equal
> parsed.first()
> dat[0,:]
>
>
> Does that help?
>
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
>
> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> What are some efficient ways to read a large file into RDDs?
>
> For example, could several executors each read a specific, non-overlapping
> portion of the file and construct the RDD in parallel? Is this possible to
> do in Spark?
>
> Currently, I am doing a line-by-line read of the file at the driver and
> constructing the RDD.
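>
> Roughly, this is what I am doing today (the file name below is just a
> placeholder):
>
> lines = []
> with open('large_file.txt') as f:    # read happens entirely at the driver
>     for line in f:                   # line-by-line read
>         lines.append(line)           # whole file ends up in driver memory
> rdd = sc.parallelize(lines)          # only distributed after the full read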
>
>
>
