Thanks, Jeremy! I also work with time series data right now, so your suggestions are very relevant. However, we want to handle not the raw data but data that has already been processed and prepared for machine learning.

Initially, we also wanted to have our own simple binary format, but we could not agree on how to handle little/big endian: whether to stick to one specific byte order or to ship that information in a metadata file. And a metadata file sounds like yet another round of data format engineering (aka reinventing the wheel). Does this make sense to you?
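For example, one could avoid the metadata file entirely by fixing the byte order in the code convention itself: always write and always read an explicit little-endian dtype with numpy. A rough sketch (the file name and array shape are just placeholders):

# rough sketch: pin the byte order in code instead of in a metadata file
import numpy as np

features = np.random.randn(100, 5)   # placeholder dense feature matrix

# always write little-endian float64 ('<f8'), regardless of the host's native order
with open('features.bin', 'wb') as f:
    f.write(features.astype('<f8').tobytes())

# always read with the same explicit dtype, so the bytes decode identically on any host
with open('features.bin', 'rb') as f:
    restored = np.frombuffer(f.read(), dtype='<f8').reshape(-1, 5)

assert np.array_equal(features, restored)

The downside is that every reader and writer has to honor the same convention, but that is a much smaller contract than a separate metadata format.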
From: Jeremy Freeman [mailto:freeman.jer...@gmail.com]
Sent: Thursday, March 26, 2015 3:01 PM
To: Ulanov, Alexander
Cc: Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

Hi Ulanov,

Great question! We've encountered it frequently with scientific data (e.g. time series). Agreed that text is inefficient for dense arrays, and we also found HDF5+Spark to be a pain.

Our strategy has been flat binary files with fixed-length records. Loading these is now supported in Spark via the binaryRecords method, which wraps a custom Hadoop InputFormat we wrote. An example (in Python):

# write data from an array
from numpy import random
dat = random.randn(100, 5)
f = open('test.bin', 'wb')   # binary mode; writes the raw float64 buffer of the array
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize   # one record = one row of 5 float64 values
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0, :]

Compared to something like Parquet, this is a little lighter weight and plays nicer with non-distributed data science tools (e.g. numpy). It also scales well (we use it routinely to process TBs of time series) and handles both single files and directories. But it's extremely simple!

-------------------------
jeremyfreeman.net
@thefreemanlab

On Mar 26, 2015, at 2:33 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Thanks for the suggestion, but libsvm is a format for storing sparse data in a text file, and I have dense vectors. In my opinion, a text format is not appropriate for storing large dense vectors because of the overhead of parsing strings back into numbers, and storing numbers as strings is not space-efficient either.

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, March 26, 2015 2:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

There are some convenience methods you might consider, including MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoints.

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:

Hi,

Could you suggest a reasonable file format for storing feature vector data for machine learning in Spark MLlib? Are there any best practices for Spark?

My data is dense feature vectors with labels. Some of the requirements are that the format should be easily loaded/serialized, randomly accessible, and have a small footprint (binary). I am considering Parquet, HDF5, and protocol buffers (protobuf), but I have little to no experience with them, so any suggestions would be really appreciated.

Best regards,
Alexander
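For example, one possible layout that would satisfy these requirements is fixed-length binary records with the label stored as the first float64 of each record, followed by the dense features. A rough sketch (the file name and sizes are just placeholders):

# rough sketch: one fixed-length record per example, label first, then the features
import numpy as np

n_features = 5                                 # placeholder dimensionality
labels = np.arange(100, dtype='<f8')           # placeholder labels
features = np.random.randn(100, n_features)    # placeholder dense features

# each record is (1 + n_features) little-endian float64 values
records = np.hstack([labels[:, None], features]).astype('<f8')
with open('labeled.bin', 'wb') as f:
    f.write(records.tobytes())

# random access: record i starts at byte offset i * record_size
record_size = (1 + n_features) * 8
with open('labeled.bin', 'rb') as f:
    f.seek(7 * record_size)
    rec = np.frombuffer(f.read(record_size), dtype='<f8')
label, vector = rec[0], rec[1:]
assert label == labels[7]

Each record can then be sliced back into a label and a dense feature vector by any reader that knows the record length.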