Thanks, Jeremy! I also work with time series data right now, so your suggestions are very relevant. However, we want to handle not the raw data but data that has already been processed and prepared for machine learning.

Initially, we also wanted to have our own simple binary format, but we could not agree on how to handle little/big endian: whether to stick to one specific byte order or to ship that information in a metadata file. And a metadata file sounds like yet another round of data format engineering (aka reinventing the wheel). Does this make sense to you?
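For example, one could avoid the metadata file entirely by fixing the byte order in the code convention itself: always write and always read an explicit little-endian dtype with numpy. A rough sketch (the file name and array shape are just placeholders):

# rough sketch: pin the byte order in code instead of in a metadata file
import numpy as np

features = np.random.randn(100, 5)   # placeholder dense feature matrix

# always write little-endian float64 ('<f8'), regardless of the host's native order
with open('features.bin', 'wb') as f:
    f.write(features.astype('<f8').tobytes())

# always read with the same explicit dtype, so the bytes decode identically on any host
with open('features.bin', 'rb') as f:
    restored = np.frombuffer(f.read(), dtype='<f8').reshape(-1, 5)

assert np.array_equal(features, restored)

The downside is that every reader and writer has to honor the same convention, but that is a much smaller contract than a separate metadata format.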
From: Jeremy Freeman [mailto:freeman.jer...@gmail.com]
Sent: Thursday, March 26, 2015 3:01 PM
To: Ulanov, Alexander
Cc: Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

Hi Ulanov,

Great question! We've encountered it frequently with scientific data (e.g. time series). Agreed that text is inefficient for dense arrays, and we also found HDF5+Spark to be a pain.

Our strategy has been flat binary files with fixed-length records. Loading these is now supported in Spark via the binaryRecords method, which wraps a custom Hadoop InputFormat we wrote. An example (in Python):

# write data from an array
from numpy import random
dat = random.randn(100, 5)
f = open('test.bin', 'wb')   # binary mode; writes the raw float64 buffer of the array
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize   # one record = one row of 5 float64 values
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0, :]

Compared to something like Parquet, this is a little lighter weight and plays nicer with non-distributed data science tools (e.g. numpy). It also scales well (we use it routinely to process TBs of time series) and handles both single files and directories. But it's extremely simple!

-------------------------
jeremyfreeman.net
@thefreemanlab

On Mar 26, 2015, at 2:33 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Thanks for the suggestion, but libsvm is a format for storing sparse data in a text file, and I have dense vectors. In my opinion, a text format is not appropriate for storing large dense vectors because of the overhead of parsing strings back into numbers, and storing numbers as strings is not space-efficient either.

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, March 26, 2015 2:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

There are some convenience methods you might consider, including MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoints.

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:

Hi,

Could you suggest a reasonable file format for storing feature vector data for machine learning in Spark MLlib? Are there any best practices for Spark?

My data is dense feature vectors with labels. Some of the requirements are that the format should be easily loaded/serialized, randomly accessible, and have a small footprint (binary). I am considering Parquet, HDF5, and protocol buffers (protobuf), but I have little to no experience with them, so any suggestions would be really appreciated.

Best regards,
Alexander
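For example, one possible layout that would satisfy these requirements is fixed-length binary records with the label stored as the first float64 of each record, followed by the dense features. A rough sketch (the file name and sizes are just placeholders):

# rough sketch: one fixed-length record per example, label first, then the features
import numpy as np

n_features = 5                                 # placeholder dimensionality
labels = np.arange(100, dtype='<f8')           # placeholder labels
features = np.random.randn(100, n_features)    # placeholder dense features

# each record is (1 + n_features) little-endian float64 values
records = np.hstack([labels[:, None], features]).astype('<f8')
with open('labeled.bin', 'wb') as f:
    f.write(records.tobytes())

# random access: record i starts at byte offset i * record_size
record_size = (1 + n_features) * 8
with open('labeled.bin', 'rb') as f:
    f.seek(7 * record_size)
    rec = np.frombuffer(f.read(record_size), dtype='<f8')
label, vector = rec[0], rec[1:]
assert label == labels[7]

Each record can then be sliced back into a label and a dense feature vector by any reader that knows the record length.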