On binary file formats: I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input; you couldn't pass them anything like an InputStream). I don't know if it has gotten any better.
Parquet plays much more nicely, and lots of Spark-related projects already use it. Keep in mind that it's column-oriented, which might impact performance, but basically you're going to want your features in a byte array, and deserialization should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:

> There are some convenience methods you might consider including:
>
> MLUtils.loadLibSVMFile
>
> and MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
>
> > Hi,
> >
> > Could you suggest what would be the reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> > best practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> > are that the format should be easy loaded/serialized, randomly accessible,
> > with a small footprint (binary). I am considering Parquet, hdf5, protocol
> > buffer (protobuf), but I have little to no experience with them, so any
> > suggestions would be really appreciated.
> >
> > Best regards, Alexander
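To make the byte-array point in my reply concrete, here is a minimal sketch of packing one labeled dense vector into bytes and reading it back. The layout (a 4-byte count, then the label, then the features, all big-endian doubles) and the names pack_example and unpack_example are assumptions made up for illustration; they are not part of Parquet or MLlib. Big-endian is chosen only because it matches the default byte order of java.nio.ByteBuffer on the JVM side.

```python
import struct


def pack_example(label, features):
    """Pack a label and a dense feature vector into one byte string.

    Hypothetical layout: int32 feature count, float64 label,
    then `count` float64 features, all big-endian with no padding.
    """
    n = len(features)
    return struct.pack(">id%dd" % n, n, label, *features)


def unpack_example(buf):
    """Inverse of pack_example: return (label, features)."""
    # Read the 4-byte feature count first so we know how many doubles follow.
    (n,) = struct.unpack_from(">i", buf, 0)
    # The label plus n features start right after the count, at offset 4.
    values = struct.unpack_from(">%dd" % (n + 1), buf, 4)
    return values[0], list(values[1:])


blob = pack_example(1.0, [0.5, 0.25, -2.0])
label, features = unpack_example(blob)
```

A blob like this could then sit in a single binary column, one row per example, so reading an example back is one column read plus the unpack step.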