Thanks, Evan. What do you think about Protobuf? Twitter has a library for managing protobuf files in HDFS: https://github.com/twitter/elephant-bird
From: Evan R. Sparks [mailto:[email protected]]
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander; [email protected]
Subject: Re: Storing large data for MLlib machine learning

On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input, you couldn't pass it anything like an InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely and there are lots of spark-related projects using it already. Keep in mind that it's column-oriented which might impact performance - but basically you're going to want your features in a byte array and deser should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <[email protected]> wrote:

There are some convenience methods you might consider including: MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <[email protected]>:

> Hi,
>
> Could you suggest what would be the reasonable file format to store
> feature vector data for machine learning in Spark MLlib? Are there any best
> practices for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easy loaded/serialized, randomly accessible, with
> a small footprint (binary). I am considering Parquet, hdf5, protocol buffer
> (protobuf), but I have little to no experience with them, so any
> suggestions would be really appreciated.
>
> Best regards, Alexander
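
The suggestion above to keep the features in a byte array could look roughly like the sketch below. The record layout (an int32 length, then a float64 label, then float64 features, all little-endian) and the names encode_example and decode_example are assumptions made up for illustration; they are not part of MLlib, Parquet, or protobuf.

```python
import struct


def encode_example(label, features):
    """Pack one labeled dense vector into bytes.

    Assumed layout (hypothetical, little-endian, no padding):
    int32 feature count, float64 label, then float64 features.
    """
    n = len(features)
    return struct.pack(f"<id{n}d", n, label, *features)


def decode_example(buf):
    """Inverse of encode_example: returns (label, list_of_features)."""
    (n,) = struct.unpack_from("<i", buf, 0)
    # The label and the n features are n + 1 consecutive float64 values
    # starting right after the 4-byte count.
    label, *features = struct.unpack_from(f"<{n + 1}d", buf, 4)
    return label, features


if __name__ == "__main__":
    record = encode_example(1.0, [0.5, -2.0, 3.25])
    print(len(record), decode_example(record))
```

Each record here takes 4 + 8 * (n + 1) bytes, so for fixed-length vectors the offset of the i-th record in a flat file can be computed directly, which is one simple way to get the random access mentioned in the original question. The same bytes could also be stored as a binary column in a container format such as Parquet.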
