Thanks, Evan. What do you think about Protobuf? Twitter has a library, elephant-bird, for managing protobuf files in HDFS: https://github.com/twitter/elephant-bird
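
(A rough sketch of what the protobuf route could look like, assuming a hypothetical generated class my.protos.FeatureVector built from a .proto with a double label and repeated double values, and hypothetical HDFS paths. It sidesteps elephant-bird's own Hadoop input/output formats and simply serializes each record to a protobuf byte array stored in a SequenceFile.)

import org.apache.spark.{SparkConf, SparkContext}
// FeatureVector is a *hypothetical* protobuf-generated class, e.g. from:
//   message FeatureVector { double label = 1; repeated double values = 2; }
import my.protos.FeatureVector

object ProtobufStorageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("protobuf-storage-sketch"))

    val data: Seq[(Double, Array[Double])] = Seq((1.0, Array(0.1, 0.2)), (0.0, Array(0.3, 0.4)))

    // Serialize each labeled dense vector to a protobuf byte array.
    val bytes = sc.parallelize(data).map { case (label, values) =>
      val builder = FeatureVector.newBuilder().setLabel(label)
      values.foreach(v => builder.addValues(v))
      ("", builder.build().toByteArray)
    }

    // Store the raw bytes in a SequenceFile (hypothetical path); elephant-bird instead
    // ships ready-made Hadoop input/output formats for protobuf-encoded records.
    bytes.saveAsSequenceFile("hdfs:///data/features.protobuf.seq")

    // Read the records back and parse each byte array into a FeatureVector message.
    val reloaded = sc.sequenceFile[String, Array[Byte]]("hdfs:///data/features.protobuf.seq")
      .map { case (_, b) => FeatureVector.parseFrom(b) }
    reloaded.take(2).foreach(println)

    sc.stop()
  }
}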
From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, March 26, 2015 2:34 PM
To: Stephen Boesch
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input; you couldn't pass them anything like an InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely, and there are already lots of Spark-related projects using it. Keep in mind that it's column-oriented, which might impact performance - but basically you're going to want your features in a byte array, and deserialization should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:

There are some convenience methods you might consider, including MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoint.

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
> Hi,
>
> Could you suggest what would be a reasonable file format to store
> feature vector data for machine learning in Spark MLlib? Are there any best
> practices for Spark?
>
> My data is dense feature vectors with labels. Some of the requirements are
> that the format should be easily loaded/serialized and randomly accessible,
> with a small footprint (binary). I am considering Parquet, HDF5, and protocol
> buffers (protobuf), but I have little to no experience with them, so any
> suggestions would be really appreciated.
>
> Best regards,
> Alexander
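
(A minimal sketch of the two suggestions above - loading labeled vectors with MLUtils and round-tripping them through Parquet - assuming Spark 1.3-era APIs (SQLContext, saveAsParquetFile) and hypothetical HDFS paths.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SQLContext

object FeatureStorageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("feature-storage-sketch"))
    val sqlContext = new SQLContext(sc)

    // Load labeled feature vectors from a (hypothetical) LIBSVM-format file on HDFS.
    val points = MLUtils.loadLibSVMFile(sc, "hdfs:///data/features.libsvm")

    // LabeledPoint is a case class whose Vector field has a SQL UDT, so the RDD can be
    // turned into a DataFrame and written out as Parquet (binary, columnar, splittable).
    val df = sqlContext.createDataFrame(points)
    df.saveAsParquetFile("hdfs:///data/features.parquet")  // later versions: df.write.parquet(...)

    // Read it back; each row carries the label and the (dense or sparse) feature vector.
    val reloaded = sqlContext.parquetFile("hdfs:///data/features.parquet")
    reloaded.show(5)

    sc.stop()
  }
}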