Protobufs are great for serializing individual records - but Parquet is good for efficiently storing a whole bunch of these objects.
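
To make the "individual records" half of that concrete, here is a minimal sketch in plain Python (standard library only, so neither Protobuf nor Parquet is involved) of packing one labeled dense vector into a byte array and reading it back. The layout and the names pack_labeled_point / unpack_labeled_point are assumptions made up for illustration, not anything from MLlib or the libraries mentioned in this thread.

```python
import struct

# Hypothetical record layout (an assumption for illustration only):
#   label: float64 | n: int32 | n features: float64 each, little-endian
HEADER = struct.Struct("<di")

def pack_labeled_point(label, features):
    """Serialize one (label, dense features) record into a byte string."""
    n = len(features)
    return HEADER.pack(label, n) + struct.pack("<%dd" % n, *features)

def unpack_labeled_point(buf):
    """Inverse of pack_labeled_point: bytes -> (label, list of features)."""
    label, n = HEADER.unpack_from(buf, 0)
    features = list(struct.unpack_from("<%dd" % n, buf, HEADER.size))
    return label, features

# Round-trip a single record.
record = pack_labeled_point(1.0, [0.5, 0.25, 2.0])
label, features = unpack_labeled_point(record)
```

Something like this only handles one record at a time; batching, compression, and schema handling across many records is the part a format like Parquet takes care of.
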
Matt Massie has a good (slightly dated) blog post on using
Spark+Parquet+Avro (and you can pretty much s/Avro/Protobuf/) describing
how they all work together here:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Your use case (storing dense features, presumably as a single column) is
pretty straightforward and the extra layers of indirection are maybe
overkill.

Lastly - you might consider using some of SparkSQL/DataFrame's built-in
features for persistence, which support lots of storage backends.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources

On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

> Thanks, Evan. What do you think about Protobuf? Twitter has a library to
> manage protobuf files in hdfs https://github.com/twitter/elephant-bird
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Thursday, March 26, 2015 2:34 PM
> *To:* Stephen Boesch
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Storing large data for MLlib machine learning
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input, you couldn't pass it anything like an
> InputStream). I don't know if it has gotten any better.
>
> Parquet plays much more nicely and there are lots of spark-related
> projects using it already. Keep in mind that it's column-oriented which
> might impact performance - but basically you're going to want your features
> in a byte array and deser should be pretty straightforward.
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> There are some convenience methods you might consider including:
>
> MLUtils.loadLibSVMFile
>
> and MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
>
> > Hi,
> >
> > Could you suggest what would be the reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> > best practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> > are that the format should be easy loaded/serialized, randomly
> > accessible, with a small footprint (binary). I am considering Parquet,
> > hdf5, protocol buffer (protobuf), but I have little to no experience
> > with them, so any suggestions would be really appreciated.
> >
> > Best regards, Alexander