Protobufs are great for serializing individual records, but Parquet is
good for efficiently storing a whole bunch of these objects.
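To make the record-at-a-time shape concrete, here's a hand-rolled sketch
using plain length-prefixed framing as a stand-in (real protobuf would go
through generated Message classes, which I don't have on hand here):

```scala
import java.io._

// Stand-in for protobuf-style framing: each record is serialized on its
// own and length-prefixed, so a file is just a sequence of independent
// blobs - easy to append, but you must scan whole records to read one field.
val out = new ByteArrayOutputStream()
val dos = new DataOutputStream(out)
for (features <- Seq(Array(0.1, 0.2), Array(0.3, 0.4))) {
  dos.writeInt(features.length)        // length prefix
  features.foreach(dos.writeDouble)    // record payload
}
dos.flush()

// Reading is the mirror image: length prefix, then that many doubles.
val dis = new DataInputStream(new ByteArrayInputStream(out.toByteArray))
val first = Array.fill(dis.readInt())(dis.readDouble())
```

Parquet, by contrast, lays the values for a column out together across all
records, which is what buys you compression and efficient scans.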

Matt Massie has a good (slightly dated) blog post on using
Spark+Parquet+Avro (and you can pretty much s/Avro/Protobuf/) describing
how they all work together here:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Your use case (storing dense features, presumably as a single column) is
pretty straightforward, and the extra layers of indirection may be
overkill.
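Concretely, something like this (a sketch against the Spark 1.3-era API;
`LabeledFeatures`, the local SparkContext, and the output path are all
illustrative names, not anything from MLlib):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Illustrative schema: a label plus the dense feature vector as one array column.
case class LabeledFeatures(label: Double, features: Array[Double])

val sc = new SparkContext("local", "features-example")
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq(
  LabeledFeatures(1.0, Array(0.1, 0.2, 0.3)),
  LabeledFeatures(0.0, Array(0.4, 0.5, 0.6))
)).toDF()

// Writes a compact, binary, columnar file; the features column stays together.
df.saveAsParquetFile("features.parquet")
```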

Lastly - you might consider using some of SparkSQL/DataFrame's built-in
features for persistence, which support lots of storage backends.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources
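For instance, assuming you already have a DataFrame `df` and a
`sqlContext` in scope (again just a sketch; the paths are placeholders and
the second argument is the data source name):

```scala
// Save/load through the pluggable data sources API - swap the source
// name ("parquet", "json", "jdbc", ...) to switch storage backends.
df.save("features.parquet", "parquet")
val loaded = sqlContext.load("features.parquet", "parquet")
```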

On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander <alexander.ula...@hp.com>
wrote:

>  Thanks, Evan. What do you think about Protobuf? Twitter has a library to
> manage protobuf files in HDFS: https://github.com/twitter/elephant-bird
>
>
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Thursday, March 26, 2015 2:34 PM
> *To:* Stephen Boesch
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Storing large data for MLlib machine learning
>
>
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input; you couldn't pass them anything like an
> InputStream). I don't know if it has gotten any better.
>
>
>
> Parquet plays much more nicely, and there are lots of Spark-related
> projects using it already. Keep in mind that it's column-oriented, which
> might impact performance - but basically you're going to want your
> features in a byte array, and deserialization should be pretty
> straightforward.
>
>
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> There are some convenience methods you might consider, including:
>
>   MLUtils.loadLibSVMFile
>   MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
>
>
> > Hi,
> >
> > Could you suggest what would be a reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> > best practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> > are that the format should be easily loaded/serialized, randomly
> > accessible, and have a small footprint (binary). I am considering
> > Parquet, HDF5, and protocol buffers (protobuf), but I have little to no
> > experience with them, so any suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>
>
>
