Thanks for the suggestion, but libsvm is a text-based format for sparse data, 
and I have dense vectors. In my opinion, a text format is not appropriate for 
storing large dense vectors: parsing strings into numbers adds overhead, and 
storing numbers as strings is not space-efficient.
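
For illustration, the binary route I have in mind would look something like 
this with Parquet (a minimal sketch assuming the Spark 1.3 DataFrame API; the 
path and toy values are made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Dense feature vectors with labels (toy values).
    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.1, 0.2, 0.3)),
      LabeledPoint(0.0, Vectors.dense(0.4, 0.5, 0.6))))

    // Write as binary, columnar Parquet instead of text.
    points.toDF().saveAsParquetFile("/tmp/points.parquet")

    // Read back without per-value string parsing.
    val loaded = sqlContext.parquetFile("/tmp/points.parquet")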

From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, March 26, 2015 2:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

There are some convenience methods you might consider, including:

    MLUtils.loadLibSVMFile
    MLUtils.loadLabeledPoints
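
For example, loading a libsvm text file yields an RDD[LabeledPoint] (a minimal 
sketch; sc is an existing SparkContext and the path points at Spark's bundled 
sample file):

    import org.apache.spark.mllib.util.MLUtils

    // Parses the libsvm text format into an RDD[LabeledPoint].
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")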

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:
Hi,

Could you suggest a reasonable file format for storing feature vector data for 
machine learning in Spark MLlib? Are there any best practices for Spark?

My data is dense feature vectors with labels. Some of the requirements are that 
the format should be easy to load/serialize, randomly accessible, and have a 
small footprint (binary). I am considering Parquet, HDF5, and Protocol Buffers 
(protobuf), but I have little to no experience with them, so any suggestions 
would be really appreciated.
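
For concreteness, a single record is a label plus a dense feature vector, e.g. 
(a minimal illustration using MLlib types; values are made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // One labeled example with a dense feature vector (toy values).
    val example = LabeledPoint(1.0, Vectors.dense(0.9, 5.1, 0.0, 2.2))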

Best regards, Alexander
