Thanks for the suggestion, but LIBSVM is a text format for storing sparse data, and my vectors are dense. In my opinion, a text format is not appropriate for storing large dense vectors: parsing strings back into numbers adds overhead, and storing numbers as strings is not space-efficient.
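For illustration, here is a minimal sketch of one binary alternative I am considering, assuming the Spark 1.3 DataFrame API (where MLlib's Vector carries a SQL UserDefinedType) and a spark-shell context providing sc; the data and the output path are placeholders:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Dense labeled feature vectors kept in an RDD[LabeledPoint]
val points = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.1, 0.2, 0.3)),
  LabeledPoint(0.0, Vectors.dense(0.4, 0.5, 0.6))))

// Vector has a UDT in 1.3, so the RDD converts to a DataFrame directly
val df = points.toDF()

// Write as Parquet: compact, binary, columnar storage
df.saveAsParquetFile("/tmp/features.parquet")

// Read back later
val loaded = sqlContext.parquetFile("/tmp/features.parquet")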
From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Thursday, March 26, 2015 2:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

There are some convenience methods you might consider including: MLUtils.loadLibSVMFile and MLUtils.loadLabeledPoint

2015-03-26 14:16 GMT-07:00 Ulanov, Alexander <alexander.ula...@hp.com>:

Hi,

Could you suggest what would be a reasonable file format for storing feature vector data for machine learning in Spark MLlib? Are there any best practices for Spark? My data is dense feature vectors with labels. Some of the requirements are that the format should be easily loaded/serialized, randomly accessible, and have a small footprint (binary). I am considering Parquet, hdf5, and protocol buffers (protobuf), but I have little to no experience with them, so any suggestions would be really appreciated.

Best regards, Alexander
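For reference, a quick sketch of how those MLUtils loaders are used in Spark 1.x (the paths are placeholders, and the second loader is actually named loadLabeledPoints):

import org.apache.spark.mllib.util.MLUtils

// LIBSVM text format -> RDD[LabeledPoint] with sparse feature vectors
val svmData = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

// Text-serialized LabeledPoints, as written by RDD[LabeledPoint].saveAsTextFile
val lpData = MLUtils.loadLabeledPoints(sc, "data/labeled_points.txt")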