Re: Storing large data for MLlib machine learning

2015-04-01 Thread Hector Yee
I use Thrift, then base64-encode the serialized binary and save it as
text-file lines that are snappy- or gzip-compressed.

It makes it very easy to copy small chunks locally and play with subsets of
the data, without taking a dependency on HDFS / Hadoop on the serving side,
for example.
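
Decoding then looks roughly like this (a Scala sketch only: it assumes
libthrift on the classpath, and Example is a stand-in for whatever
Thrift-generated record class you actually use):

    import java.util.Base64
    import org.apache.thrift.TDeserializer
    import org.apache.thrift.protocol.TBinaryProtocol

    // Example is a placeholder for a Thrift-generated record class.
    def decode(line: String): Example = {
      val bytes = Base64.getDecoder.decode(line)
      val record = new Example()
      // Binary protocol here must match whatever the records were written with.
      new TDeserializer(new TBinaryProtocol.Factory()).deserialize(record, bytes)
      record
    }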


On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Thanks, Evan. What do you think about Protobuf? Twitter has a library to
 manage protobuf files in HDFS: https://github.com/twitter/elephant-bird


-- 
Yee Yang Li Hector
google.com/+HectorYee


RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Thanks, sounds interesting! How do you load the files into Spark? Did you
consider having multiple files instead of file lines?

From: Hector Yee [mailto:hector@gmail.com]
Sent: Wednesday, April 01, 2015 11:36 AM
To: Ulanov, Alexander
Cc: Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

I use Thrift, then base64-encode the serialized binary and save it as
text-file lines that are snappy- or gzip-compressed.

It makes it very easy to copy small chunks locally and play with subsets of
the data, without taking a dependency on HDFS / Hadoop on the serving side,
for example.


RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Jeremy, thanks for the explanation!
What if you used the Parquet file format instead? You could still write a
number of small files as you do now, but you wouldn't have to implement a
writer/reader yourself, because Parquet readers and writers are available in
various languages.
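
Something like this sketch, for instance (Spark 1.3-era API; the Record type
and path are made up for illustration):

    import org.apache.spark.sql.SQLContext

    // Illustrative row type: a label plus a dense feature vector.
    case class Record(label: Double, features: Seq[Double])

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Write one shard as its own Parquet directory; no custom writer needed.
    val shard = sc.parallelize(Seq(Record(1.0, Seq(0.1, 0.2))))
    shard.toDF().saveAsParquetFile("/data/parquet/shard-0000")

    // Read it back with the stock reader; other languages can point their own
    // Parquet libraries at the same files.
    val loaded = sqlContext.parquetFile("/data/parquet/shard-0000")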

From: Jeremy Freeman [mailto:freeman.jer...@gmail.com]
Sent: Wednesday, April 01, 2015 1:37 PM
To: Hector Yee
Cc: Ulanov, Alexander; Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning

@Alexander, re: using flat binary and metadata, you raise excellent points! At 
least in our case, we decided on a specific endianness, but do end up storing 
some extremely minimal specification in a JSON file, and have written importers 
and exporters within our library to parse it. While it does feel a little like 
reinvention, it's fast, direct, and scalable, and seems pretty sensible if you 
know your data will be dense arrays of numerical features.
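
As a sketch of the read side (illustrative only: pretend the JSON sidecar
specifies 100 little-endian doubles per record):

    import java.nio.{ByteBuffer, ByteOrder}

    val numFeatures = 100
    val recordLength = numFeatures * 8  // bytes per fixed-length record

    // sc.binaryRecords splits flat binary files into fixed-length records.
    val vectors = sc.binaryRecords("/data/features.bin", recordLength).map { bytes =>
      val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
      Array.fill(numFeatures)(buf.getDouble)
    }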

-- 
jeremyfreeman.net
@thefreemanlab

On Apr 1, 2015, at 3:52 PM, Hector Yee hector@gmail.com wrote:


Just using sc.textFile, then a .map(decode).
Yes, by default it is multiple files; our training data is 1 TB, gzipped
into 5000 shards.
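
That is, something like the following (path illustrative; decode is the
base64 + Thrift step sketched earlier in the thread):

    // gzip is not splittable, so each of the ~5000 shards becomes one
    // partition; Spark decompresses .gz files transparently.
    val training = sc.textFile("hdfs:///data/training/part-*.gz").map(decode)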


Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
On binary file formats - I looked at HDF5+Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
needed filenames as input, you couldn't pass it anything like an
InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely, and there are lots of Spark-related projects
using it already. Keep in mind that it's column-oriented, which might impact
performance - but basically you're going to want your features in a byte
array, and deserialization should be pretty straightforward.
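
One hedged reading of that suggestion in Scala (FeatureRow, pack, and unpack
are invented for illustration):

    import java.nio.ByteBuffer

    // Keep the dense vector as one binary column rather than many numeric columns.
    case class FeatureRow(label: Double, features: Array[Byte])

    def pack(values: Array[Double]): Array[Byte] = {
      val buf = ByteBuffer.allocate(values.length * 8)
      values.foreach(buf.putDouble)
      buf.array()
    }

    def unpack(bytes: Array[Byte]): Array[Double] = {
      val buf = ByteBuffer.wrap(bytes)
      Array.fill(bytes.length / 8)(buf.getDouble)
    }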

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch java...@gmail.com wrote:

 There are some convenience methods you might consider, including:

    MLUtils.loadLibSVMFile

 and MLUtils.loadLabeledPoints
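
 For reference, usage is one line each (paths illustrative):

    import org.apache.spark.mllib.util.MLUtils

    // LibSVM text format -> RDD[LabeledPoint]
    val libsvm = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // Text written by RDD[LabeledPoint].saveAsTextFile -> RDD[LabeledPoint]
    val points = MLUtils.loadLabeledPoints(sc, "data/points.txt")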

 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander alexander.ula...@hp.com:

  Hi,

  Could you suggest what would be a reasonable file format for storing
  feature vector data for machine learning in Spark MLlib? Are there any
  best practices for Spark?

  My data is dense feature vectors with labels. Some of the requirements are
  that the format should be easily loaded/serialized, randomly accessible,
  and have a small footprint (binary). I am considering Parquet, HDF5, and
  protocol buffers (protobuf), but I have little to no experience with them,
  so any suggestions would be really appreciated.

  Best regards, Alexander