Re: what is the best way to implement mini batches?

2014-12-11 Thread Duy Huynh
the dataset i'm working on has about 100,000 records. the batches we're training on have a size of around 10. can you repartition(10000) into 10,000 partitions? On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia wrote: > You can just do mapPartitions on the whole RDD, and then call sliding() > o
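
The quoted reply suggests batching inside each partition instead of creating one partition per batch. Below is a minimal sketch of that idea in plain Python, not Spark. The name `batches_in_partition`, the batch size of 10, and the list-of-lists stand-in for an RDD are all assumptions made for illustration. The reply mentions sliding(); this sketch simply cuts non-overlapping chunks with `itertools.islice`. A function of this shape (iterator in, iterator out) is the kind of callable PySpark's `rdd.mapPartitions` accepts.

```python
from itertools import islice


def batches_in_partition(records, batch_size=10):
    """Yield consecutive lists of up to batch_size records from one partition's iterator."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for an RDD (assumption): 4 "partitions" holding 100 records in total.
partitions = [list(range(start, start + 25)) for start in range(0, 100, 25)]

# Batch each partition independently, as mapPartitions would.
mini_batches = [b for part in partitions for b in batches_in_partition(part)]
```

The last batch of a partition may be smaller than `batch_size` when the partition size is not a multiple of it, so the training loop should tolerate short batches.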

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
seconds per iteration. i love spark and really enjoy writing scala code. but this huge difference in performance makes it really hard to do any kind of machine learning work. On Thu, Dec 11, 2014 at 2:18 PM, Duy Huynh wrote: > both. > > first, the distributed version is so much sl

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
both. first, the distributed version is so much slower than python. i tried a few things like broadcasting variables, replacing Seq with Array, and a few other little things. these help improve the performance, but it is still slower than the python code. so, i wrote a local version that's pretty mu

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
to a class >> of MatrixFactorizationModel. That class is package private to MLlib right >> now, so you'd need to copy the logic over to a new class, but that's the >> basic idea. >> >> That said - using spark to serve these recommendations on a >> point-

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
these recommendations on a point-by-point > basis might not be optimal. There's some work going on in the AMPLab to > address this issue. > > On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh wrote: > >> you're right, serialization works. >> >> what is your sugges

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
aster and simpler > (using a cluster of machines is more for load balancing / fault tolerance). > > What is your use case for model serving? > > On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh wrote: > >&

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
Watson Research Center >> >> >> Xiangrui >> Meng ---11/05/2014 01:13:40 PM---You can use breeze for local sparse-sparse >> matrix

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
kit-learn docs recommend pickling - > http://scikit-learn.org/stable/modules/model_persistence.html). These all > seem basically equivalent to java serialization to me. > > Would some helper functions (in, say, mllib.util.modelpersistence or > something) make sense to add? > > On Thu, N

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Duy Huynh
that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks wrote: > Plain old java serialization is one straightforward app
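
The save-then-reload pattern discussed here can be sketched with Python's `pickle`, the analogue of plain Java serialization that the scikit-learn docs referenced elsewhere in this thread recommend. This is only an illustration: the `model` dictionary is a hypothetical stand-in for a trained word2vec model, and the file name and temporary directory are assumptions, not anything from MLlib.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained word2vec model: word -> vector.
model = {"spark": [0.1, 0.2, 0.3], "scala": [0.4, 0.5, 0.6]}

# Assumed location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "word2vec_model.pkl")

# Save the model after training.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back later, possibly in another process.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

As with Java serialization, the loading side needs compatible class definitions, and pickle files should only be loaded from trusted sources.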

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
in case this won't be available anytime soon with spark, what would be a good way to implement this multiplication feature in spark? On Wed, Nov 5, 2014 at 4:59 AM, Duy Huynh wrote: > distributed. something like CoordinateMatrix.multiply(CoordinateMatrix). > > > thanks xia
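
One common way to multiply two sparse matrices stored as (row, column, value) entries is to pair every entry of A with the entries of B whose row equals A's column, multiply the values, and sum the partial products per output cell. The sketch below shows that logic in plain Python; it is not taken from the thread and is not an MLlib API. The function name `multiply_coordinate` is hypothetical. In Spark the same two steps would roughly correspond to a `join` on the shared index followed by a `reduceByKey`.

```python
from collections import defaultdict


def multiply_coordinate(a_entries, b_entries):
    """Multiply two sparse matrices given as (row, col, value) triples."""
    # Index B by row so each A entry (i, k, v) can find its partners (k, j, w).
    b_by_row = defaultdict(list)
    for k, j, w in b_entries:
        b_by_row[k].append((j, w))

    # Accumulate the partial products v * w for every output cell (i, j).
    product = defaultdict(float)
    for i, k, v in a_entries:
        for j, w in b_by_row.get(k, []):
            product[(i, j)] += v * w

    # Drop cells that cancelled to zero and return entries in a stable order.
    return sorted((i, j, v) for (i, j), v in product.items() if v != 0.0)


# A is 2x3 and B is 3x2, both in coordinate form.
a = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
b = [(0, 1, 4.0), (2, 1, 5.0), (1, 0, 6.0)]
result = multiply_coordinate(a, b)
```

Only index pairs that actually share an inner index produce work, which is what keeps the cost proportional to the number of nonzero products.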

Re: Matrix multiplication in spark

2014-11-05 Thread Duy Huynh
ok great. when will this be ready? On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng wrote: > We are working on distributed block matrices. The main JIRA is at: > > https://issues.apache.org/jira/browse/SPARK-3434 > > The goal is to support basic distributed linear algebra (dense first > and then

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
distributed. something like CoordinateMatrix.multiply(CoordinateMatrix). thanks xiangrui! On Wed, Nov 5, 2014 at 4:24 AM, Xiangrui Meng wrote: > local matrix-matrix multiplication or distributed? > > On Tue, Nov 4, 2014 at 11:58 PM, ll wrote: > > what is the best way to implement a sparse x

Re: object in an rdd: serializable?

2014-10-17 Thread Duy Huynh
interesting. why does case class work for this? thanks boromir! On Thu, Oct 16, 2014 at 10:41 PM, Boromir Widas wrote: > making it a case class should work. > > On Thu, Oct 16, 2014 at 8:30 PM, ll wrote: > >> i got an exception complaining about serializable. the sample code is >> below... >>

Re: scala: java.net.BindException?

2014-10-16 Thread Duy Huynh
thanks marcelo. i only instantiated sparkcontext once, at the beginning, in this code. the exception was thrown right at the beginning. i also tried to run other programs, which worked fine previously, but now they also get the same error. it looks like something put a "global block" on creating sparkcontext

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
great, thanks! On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave wrote: > On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh > wrote: > >> a related question, what is the best way to update the values of existing >> vertices and edges? >> > > Many of the Graph methods deal

Re: mllib CoordinateMatrix

2014-10-14 Thread Duy Huynh
thanks reza! On Tue, Oct 14, 2014 at 5:02 PM, Reza Zadeh wrote: > Hello, > > CoordinateMatrix is in its infancy, and right now is only a placeholder. > > To get/set the value at (i,j), you should map the entries rdd using the > usual rdd map operation, and change the relevant entries. > > To get
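
The quoted advice is to treat the matrix as a collection of entries and change a value by mapping over them. Here is a minimal plain-Python sketch of that pattern, using (row, col, value) tuples in a list as a stand-in for the entries RDD. The helper names `set_entry` and `get_entry` are hypothetical, and the lookup in `get_entry` is an assumption, since the reply is cut off before it describes reads.

```python
def set_entry(entries, i, j, new_value):
    """Return a new list of entries with the value at (i, j) replaced."""
    return [(r, c, new_value if (r, c) == (i, j) else v) for r, c, v in entries]


def get_entry(entries, i, j, default=0.0):
    """Return the value stored at (i, j), or default if no entry exists."""
    for r, c, v in entries:
        if (r, c) == (i, j):
            return v
    return default


entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]

# The original list is left untouched, mirroring how an RDD map yields a new RDD.
updated = set_entry(entries, 0, 2, 9.0)
```

Note that `set_entry` only rewrites an entry that already exists; adding a value at a position with no stored entry would need an extra append or union step.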

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
thanks ankur. indexedrdd sounds super helpful! a related question, what is the best way to update the values of existing vertices and edges? On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave wrote: > On Tue, Oct 14, 2014 at 12:36 PM, ll wrote: > >> hi again. just want to check in again to see if a