Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
both. first, the distributed version is so much slower than python. i tried a few things like broadcasting variables, replacing Seq with Array, and a few other little things. they helped improve the performance, but it's still slower than the python code. so, i wrote a local version that's pretty
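As a rough illustration of the broadcast-variable tuning mentioned above (not code from the thread; the weight array and file path are made up), broadcasting a large read-only array ships it to each executor once instead of serializing it into every task closure:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

        // hypothetical model weights: large, read-only, needed by every task
        val weights: Array[Double] = Array.fill(100000)(0.01)

        // broadcast once per executor instead of once per task
        val bcWeights = sc.broadcast(weights)

        val data = sc.textFile("features.txt").map(_.split(",").map(_.toDouble))
        val scores = data.map { features =>
          val w = bcWeights.value
          features.zip(w).map { case (x, wi) => x * wi }.sum
        }
        scores.take(5).foreach(println)
        sc.stop()
      }
    }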

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
seconds per iteration. i love spark and really enjoy writing scala code. but this huge difference in performance makes it really hard to do any kind of machine learning work. On Thu, Dec 11, 2014 at 2:18 PM, Duy Huynh duy.huynh@gmail.com wrote: both. first, the distributed version is so

Re: what is the best way to implement mini batches?

2014-12-11 Thread Duy Huynh
the dataset i'm working on has about 100,000 records. the batch that we're training on has a size around 10. can you repartition(10,000) into 10,000 partitions? On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can just do mapPartitions on the whole RDD, and
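A minimal sketch of the mapPartitions approach Matei describes, assuming the records are already numeric feature vectors and using a placeholder per-batch computation (run in spark-shell, where sc already exists):

    // ~100,000 records split into ~10,000 partitions of ~10 records each
    val records = sc.textFile("data.txt").map(_.split(",").map(_.toDouble))
    val batches = records.repartition(10000)

    // treat each partition as one mini batch
    val perBatch = batches.mapPartitions { iter =>
      val batch = iter.toArray                              // materialize the mini batch (~10 records)
      if (batch.isEmpty) Iterator.empty
      else Iterator(batch.map(_.sum).sum / batch.length)    // placeholder "training" step
    }
    perBatch.count()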

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
http://scikit-learn.org/stable/modules/model_persistence.html). These all seem basically equivalent to java serialization to me... Would some helper functions (in, say, mllib.util.modelpersistence or something) make sense to add? On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
Xiangrui Meng wrote (11/05/2014 01:13:40 PM): You can use breeze for local sparse-sparse matrix multiplication and then define an RDD of sub-matrices. From: Xiangrui Meng men...@gmail.com To: Duy Huynh duy.huynh@gmail.com Cc: user u...@spark.incubator.apache.org Date: 11/05
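For the local case Xiangrui mentions, a minimal breeze sketch (assuming the breeze version on the classpath supports the sparse-sparse `*` operator on CSCMatrix, which recent versions bundled with mllib do):

    import breeze.linalg.CSCMatrix

    // build two small sparse matrices from (row, col, value) triples
    val aB = new CSCMatrix.Builder[Double](rows = 3, cols = 3)
    aB.add(0, 0, 1.0); aB.add(2, 1, 4.0)
    val a = aB.result()

    val bB = new CSCMatrix.Builder[Double](rows = 3, cols = 2)
    bB.add(0, 1, 2.0); bB.add(1, 0, 3.0)
    val b = bB.result()

    // local sparse x sparse multiplication, handled entirely by breeze
    val c = a * b
    println(c)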

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
and simpler (using a cluster of machines is more for load balancing / fault tolerance). What is your use case for model serving? On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote: you're right, serialization works

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
There's some work going on in the AMPLab to address this issue. On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote: you're right, serialization works. what is your suggestion on saving a distributed model? so part of the model is in one cluster, and some other parts
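The thread doesn't settle on an API for this. One hedged sketch, assuming the distributed model's parameters live in an RDD of shards (the Shard type here is made up for illustration), is to persist that RDD directly with saveAsObjectFile and read it back with objectFile:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // hypothetical distributed model: one parameter block per shard
    case class Shard(id: Int, weights: Array[Double])

    // writes one java-serialized part file per partition
    def saveModel(shards: RDD[Shard], path: String): Unit =
      shards.saveAsObjectFile(path)

    // reads the shards back as an RDD on any cluster that can see `path`
    def loadModel(sc: SparkContext, path: String): RDD[Shard] =
      sc.objectFile[Shard](path)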

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
these recommendations on a point-by-point basis might not be optimal. There's some work going on in the AMPLab to address this issue. On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote: you're right, serialization works. what is your suggestion on saving a distributed model

Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Duy Huynh
that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Plain old java serialization is
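A minimal sketch of the "plain old java serialization" option, written as generic helpers for any Serializable model object; whether a particular mllib model (e.g. Word2VecModel) serializes cleanly this way depends on the spark version, so treat that part as an assumption:

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

    // write any Serializable model to a local file
    def saveObject[T <: Serializable](model: T, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try out.writeObject(model) finally out.close()
    }

    // read it back later (caller asserts the expected type)
    def loadObject[T](path: String): T = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[T] finally in.close()
    }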

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
distributed. something like CoordinateMatrix.multiply(CoordinateMatrix). thanks xiangrui! On Wed, Nov 5, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote: local matrix-matrix multiplication or distributed? On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote: what is
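CoordinateMatrix had no multiply method at this point; a hand-rolled sketch of distributed sparse x sparse multiplication over the entries RDDs (a join on the shared inner index followed by a sum, not an official mllib API) would look roughly like:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (join, reduceByKey) on older spark
    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    // C = A * B, where A is m x k and B is k x n, both sparse
    def multiply(a: CoordinateMatrix, b: CoordinateMatrix): CoordinateMatrix = {
      val aByCol = a.entries.map(e => (e.j, (e.i, e.value)))   // keyed by A's column index
      val bByRow = b.entries.map(e => (e.i, (e.j, e.value)))   // keyed by B's row index
      val products = aByCol.join(bByRow).map {
        case (_, ((i, av), (j, bv))) => ((i, j), av * bv)
      }
      val entries = products.reduceByKey(_ + _).map { case ((i, j), v) => MatrixEntry(i, j, v) }
      new CoordinateMatrix(entries, a.numRows(), b.numCols())
    }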

Re: Matrix multiplication in spark

2014-11-05 Thread Duy Huynh
ok great. when will this be ready? On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng men...@gmail.com wrote: We are working on distributed block matrices. The main JIRA is at: https://issues.apache.org/jira/browse/SPARK-3434 The goal is to support basic distributed linear algebra, (dense first

Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
in case this won't be available anytime soon in spark, what would be a good way to implement this multiplication feature in spark? On Wed, Nov 5, 2014 at 4:59 AM, Duy Huynh duy.huynh@gmail.com wrote: distributed. something like CoordinateMatrix.multiply(CoordinateMatrix). thanks

Re: object in an rdd: serializable?

2014-10-17 Thread Duy Huynh
interesting. why does case class work for this? thanks boromir! On Thu, Oct 16, 2014 at 10:41 PM, Boromir Widas vcsub...@gmail.com wrote: making it a case class should work. On Thu, Oct 16, 2014 at 8:30 PM, ll duy.huynh@gmail.com wrote: i got an exception complaining about serializable.
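The short answer is that scala case classes extend Serializable automatically (along with getting equals/hashCode/toString), so their instances can be shipped inside closures and shuffled in RDDs. A tiny spark-shell sketch, with made-up types:

    // a plain class is not Serializable, so serializing it during a shuffle
    // or collect can throw NotSerializableException
    class PlainPoint(val x: Double, val y: Double)

    // a case class extends Serializable for free, so this works in RDDs
    case class Point(x: Double, y: Double)

    val points = sc.parallelize(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))
    points.map(p => p.copy(x = p.x * 2)).collect().foreach(println)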

Re: scala: java.net.BindException?

2014-10-16 Thread Duy Huynh
thanks marcelo. i only instantiated sparkcontext once, at the beginning, in this code. the exception was thrown right at the beginning. i also tried to run other programs, which worked fine previously, but now also got the same error. it looks like it put a global block on creating a sparkcontext
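One common cause of that BindException, sketched below under that assumption, is an earlier SparkContext that was never stopped and is still holding the driver/UI ports; stopping it lets the next context bind (the app name and job are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // only one live SparkContext per JVM; a leftover context from a previous
    // run keeps its ports bound, and the next constructor can then fail with
    // java.net.BindException
    val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local[*]"))
    try {
      println(sc.parallelize(1 to 10).count())
    } finally {
      sc.stop()   // release the ports so a future context can bind
    }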

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
thanks ankur. indexedrdd sounds super helpful! a related question, what is the best way to update the values of existing vertices and edges? On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave ankurd...@gmail.com wrote: On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote: hi again.
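A hedged sketch of the usual options for "updating" vertex and edge values in graphx (graphs are immutable, so each call returns a new Graph that reuses the old indices); the attributes here are made-up doubles and sc is the spark-shell context:

    import org.apache.spark.graphx._

    val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 2.0)))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 0.5)))
    val graph = Graph(vertices, edges)

    // bulk-transform every vertex / edge attribute
    val rescaled = graph.mapVertices((id, score) => score * 10.0)
                        .mapEdges(e => e.attr + 1.0)

    // update only a subset of vertices by joining new values against the graph
    val updates = sc.parallelize(Seq((2L, 42.0)))
    val patched = rescaled.joinVertices(updates)((id, oldScore, newScore) => newScore)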

Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
great, thanks! On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave ankurd...@gmail.com wrote: On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com wrote: a related question, what is the best way to update the values of existing vertices and edges? Many of the Graph methods deal

Re: mllib CoordinateMatrix

2014-10-14 Thread Duy Huynh
thanks reza! On Tue, Oct 14, 2014 at 5:02 PM, Reza Zadeh r...@databricks.com wrote: Hello, CoordinateMatrix is in its infancy, and right now is only a placeholder. To get/set the value at (i,j), you should map the entries rdd using the usual rdd map operation, and change the relevant
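A minimal sketch of that entries-RDD approach (hypothetical helpers, not an mllib API): coordinates are 0-based, absent entries of a sparse matrix read as 0.0, and the map only changes entries that already exist, so introducing a brand-new non-zero would need a union.

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    // "set" (i, j) by rewriting the entries RDD into a new CoordinateMatrix
    def setEntry(m: CoordinateMatrix, i: Long, j: Long, v: Double): CoordinateMatrix = {
      val updated = m.entries.map { e =>
        if (e.i == i && e.j == j) MatrixEntry(i, j, v) else e
      }
      new CoordinateMatrix(updated, m.numRows(), m.numCols())
    }

    // "get" (i, j); falls back to 0.0 when no entry is stored there
    def getEntry(m: CoordinateMatrix, i: Long, j: Long): Double =
      m.entries.filter(e => e.i == i && e.j == j)
               .map(_.value).collect().headOption.getOrElse(0.0)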