both.
first, the distributed version is so much slower than python. i tried a
few things like broadcasting variables, replacing Seq with Array, and a few
other little things. they help improve performance, but it's still slower
than the python code.
so, i wrote a local version that's pretty fast; it runs in seconds per iteration.
i love spark and really enjoy writing scala code. but this huge difference
in performance makes it really hard to do any kind of machine learning work.
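For reference, the broadcast-variable optimization mentioned above looks roughly like this. It's only a sketch: `weights` and the loss computation are hypothetical stand-ins for the actual model code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))
    val data = sc.parallelize(1 to 100000)

    val weights = Array.fill(100)(0.01)     // model parameters (Array, not Seq)
    val bcWeights = sc.broadcast(weights)   // shipped to each executor once

    // tasks read bcWeights.value instead of re-serializing `weights`
    // into every task closure
    val loss = data.map(x => x * bcWeights.value.sum).sum()
    println(loss)
    sc.stop()
  }
}
```

The win comes from shipping the (potentially large) parameter array to each executor once per job instead of once per task.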
On Thu, Dec 11, 2014 at 2:18 PM, Duy Huynh duy.huynh@gmail.com wrote:
the dataset i'm working on has about 100,000 records. the batch that we're
training on has a size around 10. can you call repartition(10000) to split it
into 10,000 partitions?
On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
You can just do mapPartitions on the whole RDD, and run the training step on each partition locally.
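A sketch of that approach, assuming a hypothetical `trainOnBatch` standing in for the local training step:

```scala
import org.apache.spark.rdd.RDD

// hypothetical local training step; runs entirely inside one task
def trainOnBatch(batch: Array[Double]): Double = batch.sum

def batchResults(data: RDD[Double]): RDD[Double] =
  // ~100,000 records / 10,000 partitions ≈ one mini-batch of ~10 per task
  data.repartition(10000).mapPartitions { iter =>
    val batch = iter.toArray
    Iterator.single(trainOnBatch(batch))
  }
```

Each task materializes its partition as one mini-batch, so no per-record round trips to the driver are needed.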
http://scikit-learn.org/stable/modules/model_persistence.html). These all
seem basically equivalent to java serialization to me.
Would some helper functions (in, say, mllib.util.modelpersistence or
something) make sense to add?
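A minimal sketch of such a helper, using plain java serialization. The object name `ModelPersistence` is hypothetical; nothing like it exists in mllib at this point.

```scala
import java.io._

// save/load any Serializable model via plain java serialization
object ModelPersistence {
  def save[M <: Serializable](model: M, path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(model) finally out.close()
  }

  def load[M <: Serializable](path: String): M = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[M] finally in.close()
  }
}
```

Any model that is Serializable (e.g. a case class holding the weights) round-trips through save and load.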
On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote:
Xiangrui Meng wrote on 11/05/2014 at 01:13:40 PM: You can use breeze for
local sparse-sparse matrix multiplication and then define an RDD of
sub-matrices.
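The local step can be sketched without breeze too. Here sparse matrices are plain `Map[(row, col), value]`, standing in for breeze's CSCMatrix; this is an illustration of the idea, not the suggested implementation.

```scala
type Sparse = Map[(Int, Int), Double]

// multiply two sparse matrices, touching only nonzero entries
def multiplySparse(a: Sparse, b: Sparse): Sparse = {
  // index B by row so an A entry (i, k) only meets B's row k
  val bByRow = b.groupBy { case ((row, _), _) => row }
  a.toSeq
    .flatMap { case ((i, k), av) =>
      bByRow.getOrElse(k, Map.empty[(Int, Int), Double])
        .map { case ((_, j), bv) => ((i, j), av * bv) }
    }
    .groupBy(_._1)
    .map { case (ij, terms) => ij -> terms.map(_._2).sum }
}
```

In the distributed version each RDD element would hold one sub-matrix, with a local multiply like this applied per pair of blocks.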
From: Xiangrui Meng men...@gmail.com
To: Duy Huynh duy.huynh@gmail.com
Cc: user u...@spark.incubator.apache.org
Date: 11/05/2014
serving from a single machine is faster and simpler
(using a cluster of machines is more for load balancing / fault tolerance).
What is your use case for model serving?
On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote:
what is your suggestion on saving a distributed model? so part of the
model is in one cluster, and some other parts in others.
Serving these recommendations on a point-by-point basis might not be
optimal. There's some work going on in the AMPLab to address this issue.
On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com
wrote:
you're right, serialization works.
what is your suggestion on saving a distributed model?
that works. is there a better way in spark? this seems like the most
common feature for any machine learning work: being able to save your
model after training it and load it later.
On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Plain old java serialization is the simplest way to do it.
distributed. something like CoordinateMatrix.multiply(CoordinateMatrix).
thanks xiangrui!
On Wed, Nov 5, 2014 at 4:24 AM, Xiangrui Meng men...@gmail.com wrote:
local matrix-matrix multiplication or distributed?
On Tue, Nov 4, 2014 at 11:58 PM, ll duy.huynh@gmail.com wrote:
what is the best way to do matrix multiplication in spark?
ok great. when will this be ready?
On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng men...@gmail.com wrote:
We are working on distributed block matrices. The main JIRA is at:
https://issues.apache.org/jira/browse/SPARK-3434
The goal is to support basic distributed linear algebra (dense first,
in case this won't be available anytime soon in spark, what would be a
good way to implement this multiplication feature?
On Wed, Nov 5, 2014 at 4:59 AM, Duy Huynh duy.huynh@gmail.com wrote:
distributed. something like CoordinateMatrix.multiply(CoordinateMatrix).
thanks
interesting. why does case class work for this? thanks boromir!
On Thu, Oct 16, 2014 at 10:41 PM, Boromir Widas vcsub...@gmail.com wrote:
making it a case class should work.
On Thu, Oct 16, 2014 at 8:30 PM, ll duy.huynh@gmail.com wrote:
i got an exception complaining that it wasn't serializable.
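To illustrate why the case class fixes this: the compiler makes every case class extend Serializable automatically, while a plain class has to opt in. A small round-trip sketch (the class names here are made up):

```scala
import java.io._

case class Weights(values: Array[Double])       // Serializable for free
class PlainWeights(val values: Array[Double])   // not Serializable

// serialize and deserialize in memory, like Spark does when shipping closures
def roundTrip[T <: AnyRef](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  new ObjectOutputStream(bytes).writeObject(obj)
  new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    .readObject().asInstanceOf[T]
}
```

`roundTrip(Weights(...))` succeeds, while `roundTrip(new PlainWeights(...))` throws java.io.NotSerializableException, which is the failure Spark surfaces when a closure captures such an object.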
thanks marcelo. i only instantiated sparkcontext once, at the beginning,
in this code. the exception was thrown right at the beginning.
i also tried to run other programs, which worked fine previously, but now
also got the same error.
it looks like something is globally blocking sparkcontext creation.
thanks ankur. indexedrdd sounds super helpful!
a related question, what is the best way to update the values of existing
vertices and edges?
On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave ankurd...@gmail.com wrote:
On Tue, Oct 14, 2014 at 12:36 PM, ll duy.huynh@gmail.com wrote:
hi again.
great, thanks!
On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave ankurd...@gmail.com wrote:
On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh duy.huynh@gmail.com
wrote:
a related question, what is the best way to update the values of existing
vertices and edges?
Many of the Graph methods deal with transforming vertex and edge
attributes; see mapVertices and mapEdges.
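As a sketch of that idea: a GraphX Graph is immutable, so "updating" values means producing a new graph. joinVertices and mapEdges are real GraphX methods; the attribute types and helper below are hypothetical.

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// merge per-vertex deltas into a graph with Double vertex attributes
def applyUpdates(graph: Graph[Double, Int],
                 deltas: RDD[(VertexId, Double)]): Graph[Double, Int] =
  graph
    .joinVertices(deltas) { (_, oldValue, delta) => oldValue + delta }
    .mapEdges(e => e.attr + 1)  // transform every edge attribute as well
```

Vertices without a matching entry in `deltas` keep their old value, which is what joinVertices does by design.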
thanks reza!
On Tue, Oct 14, 2014 at 5:02 PM, Reza Zadeh r...@databricks.com wrote:
Hello,
CoordinateMatrix is in its infancy, and right now is only a placeholder.
To get/set the value at (i,j), you should map the entries rdd using the
usual rdd map operation, and change the relevant entries.
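A sketch of that pattern (CoordinateMatrix and MatrixEntry are real mllib classes; `setEntry` is a made-up helper):

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// rebuild the matrix with the value at (i, j) replaced by v
def setEntry(m: CoordinateMatrix, i: Long, j: Long, v: Double): CoordinateMatrix = {
  val updated = m.entries.map {
    case MatrixEntry(`i`, `j`, _) => MatrixEntry(i, j, v)  // replace target cell
    case e                        => e                     // keep everything else
  }
  new CoordinateMatrix(updated, m.numRows(), m.numCols())
}
```

Note this only rewrites an entry that already exists; setting an implicit zero at (i, j) would need a union with a new MatrixEntry.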