yep, but that only works if they are already represented as RDDs, which is
much more convenient for saving and loading.

my question is for the use case that they are not represented as RDDs yet.

in that case, do you think it makes sense to convert them into RDDs, just for
the convenience of saving and loading them in a distributed way?
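
for reference, the non-RDD route discussed earlier in this thread is plain
java serialization. below is a minimal sketch of that round trip. everything
named here is an assumption for illustration: ToyModel is a hypothetical
stand-in for a trained model, and ModelPersistence, save and load are made-up
helper names, not mllib APIs. it only applies if the real model class is
serializable.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

// Hypothetical helper for illustration only; not part of MLlib.
class ModelPersistence {

    // Toy stand-in for a trained model: maps each word to its vector.
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final HashMap<String, float[]> vectors = new HashMap<>();
    }

    // Writes any serializable object to a local file.
    static void save(Serializable model, Path path) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(Files.newOutputStream(path))) {
            out.writeObject(model);
        }
    }

    // Reads the object back; the caller picks the expected type.
    @SuppressWarnings("unchecked")
    static <T> T load(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(Files.newInputStream(path))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyModel model = new ToyModel();
        model.vectors.put("spark", new float[] {0.1f, 0.2f});

        Path path = Files.createTempFile("model", ".bin");
        save(model, path);

        ToyModel restored = load(path);
        System.out.println(restored.vectors.get("spark").length);
        Files.delete(path);
    }
}
```

this writes to a single local file, so it fits models small enough to live on
the driver. it does not give the distributed save/load that RDDs offer, which
is the trade-off i am asking about.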

On Fri, Nov 7, 2014 at 12:36 PM, Evan R. Sparks <evan.spa...@gmail.com>
wrote:

> There are a few examples where this is the case. Let's take ALS, where the
> result is a MatrixFactorizationModel, which is assumed to be big - the
> model consists of two matrices, one (users x k) and one (k x products).
> These are represented as RDDs.
>
> You can save these RDDs out to disk by doing something like
>
> model.userFeatures.saveAsObjectFile(...) and
> model.productFeatures.saveAsObjectFile(...)
>
> to save out to HDFS or Tachyon or S3.
>
> Then, when you want to reload, you'd have to read the two RDDs back and use
> them to instantiate a MatrixFactorizationModel. That class is package
> private to MLlib right now, so you'd need to copy the logic over to a new
> class, but that's the basic idea.
>
> That said - using Spark to serve these recommendations on a point-by-point
> basis might not be optimal. There's some work going on in the AMPLab to
> address this issue.
>
> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh <duy.huynh....@gmail.com> wrote:
>
>> you're right, serialization works.
>>
>> what is your suggestion on saving a "distributed" model?  for example, part
>> of the model is in one cluster, and other parts of the model are in other
>> clusters.  at runtime, these sub-models run independently in their own
>> clusters (load, train, save).  at some point during runtime they merge
>> into the master model, which also loads, trains, and saves at the master
>> level.
>>
>> much appreciated.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.spa...@gmail.com>
>> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>> been merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with MATLAB. In Python either pickling things or
>>> dumping to JSON seems pretty common. (Even the scikit-learn docs recommend
>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html.)
>>> These all seem basically equivalent to Java serialization to me.
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh....@gmail.com>
>>> wrote:
>>>
>>>> that works.  is there a better way in spark?  this seems like the most
>>>> common feature for any machine learning work - to be able to save your
>>>> model after training it and load it later.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.spa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Plain old Java serialization is one straightforward approach if you're
>>>>> in Java/Scala.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh....@gmail.com> wrote:
>>>>>
>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>> reload it in the future?  specifically, i'm using the mllib word2vec
>>>>>> model... thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>
>>>>
>>>
>>
>
