Just to elaborate on Duy's suggestion to use PredictionIO: PredictionIO will store the model automatically if you return it from the training function. An example using collaborative filtering (CF) with ALS:
    def train(data: PreparedData): PersistentMatrixFactorizationModel = {
      val m = ALS.train(data.ratings, ap.rank, ap.numIterations, ap.lambda)
      new PersistentMatrixFactorizationModel(
        rank = m.rank,
        userFeatures = m.userFeatures,
        productFeatures = m.productFeatures)
    }

And the persisted model will be passed to the predict function when you query for prediction:

    def predict(
        model: PersistentMatrixFactorizationModel,
        query: Query): PredictedResult = {
      val productScores = model.recommendProducts(query.user, query.num)
        .map(r => ProductScore(r.product, r.rating))
      new PredictedResult(productScores)
    }
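For reference, Query, ProductScore, and PredictedResult above are engine-specific classes that you define yourself in the engine. A minimal sketch of what they might look like - the field names and types here are inferred from the snippet above (MLlib's Rating carries Int user/product IDs and a Double rating), not taken from an actual template:

    // Hypothetical definitions; the actual template's classes may differ.
    case class Query(user: Int, num: Int) extends Serializable

    case class ProductScore(product: Int, score: Double) extends Serializable

    case class PredictedResult(productScores: Array[ProductScore]) extends Serializable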
Some templates and tutorials for MLlib are here:
http://docs.prediction.io/0.8.1/templates/

Simon

On Fri, Nov 7, 2014 at 10:11 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Sure - in theory this sounds great. But in practice it's much faster and
> a whole lot simpler to serve the model from a single instance in memory.
> Optionally you can multithread within that (as Oryx 1 does).
>
> There are very few real-world use cases where the model is so large that
> it HAS to be distributed.
>
> Having said this, it's certainly possible to distribute model serving
> for factor-like models (like ALS). One idea I'm working on now is using
> Elasticsearch for exactly this purpose - but that's more because I'm
> using it for filtering of recommendation results and combining with
> search, so overall it's faster to do it this way.
>
> For the pure matrix algebra part, a single instance in memory is way
> faster.
>
> On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh <duy.huynh....@gmail.com> wrote:
>
>> Hi Nick, sorry about the confusion. Originally I had a question
>> specifically about word2vec, but my follow-up question on distributed
>> models is a more general question about saving different types of
>> models.
>>
>> On distributed models, I was hoping to implement model parallelism, so
>> that different workers can work on different parts of the model and
>> then merge the results at the end into the single master model.
>>
>> On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath
>> <nick.pentre...@gmail.com> wrote:
>>
>>> Currently I see the word2vec model is collected onto the master, so
>>> the model itself is not distributed.
>>>
>>> I guess the question is: why do you need a distributed model? Is the
>>> vocab size so large that it's necessary? For model serving in general,
>>> unless the model is truly massive (i.e. it cannot fit into memory on a
>>> modern high-end box with 64 or 128 GB of RAM), a single instance is
>>> way faster and simpler (using a cluster of machines is more for load
>>> balancing / fault tolerance).
>>>
>>> What is your use case for model serving?
>>>
>>> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh <duy.huynh....@gmail.com>
>>> wrote:
>>>
>>>> You're right, serialization works.
>>>>
>>>> What is your suggestion on saving a "distributed" model, where part
>>>> of the model is in one cluster and other parts are in other clusters?
>>>> During runtime, these sub-models run independently in their own
>>>> clusters (load, train, save), and at some point the sub-models merge
>>>> into the master model, which also loads, trains, and saves at the
>>>> master level.
>>>>
>>>> Much appreciated.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks
>>>> <evan.spa...@gmail.com> wrote:
>>>>
>>>>> There's some work going on to support PMML -
>>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it has not
>>>>> yet been merged into master.
>>>>>
>>>>> What are you used to doing in other environments? In R I'm used to
>>>>> running save(); same with MATLAB. In Python, either pickling things
>>>>> or dumping to JSON seems pretty common (even the scikit-learn docs
>>>>> recommend pickling -
>>>>> http://scikit-learn.org/stable/modules/model_persistence.html).
>>>>> These all seem basically equivalent to Java serialization to me.
>>>>>
>>>>> Would some helper functions (in, say, mllib.util.modelpersistence
>>>>> or something) make sense to add?
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh
>>>>> <duy.huynh....@gmail.com> wrote:
>>>>>
>>>>>> That works. Is there a better way in Spark? This seems like the
>>>>>> most common feature for any machine learning work - being able to
>>>>>> save your model after training it and load it later.
>>>>>>
>>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks
>>>>>> <evan.spa...@gmail.com> wrote:
>>>>>>
>>>>>>> Plain old Java serialization is one straightforward approach if
>>>>>>> you're in Java/Scala.
>>>>>>>
>>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> What is the best way to save an MLlib model that you just
>>>>>>>> trained and reload it in the future? Specifically, I'm using the
>>>>>>>> MLlib word2vec model... Thanks.
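Following up on Evan's "plain old Java serialization" suggestion, a minimal sketch of save/load helpers (the ModelIO object, its method names, and the paths are illustrative, not an existing MLlib API):

    import java.io._

    object ModelIO {

      // Write any Serializable model to disk using plain Java serialization.
      // This works for small, local models such as Word2VecModel; models that
      // wrap RDDs (e.g. MatrixFactorizationModel's userFeatures and
      // productFeatures) need their RDDs saved separately, for example with
      // RDD.saveAsObjectFile.
      def save[M <: Serializable](model: M, path: String): Unit = {
        val out = new ObjectOutputStream(new FileOutputStream(path))
        try out.writeObject(model) finally out.close()
      }

      // Read a model back and cast it to the expected type.
      def load[M](path: String): M = {
        val in = new ObjectInputStream(new FileInputStream(path))
        try in.readObject().asInstanceOf[M] finally in.close()
      }
    }

    // Usage with the word2vec model from the original question:
    //   ModelIO.save(word2vecModel, "/tmp/word2vec.bin")
    //   val restored = ModelIO.load[Word2VecModel]("/tmp/word2vec.bin")

One caveat: Java serialization ties the saved bytes to the exact class versions, so a model saved under one Spark/Scala version may not load under another; the PMML work in SPARK-1406 is aimed at a more portable format.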