Please comment in the JIRA/SPIP if you are interested! That way we can gauge the community support for a proposal like this.
________________________________
From: Pola Yao <pola....@gmail.com>
Sent: Wednesday, January 23, 2019 8:01 AM
To: Riccardo Ferrari
Cc: Felix Cheung; User
Subject: Re: I have trained a ML model, now what?

Hi Riccardo,

Right now, Spark does not support low-latency predictions in production. MLeap is an alternative, and it has been used in many scenarios. But it's good to see that the Spark community has decided to provide such support.

On Wed, Jan 23, 2019 at 7:53 AM Riccardo Ferrari <ferra...@gmail.com> wrote:

Felix, thank you very much for the link. Much appreciated.

The attached PDF is very interesting; I found myself evaluating many of the scenarios described in Q3. It's unfortunate the proposal is not being worked on; it would be great to see that as part of the code base.

It is cool to see big players like Uber trying to make open source better. Thanks!

On Tue, Jan 22, 2019 at 5:24 PM Felix Cheung <felixcheun...@hotmail.com> wrote:

About the deployment/serving SPIP: https://issues.apache.org/jira/browse/SPARK-26247

________________________________
From: Riccardo Ferrari <ferra...@gmail.com>
Sent: Tuesday, January 22, 2019 8:07 AM
To: User
Subject: I have trained a ML model, now what?

Hi list!

I am writing to hear about your experience putting Spark ML models into production at scale. I know it is a very broad topic with many different faces depending on the use case, requirements, user base, and whatever else is involved in the task. Still, I'd like to open a thread about this topic, which is as important as properly training a model and, I feel, is often neglected.

The task is serving web users with predictions, and the main challenge I see is making it agile and swift. I think there are mainly 3 general categories of such deployments:

* Offline/batch: Load a model, perform the inference, and store the results in some datastore (DB, indexes, ...).
* Spark in the loop: Keep a long-running Spark context exposed in some way; this includes streaming as well as custom applications that wrap the context.
* Use a different technology to load the Spark MLlib model and run the inference pipeline. I have read about MLeap and other PMML-based solutions.

I would love to hear about open-source solutions, ideally ones that do not require cloud-provider-specific frameworks/components. Again, I am aware each of the previous categories has benefits and drawbacks, so what would you pick? Why? And how?

Thanks!