I was thinking about working on a better version of PMML, a JSON-based
"JMML", but as you said, this requires a dedicated team to define the
standard, which would be a huge amount of work. However, options (b) and
(c) still don't address the distributed-models issue. In fact, most
models in production have to be small enough to return results to users
within a reasonable latency, so I doubt the usefulness of distributed
models in real production use cases. For R and Python, we could build
wrappers on top of the lightweight "spark-ml-common" project.
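To make the idea concrete, here is a minimal, purely illustrative sketch of what a JSON-based model document and a Spark-free scorer could look like. The schema, the field names, and the `score` helper are all hypothetical, not part of any existing standard; a real "JMML" would need a proper spec and a much richer vocabulary of models and transformers.

```python
import json
import math

# Hypothetical "JMML"-style document for a single logistic regression
# stage. The schema is invented here for illustration only.
model_json = """
{
  "modelType": "logistic_regression",
  "intercept": -1.0,
  "coefficients": [0.5, -0.25, 2.0]
}
"""

def score(model: dict, features: list) -> float:
    """Score one feature vector in plain Python, no Spark runtime needed."""
    margin = model["intercept"] + sum(
        w * x for w, x in zip(model["coefficients"], features)
    )
    return 1.0 / (1.0 + math.exp(-margin))

model = json.loads(model_json)
print(score(model, [1.0, 2.0, 0.5]))  # margin = 0.0, so this prints 0.5
```

The point being that the serving side would only need a JSON parser and basic arithmetic, not the whole Spark dependency tree.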


Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> I think the issue with pulling in all of spark-core is often with
> dependencies (and versions) conflicting with the web framework (or Akka in
> many cases). Plus it really is quite heavy if you just want a fairly
> lightweight model-serving app. For example we've built a fairly simple but
> scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
> really need is the web framework and Breeze (or an alternative linear
> algebra lib).
>
> I definitely hear the pain-point that PMML might not be able to handle
> some types of transformations or models that exist in Spark. However,
> here's an example from scikit-learn -> PMML that may be instructive (
> https://github.com/scikit-learn/scikit-learn/issues/1596 and
> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list
> of estimators and transformers are supported (including e.g. scaling and
> encoding, and PCA).
>
> I definitely think the current model I/O and "export" or "deploy to
> production" situation needs to be improved substantially. However, you are
> left with the following options:
>
> (a) build out a lightweight "spark-ml-common" project that brings in the
> dependencies needed for production scoring / transformation in independent
> apps. However, here you only support Scala/Java - what about R and Python?
> Also, what about the distributed models? Perhaps "local" wrappers can be
> created, though this may not work for very large factor or LDA models. See
> also the H2O example: http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>
> (b) build out Spark's PMML support, and add missing stuff to PMML where
> possible. The benefit here is an existing standard with various tools for
> scoring (via REST server, Java app, Pig, Hive, various language support).
>
> (c) build out a more comprehensive I/O, serialization and scoring
> framework. Here you face the issue of supporting various predictors and
> transformers generically, across platforms and versioning. i.e. you're
> re-creating a new standard like PMML.
>
> Option (a) is do-able, but I'm a bit concerned that it may be too "Spark
> specific", or even too "Scala / Java" specific. But it is still potentially
> very useful to Spark users to build this out and have a somewhat standard
> production serving framework and/or library (there are obviously existing
> options like PredictionIO etc).
>
> Option (b) is really building out the existing PMML support within Spark,
> so a lot of the initial work has already been done. I know some folks had
> (or have) licensing issues with some components of JPMML (e.g. the
> evaluator and REST server). But perhaps the solution here is to build an
> Apache2-licensed evaluator framework.
>
> Option (c) is obviously interesting - "let's build a better PMML (that
> uses JSON or whatever instead of XML!)". But it also seems like a huge
> amount of reinventing the wheel, and like any new standard would take time
> to garner wide support (if at all).
>
> It would be really useful to start to understand what the main missing
> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
> contribute improvements or additions to PMML.
>
>
>
> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> That may not be an issue if the app using the models runs by itself (not
>> bundled into an existing app), which may actually be the right way to
>> design it considering separation of concerns.
>>
>> Regards
>> Sab
>>
>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>>> That would bring in all of Spark's dependencies, which may break the
>>> web app.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>>
>>>>> I agree 100%. Making the model requires large data and many CPUs.
>>>>>
>>>>> Using it does not.
>>>>>
>>>>> This is a very useful side effect of ML models.
>>>>>
>>>>> If MLlib can't use models outside Spark, that's a real shame.
>>>>>
>>>>
>>>> Well, you can, as mentioned earlier. You don't need the Spark runtime
>>>> for predictions: save the serialized model and deserialize it when you
>>>> want to use it (you do need the Spark jars on the classpath, though).
>>>>
>>>>>
>>>>>
>>>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>>>
>>>>>
>>>>> -------- Original message --------
>>>>> From: "Kothuvatiparambil, Viju" <
>>>>> viju.kothuvatiparam...@bankofamerica.com>
>>>>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>>>>> To: DB Tsai <dbt...@dbtsai.com>, Sean Owen <so...@cloudera.com>
>>>>> Cc: Felix Cheung <felixcheun...@hotmail.com>, Nirmal Fernando <
>>>>> nir...@wso2.com>, Andy Davidson <a...@santacruzintegration.com>,
>>>>> Adrian Tanase <atan...@adobe.com>, "user @spark" <
>>>>> user@spark.apache.org>, Xiangrui Meng <men...@gmail.com>,
>>>>> hol...@pigscanfly.ca
>>>>> Subject: RE: thought experiment: use spark ML to real time prediction
>>>>>
>>>>> I am glad to see DB’s comments; they make me feel I am not the only
>>>>> one facing these issues. If we were able to use MLlib to load the
>>>>> model in web applications (outside the Spark cluster), that would
>>>>> solve the issue. I understand Spark is mainly for processing big data
>>>>> in a distributed mode. But there is no purpose in training a model
>>>>> using MLlib if we are not able to use it in the applications that
>>>>> need to access the model.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Viju
>>>>>
>>>>>
>>>>>
>>>>> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
>>>>> *Sent:* Thursday, November 12, 2015 11:04 AM
>>>>> *To:* Sean Owen
>>>>> *Cc:* Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase;
>>>>> user @spark; Xiangrui Meng; hol...@pigscanfly.ca
>>>>> *Subject:* Re: thought experiment: use spark ML to real time
>>>>> prediction
>>>>>
>>>>>
>>>>>
>>>>> I think the use-case can be quite different from PMML.
>>>>>
>>>>>
>>>>>
>>>>> By having a Spark platform independent ML jar, this can empower users
>>>>> to do the following,
>>>>>
>>>>>
>>>>>
>>>>> 1) PMML doesn't cover all the models we have in MLlib. Also, for an
>>>>> ML pipeline trained by Spark, most of the time PMML is not expressive
>>>>> enough to describe all the transformations we have in Spark ML. As a
>>>>> result, if we were able to serialize the entire Spark ML pipeline
>>>>> after training, and then load it back into an app without any Spark
>>>>> platform for production scoring, this would be very useful for
>>>>> production deployment of Spark ML models. The only issue is that if a
>>>>> transformer involves a shuffle, we need to figure out a way to handle
>>>>> it. When I chatted with Xiangrui about this, he suggested that we
>>>>> could tag whether a transformer is shuffle-ready. Currently, at
>>>>> Netflix, we are not able to use ML pipelines because of those issues,
>>>>> and we have to write our own scorers for production, which is quite a
>>>>> lot of duplicated work.
>>>>>
>>>>>
>>>>>
>>>>> 2) If users could use Spark's linear algebra code, such as the vector
>>>>> and matrix classes, in their applications, that would be very useful.
>>>>> It would help share code between the Spark training pipeline and the
>>>>> production deployment. Also, a lot of the good stuff in Spark's MLlib
>>>>> doesn't depend on the Spark platform, and people could use it in
>>>>> their applications without pulling in lots of dependencies. In fact,
>>>>> in my own project, I have to copy and paste code from MLlib to use
>>>>> those goodies in apps.
>>>>>
>>>>>
>>>>>
>>>>> 3) Currently, MLlib depends on GraphX, which means that in GraphX
>>>>> there is no way to use MLlib's vectors or matrices. And
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Team Lead - WSO2 Machine Learner
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Architect - Big Data
>> Ph: +91 99805 99458
>>
>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>> Sullivan India ICT)*
>> +++
>>
>
>
