Re: thought experiment: use spark ML to real time prediction

2015-11-27 Thread Nick Pentreath
Yup, I agree that Spark (or whatever other ML system) should be focused on
model training rather than real-time scoring. And yes, in most cases
trained models easily fit on a single machine. I also agree that, while
there may be a few use cases out there, Spark Streaming is generally not
well-suited for real-time model scoring. It can be nicely suited for near
real-time model training / updating however.
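As an illustration of the near-real-time training/updating idea (a toy sketch, not Spark Streaming's actual API; the mini-batch format and learning rate are assumptions), updating a linear model with one SGD step per arriving mini-batch might look like:

```python
# Toy sketch of near-real-time model updating: apply one gradient step per
# incoming mini-batch, as a streaming training job might. Not Spark
# Streaming's API; data format and learning rate are illustrative.

def sgd_update(weights, batch, lr=0.1):
    """One least-squares gradient step over a mini-batch of (features, label)."""
    n = len(batch)
    grad = [0.0] * len(weights)
    for features, label in batch:
        pred = sum(w * x for w, x in zip(weights, features))
        err = pred - label
        for j, x in enumerate(features):
            grad[j] += err * x / n
    return [w - lr * g for w, g in zip(weights, grad)]

# Simulate a stream of mini-batches drawn from the target y = 2 * x.
weights = [0.0]
for _ in range(200):
    batch = [([1.0], 2.0), ([2.0], 4.0)]
    weights = sgd_update(weights, batch)

print(round(weights[0], 2))  # converges toward 2.0
```

The serving side would periodically pick up the latest weights rather than retraining per request.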

The thing about models is, once they're built, they are quite standard -
hence standards for I/O and scoring such as PMML. They should ideally also
be completely portable across languages and frameworks - a model trained in
Spark should be usable in a JVM web server, a Python app, a JavaScript AWS
lambda function, etc etc.

The challenge is actually not really "prediction" - which is usually a
simple dot product or matrix operation (or tree walk, or whatever), easily
handled by whatever linear algebra library you are using. It is instead
encapsulating the entire pipeline from raw(-ish) data through
transformations to predictions. As well as versioning, performance
monitoring and online evaluation, A/B testing etc etc.
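To make that point concrete, here is a minimal sketch (all names and transformation steps are invented for illustration) of why the pipeline, not the dot product, is the hard part: the scorer must carry every fitted transformation along with the model weights.

```python
import math

# Toy sketch: the "prediction" is trivially a dot product, but correct
# scoring requires replaying the exact fitted pipeline (here: a log
# transform followed by standardization) on raw input. Illustrative only.

class Pipeline:
    def __init__(self, means, stds, weights, intercept):
        self.means, self.stds = means, stds        # fitted scaler state
        self.weights, self.intercept = weights, intercept

    def transform(self, raw):
        logged = [math.log1p(x) for x in raw]
        return [(x - m) / s for x, m, s in zip(logged, self.means, self.stds)]

    def predict(self, raw):
        z = self.transform(raw)
        # The model itself is just a dot product plus an intercept.
        return self.intercept + sum(w * x for w, x in zip(self.weights, z))

pipe = Pipeline(means=[1.0, 2.0], stds=[0.5, 0.5],
                weights=[3.0, -1.0], intercept=0.25)
print(pipe.predict([1.0, 6.0]))
```

Dropping or reordering any fitted step silently changes predictions, which is why the whole pipeline, not just the weights, has to be exported.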

I guess the point I'm trying to make is that, while it's certainly possible
to create "non-Spark" usable models (e.g. using a spark-ml-common library
or whatever), this only solves a portion of the problem. Now, it may be a
good idea to solve that portion of the problem and leave the rest for
users' own implementations to suit their needs. But I think there is a big
missing piece here that needs to be filled in by the Spark, and broader ML,
community.

PMML and related projects such as OpenScoring, or projects like
PredictionIO, seek to solve the problem. PFA seems like a very interesting
potential solution, but it is very young still.

So the question to me is - what is the most efficient way to solve the
problem? I guess for now it may be either something like "spark-ml-common",
or extending PMML support (or both). Perhaps in the future something like
PFA.

It would be interesting to hear more user experiences and what they are
using for serving architectures, how they are handling model
import/export/deployment, etc.

On Sun, Nov 22, 2015 at 8:33 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Nick
>
> I started this thread. IMHO we need something like spark to train our
> models. The resulting models are typically small enough to fit easily on a
> single machine. My real time production system is not built on spark. The
> real time system needs to use the model to make predictions in real time.
>
>
> Use case: “high-frequency stock trading”. Use Spark to train a model.
> There is no way I could use Spark Streaming in the real-time production
> system. I need some way to easily move the model trained in Spark to a
> non-Spark environment so I can make predictions in real time.
>
> “credit card Fraud detection” is another similar use case.
>
> Kind regards
>
> Andy
>
>
>
>
> From: Nick Pentreath <nick.pentre...@gmail.com>
> Date: Wednesday, November 18, 2015 at 4:03 AM
> To: DB Tsai <dbt...@dbtsai.com>
> Cc: "user @spark" <user@spark.apache.org>
>
> Subject: Re: thought experiment: use spark ML to real time prediction
>
> One such "lightweight PMML in JSON" is here -
> https://github.com/bigmlcom/json-pml. At least for the schema
> definitions. But nothing available in terms of evaluation/scoring. Perhaps
> this is something that can form a basis for such a new undertaking.
>
> I agree that distributed models are only really applicable in the case of
> massive scale factor models - and then anyway for latency purposes one
> needs to use LSH or something similar to achieve sufficiently real-time
> performance. These days one can easily spin up a single very powerful
> server to handle even very large models.
>
> On Tue, Nov 17, 2015 at 11:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>
>> I was thinking about working on a better version of PMML, JMML in JSON, but
>> as you said, this requires a dedicated team to define the standard, which
>> will be a huge amount of work. However, options b) and c) still don't
>> address the distributed-models issue. In fact, most models in production
>> have to be small enough to return results to users within reasonable
>> latency, so I doubt the usefulness of distributed models in real production
>> use cases. For R and Python, we can build a wrapper on top of the
>> lightweight "spark-ml-common" project.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath


Re: thought experiment: use spark ML to real time prediction

2015-11-22 Thread Vincenzo Selvaggio

Re: thought experiment: use spark ML to real time prediction

2015-11-22 Thread Andy Davidson
Hi Nick

I started this thread. IMHO we need something like spark to train our
models. The resulting models are typically small enough to fit easily on a
single machine. My real time production system is not built on spark. The
real time system needs to use the model to make predictions in real time.


Use case: “high-frequency stock trading”. Use Spark to train a model.
There is no way I could use Spark Streaming in the real-time production
system. I need some way to easily move the model trained in Spark to a
non-Spark environment so I can make predictions in real time.

“Credit card fraud detection” is another similar use case.
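One common way to do the hand-off Andy describes (a hedged sketch; the JSON layout and field names are invented for illustration, not any Spark export format) is to dump just the fitted parameters and rebuild a tiny scorer in the real-time system:

```python
import json
import math

# Hypothetical hand-off: the training side exports fitted coefficients to
# JSON; the real-time side needs only this blob and a few lines of code,
# with no Spark dependency. The schema below is made up for illustration.

exported = json.dumps({
    "model_type": "logistic_regression",
    "intercept": -1.0,
    "coefficients": [0.8, -0.3, 1.5],
})

def load_scorer(blob):
    m = json.loads(blob)
    b, w = m["intercept"], m["coefficients"]
    def score(features):
        z = b + sum(wi * xi for wi, xi in zip(w, features))
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> fraud probability
    return score

score = load_scorer(exported)
print(score([1.0, 2.0, 0.5]))  # a probability in (0, 1)
```

The same JSON file can back a JVM, Python, or JavaScript scorer, which is the portability Andy is asking for.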

Kind regards

Andy





Re: thought experiment: use spark ML to real time prediction

2015-11-18 Thread Nick Pentreath


Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread Nick Pentreath
I think the issue with pulling in all of spark-core is often with
dependencies (and versions) conflicting with the web framework (or Akka in
many cases). Plus it really is quite heavy if you just want a fairly
lightweight model-serving app. For example we've built a fairly simple but
scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
really need is the web framework and Breeze (or an alternative linear
algebra lib).
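The factor-model server described above boils down to something like the following (a toy sketch in plain Python rather than Scalatra/Akka/Breeze; the factor values are made up):

```python
# Toy sketch of ALS factor-model serving: recommendation scores are just
# dot products between a user's factor vector and each item's factors,
# followed by a top-k sort. Factor values here are made up for illustration.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k(user_factors, item_factors, k=2):
    scores = [(item, dot(user_factors, f)) for item, f in item_factors.items()]
    scores.sort(key=lambda s: s[1], reverse=True)
    return scores[:k]

items = {
    "item_a": [0.9, 0.1],
    "item_b": [0.2, 0.8],
    "item_c": [0.5, 0.5],
}
user = [1.0, 0.0]  # this user aligns with the first latent dimension

print(top_k(user, items))
```

At very large item counts, the exhaustive scan is what gets replaced by approximate search such as LSH, as mentioned elsewhere in the thread.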

I definitely hear the pain-point that PMML might not be able to handle some
types of transformations or models that exist in Spark. However, here's an
example from scikit-learn -> PMML that may be instructive (
https://github.com/scikit-learn/scikit-learn/issues/1596 and
https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list of
estimators and transformers are supported (including e.g. scaling and
encoding, and PCA).

I definitely think the current model I/O and "export" or "deploy to
production" situation needs to be improved substantially. However, you are
left with the following options:

(a) build out a lightweight "spark-ml-common" project that brings in the
dependencies needed for production scoring / transformation in independent
apps. However, here you only support Scala/Java - what about R and Python?
Also, what about the distributed models? Perhaps "local" wrappers can be
created, though this may not work for very large factor or LDA models. See
also the H2O POJO example: http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html

(b) build out Spark's PMML support, and add missing stuff to PMML where
possible. The benefit here is an existing standard with various tools for
scoring (via REST server, Java app, Pig, Hive, various language support).
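To give a feel for what PMML-style scoring involves, here is a deliberately simplified, hand-rolled sketch: the XML is a cut-down PMML-like RegressionModel fragment, not a conformant PMML document, and real evaluators such as JPMML handle far more (schemas, namespaces, many model types).

```python
import xml.etree.ElementTree as ET

# Simplified PMML-like document for a linear regression. A cut-down
# illustration of the idea only; real PMML evaluators (e.g. JPMML)
# validate schemas and support many model and transformation types.
DOC = """
<RegressionModel functionName="regression">
  <RegressionTable intercept="1.5">
    <NumericPredictor name="x1" coefficient="2.0"/>
    <NumericPredictor name="x2" coefficient="-0.5"/>
  </RegressionTable>
</RegressionModel>
"""

def evaluate(doc, fields):
    table = ET.fromstring(doc).find("RegressionTable")
    result = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        result += float(pred.get("coefficient")) * fields[pred.get("name")]
    return result

print(evaluate(DOC, {"x1": 3.0, "x2": 4.0}))  # 1.5 + 6.0 - 2.0 = 5.5
```

The appeal of option (b) is that the document, not the code, is the artifact: any language with an evaluator can score it.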

(c) build out a more comprehensive I/O, serialization and scoring
framework. Here you face the issue of supporting various predictors and
transformers generically, across platforms and versioning; i.e., you're
re-creating a new standard like PMML.

Option (a) is do-able, but I'm a bit concerned that it may be too "Spark
specific", or even too "Scala / Java" specific. But it is still potentially
very useful to Spark users to build this out and have a somewhat standard
production serving framework and/or library (there are obviously existing
options like PredictionIO etc).

Option (b) is really building out the existing PMML support within Spark,
so a lot of the initial work has already been done. I know some folks had
(or have) licensing issues with some components of JPMML (e.g. the
evaluator and REST server). But perhaps the solution here is to build an
Apache2-licensed evaluator framework.

Option (c) is obviously interesting - "let's build a better PMML (that uses
JSON or whatever instead of XML!)". But it also seems like a huge amount of
reinventing the wheel, and like any new standard would take time to garner
wide support (if at all).

It would be really useful to start to understand what the main missing
pieces are in PMML - perhaps the lowest-hanging fruit is simply to
contribute improvements or additions to PMML.



On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> That may not be an issue if the app using the models runs by itself (not
> bundled into an existing app), which may actually be the right way to
> design it considering separation of concerns.
>
> Regards
> Sab
>
> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai <dbt...@dbtsai.com> wrote:
>
>> This will bring in the whole dependency tree of Spark, which may break the
>> web app.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>>
>>>
>>>
>>> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>>>
>>>> I agree 100%. Making the model requires large data and many cpus.
>>>>
>>>> Using it does not.
>>>>
>>>> This is a very useful side effect of ML models.
>>>>
>>>> If MLlib models can't be used outside Spark, that's a real shame.
>>>>
>>>
>>> Well, you can, as mentioned earlier. You don't need the Spark runtime for
>>> predictions: save the serialized model and deserialize it to use. (You do
>>> need the Spark jars on the classpath, though.)
>>>

Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread DB Tsai

RE: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Kothuvatiparambil, Viju
I am glad to see DB’s comments; they make me feel I am not the only one facing
these issues. If we were able to use MLlib to load the model in web applications
(outside the Spark cluster), that would have solved the issue. I understand
Spark is mainly for processing big data in a distributed mode. But there is no
purpose in training a model using MLlib if we are not able to use it in the
applications that need to access the model.

Thanks
Viju

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, November 12, 2015 11:04 AM
To: Sean Owen
Cc: Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user @spark; 
Xiangrui Meng; hol...@pigscanfly.ca
Subject: Re: thought experiment: use spark ML to real time prediction

I think the use case can be quite different from PMML.

By having a Spark-platform-independent ML jar, we can empower users to do the
following:

1) PMML doesn't contain all the models we have in MLlib. Also, for an ML
pipeline trained by Spark, most of the time PMML is not expressive enough to
express all the transformations we have in Spark ML. As a result, if we were
able to serialize the entire Spark ML pipeline after training, and then load
it back in an app without any Spark platform for production scoring, this
would be very useful for production deployment of Spark ML models. The only
issue is that if a transformer involves a shuffle, we need to figure out a way
to handle it. When I chatted with Xiangrui about this, he suggested that we
may tag whether a transformer is shuffle-ready. Currently, at Netflix, we are
not able to use ML pipelines because of those issues, and we have to write our
own scorers for production, which is quite a lot of duplicated work.

2) If users can use Spark's linear algebra like vector or matrix code in their 
application, this will be very useful. This can help to share code in Spark 
training pipeline and production deployment. Also, lots of good stuff at 
Spark's mllib doesn't depend on Spark platform, and people can use them in 
their application without pulling lots of dependencies. In fact, in my project, 
I have to copy & paste code from mllib into my project to use those goodies in 
apps.

3) Currently, mllib depends on graphx which means in graphx, there is no way to 
use mllib's vector or matrix. And at Netflix, we implemented parallel 
personalized page rank which requires to use sparse vector as part of public 
api. We have to use breeze here since no access to mllib's basic type in 
graphx. Before we contribute it back to open source community, we need to 
address this.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
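As several posts in this thread note, "prediction" for a linear model is just a dot product plus a link function. A minimal sketch in pure Python with no Spark dependency; the coefficients here are hypothetical stand-ins for values exported from a trained model, not any Spark API:

```python
import math

def predict_proba(weights, intercept, features):
    """Score one example with an exported logistic-regression model.

    This is all a linear-model 'predict' does: a dot product plus a
    link function. No Spark runtime is required once the coefficients
    have been extracted from the trained model.
    """
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))  # sigmoid link

# Hypothetical coefficients, e.g. copied out of a trained model's
# weights and intercept after training finishes.
weights = [0.5, -1.25, 0.75]
intercept = 0.5

print(predict_proba(weights, intercept, [2.0, 0.0, 1.0]))  # about 0.90
```

The hard part, as the thread goes on to say, is not this arithmetic but reproducing the full feature-transformation pipeline in front of it.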


RE: thought experiment: use spark ML to real time prediction

2015-11-12 Thread darren


I agree 100%. Building the model requires large data and many CPUs.
Using it does not.
This is a very useful property of ML models.
If MLlib models can't be used outside Spark, that's a real shame.

Sent from my Verizon Wireless 4G LTE smartphone


Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Felix Cheung
+1 on that. It would be useful to use the model outside of Spark.



_
From: DB Tsai <dbt...@dbtsai.com>
Sent: Wednesday, November 11, 2015 11:57 PM
Subject: Re: thought experiment: use spark ML to real time prediction
To: Nirmal Fernando <nir...@wso2.com>
Cc: Andy Davidson <a...@santacruzintegration.com>, Adrian Tanase 
<atan...@adobe.com>, user @spark <user@spark.apache.org>



Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Nick Pentreath
Yup, currently PMML export, or Java serialization, are the options
realistically available.

Though PMML may deter some, there are not many viable cross-platform
alternatives (with nearly as much coverage).

On Thu, Nov 12, 2015 at 1:42 PM, Sean Owen <so...@cloudera.com> wrote:

> This is all starting to sound a lot like what's already implemented in
> Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm
> not clear it helps a lot to reimplement this in Spark.
>


Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
This would bring in all of Spark's dependencies, which may break the web app.


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando <nir...@wso2.com> wrote:

>
>
> On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:
>
>> I agree 100%. Making the model requires large data and many cpus.
>>
>> Using it does not.
>>
>> This is a very useful side effect of ML models.
>>
>> If mlib can't use models outside spark that's a real shame.
>>
>
> Well, you can, as mentioned earlier. You don't need the Spark runtime for
> predictions: save the serialized model and deserialize it to use. (You do
> need the Spark jars on the classpath, though.)
>


Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Nirmal Fernando
On Fri, Nov 13, 2015 at 2:04 AM, darren <dar...@ontrenet.com> wrote:

> I agree 100%. Making the model requires large data and many cpus.
>
> Using it does not.
>
> This is a very useful side effect of ML models.
>
> If mlib can't use models outside spark that's a real shame.
>

Well, you can, as mentioned earlier. You don't need the Spark runtime for
predictions: save the serialized model and deserialize it to use. (You do need
the Spark jars on the classpath, though.)
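One pragmatic reading of "save the serialized model and deserialize to use" that avoids the Spark-jars requirement altogether is to export only the fitted parameters and serialize those. A hedged sketch; the parameter names and values are illustrative, not a Spark API:

```python
import pickle

# Training side: extract the fitted parameters from the trained model once
# (e.g. its weights array and intercept) and persist them.
params = {"weights": [0.5, -1.25, 0.75], "intercept": 0.5}  # hypothetical values
blob = pickle.dumps(params)

# Serving side: a plain Python process, no Spark on the classpath.
loaded = pickle.loads(blob)

def predict(features):
    # Linear-model scoring is just a dot product plus the intercept.
    return sum(w * x for w, x in zip(loaded["weights"], features)) + loaded["intercept"]

print(predict([2.0, 0.0, 4.0]))  # prints 4.5
```

The trade-off is the one discussed upthread: this covers the final model, but not the transformation pipeline that produced the feature vector.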




-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
I think the use case can be quite different from PMML's.

Having a Spark-platform-independent ML jar would empower users to do the
following:

1) PMML doesn't contain all the models we have in mllib. Also, for an ML
pipeline trained by Spark, PMML is most of the time not expressive enough to
capture all the transformations we have in Spark ML. As a result, if we could
serialize the entire Spark ML pipeline after training and then load it back
in an app, without any Spark platform, for production scoring, this would be
very useful for production deployment of Spark ML models. The only open issue
is transformers that involve a shuffle: we need to figure out a way to handle
them. When I chatted with Xiangrui about this, he suggested that we could tag
whether a transformer is shuffle-ready. Currently, at Netflix, we are not able
to use ML pipelines because of these issues, and we have to write our own
scorers for production, which is a lot of duplicated work.

2) It would be very useful if users could use Spark's linear algebra code,
such as its vector and matrix types, in their applications. This would help
share code between the Spark training pipeline and production deployment.
Also, a lot of the good stuff in Spark's mllib doesn't depend on the Spark
platform, and people could use it in their applications without pulling in
lots of dependencies. In fact, in my project, I have to copy & paste code from
mllib into my project to use those goodies in apps.

3) Currently, mllib depends on graphx, which means there is no way to use
mllib's vectors or matrices inside graphx. At Netflix, we implemented
parallel personalized PageRank, which requires a sparse vector as part of its
public api. We had to use breeze there, since graphx has no access to mllib's
basic types. Before we contribute it back to the open-source community, we
need to address this.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Nov 12, 2015 at 3:42 AM, Sean Owen <so...@cloudera.com> wrote:

> This is all starting to sound a lot like what's already implemented in
> Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm
> not clear it helps a lot to reimplement this in Spark.
>
> On Thu, Nov 12, 2015 at 8:05 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> +1 on that. It would be useful to use the model outside of Spark.
>>
>>
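Point 3 above comes down to wanting a dependency-free sparse vector type. As an illustration of how small the scoring-side requirement actually is — this is a sketch, not mllib's or breeze's API:

```python
class SparseVector:
    """Minimal sparse vector: parallel arrays of indices and values."""

    def __init__(self, size, indices, values):
        assert len(indices) == len(values)
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def dot(self, dense):
        # Only the non-zero entries contribute to the dot product,
        # which is the whole point of the sparse representation.
        return sum(v * dense[i] for i, v in zip(self.indices, self.values))

v = SparseVector(5, [0, 3], [2.0, -1.0])
print(v.dot([1.0, 1.0, 1.0, 4.0, 1.0]))  # prints -2.0
```

A small standalone module like this (with dot, norms, and axpy) is roughly what a "spark-ml-common" linear-algebra core would need to expose.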

Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread Nirmal Fernando
As of now, we basically serialize the ML model and then deserialize it
for real-time prediction.

On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase <atan...@adobe.com> wrote:

> I don’t think this answers your question but here’s how you would evaluate
> the model in realtime in a streaming app
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>
> Maybe you can find a way to extract portions of MLLib and run them outside
> of spark – loading the precomputed model and calling .predict on it…
>
> -adrian


-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread Adrian Tanase
I don’t think this answers your question but here’s how you would evaluate the 
model in realtime in a streaming app
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html

Maybe you can find a way to extract portions of MLLib and run them outside of 
spark – loading the precomputed model and calling .predict on it…

-adrian

From: Andy Davidson
Date: Tuesday, November 10, 2015 at 11:31 PM
To: "user @spark"
Subject: thought experiment: use spark ML to real time prediction

Let's say I have used spark ML to train a linear model. I know I can save and 
load the model to disk. I am not sure how I can use the model in a real-time 
environment. For example, I do not think I can easily return a “prediction” to 
the client using spark streaming. Also, for some applications the extra 
latency created by the batch process might not be acceptable.

If I was not using spark, I would re-implement the model I trained in my batch 
environment in a language like Java and implement a REST service that uses the 
model to create a prediction and return the prediction to the client. Many 
models make predictions using linear algebra. Implementing predictions is 
relatively easy if you have a good vectorized LA package. Is there a way to use 
a model I trained using spark ML outside of spark?

As a motivating example, even if it is possible to return data to the client 
using spark streaming, I think the mini-batch latency would not be acceptable 
for a high-frequency stock trading system.

Kind regards

Andy

P.S. The examples I have seen so far use spark streaming to “preprocess” 
predictions. For example, a recommender system might use what current users are 
watching to calculate “trending recommendations”. These are stored on disk and 
served up to users when they use the “movie guide”. If a recommendation was a 
couple of minutes old, it would not affect the end user's experience.
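The fallback described above — re-implementing the linear-algebra prediction behind a REST service — needs surprisingly little code once the coefficients are exported. A hedged sketch of the core handler, using only the Python standard library; the coefficient values and payload shape are made up for illustration:

```python
import json

# Hypothetical coefficients exported from the Spark-trained linear model.
WEIGHTS = [0.5, -1.5, 0.25]
INTERCEPT = 0.5

def handle_predict(request_body: str) -> str:
    """The core of a /predict endpoint: parse JSON features, score, return JSON.

    Wrapping this in any HTTP framework (or a plain http.server handler)
    gives the low-latency scoring service described above, with no Spark
    runtime in the request path.
    """
    features = json.loads(request_body)["features"]
    score = sum(w * x for w, x in zip(WEIGHTS, features)) + INTERCEPT
    return json.dumps({"prediction": score})

print(handle_predict('{"features": [2.0, 0.0, 4.0]}'))  # prints {"prediction": 2.5}
```

Request latency here is a handful of multiplications, so the mini-batch concern for high-frequency use cases does not arise.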



Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread DB Tsai
Do you think it will be useful to separate those models and model
loader/writer code into another spark-ml-common jar without any spark
platform dependencies so users can load the models trained by Spark ML in
their application and run the prediction?


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando <nir...@wso2.com> wrote:

> As of now, we are basically serializing the ML model and then deserialize
> it for prediction at real time.
>
> On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase <atan...@adobe.com> wrote:
>
>> I don’t think this answers your question but here’s how you would
>> evaluate the model in realtime in a streaming app
>>
>> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>>
>> Maybe you can find a way to extract portions of MLLib and run them
>> outside of spark – loading the precomputed model and calling .predict on it…
>>
>> -adrian
>>
>> From: Andy Davidson
>> Date: Tuesday, November 10, 2015 at 11:31 PM
>> To: "user @spark"
>> Subject: thought experiment: use spark ML to real time prediction
>>
>> Let's say I have used Spark ML to train a linear model. I know I can save
>> and load the model to disk. I am not sure how I can use the model in a real
>> time environment. For example, I do not think I can return a “prediction” to
>> the client using Spark Streaming easily. Also, for some applications the
>> extra latency created by the batch process might not be acceptable.
>>
>> If I was not using Spark, I would re-implement the model I trained in my
>> batch environment in a language like Java and implement a REST service that
>> uses the model to create a prediction and return the prediction to the
>> client. Many models make predictions using linear algebra. Implementing
>> predictions is relatively easy if you have a good vectorized LA package. Is
>> there a way to use a model I trained using Spark ML outside of Spark?
>>
>> As a motivating example: even if it's possible to return data to the
>> client using Spark Streaming, I think the mini-batch latency would not be
>> acceptable for a high-frequency stock trading system.
>>
>> Kind regards
>>
>> Andy
>>
>> P.S. The examples I have seen so far use Spark Streaming to “preprocess”
>> predictions. For example, a recommender system might use what current users
>> are watching to calculate “trending recommendations”. These are stored on
>> disk and served up to users when they use the “movie guide”. If a
>> recommendation was a couple of minutes old it would not affect the end
>> user's experience.
>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


thought experiment: use spark ML to real time prediction

2015-11-10 Thread Andy Davidson
Let's say I have used Spark ML to train a linear model. I know I can save and
load the model to disk. I am not sure how I can use the model in a real time
environment. For example, I do not think I can return a “prediction” to the
client using Spark Streaming easily. Also, for some applications the extra
latency created by the batch process might not be acceptable.

If I was not using Spark, I would re-implement the model I trained in my
batch environment in a language like Java and implement a REST service that
uses the model to create a prediction and return the prediction to the
client. Many models make predictions using linear algebra. Implementing
predictions is relatively easy if you have a good vectorized LA package. Is
there a way to use a model I trained using Spark ML outside of Spark?

As a motivating example: even if it's possible to return data to the client
using Spark Streaming, I think the mini-batch latency would not be
acceptable for a high-frequency stock trading system.

Kind regards

Andy

P.S. The examples I have seen so far use Spark Streaming to “preprocess”
predictions. For example, a recommender system might use what current users
are watching to calculate “trending recommendations”. These are stored on
disk and served up to users when they use the “movie guide”. If a
recommendation was a couple of minutes old it would not affect the end
user's experience.





RE: thought experiment: use spark ML to real time prediction

2015-11-10 Thread Kothuvatiparambil, Viju
I have a similar issue: I want to load a model saved by a Spark machine 
learning job in a web application.

model.save(jsc.sc(), "myModelPath");

LogisticRegressionModel model =
    LogisticRegressionModel.load(jsc.sc(), "myModelPath");

When I do that, I need to pass a SparkContext to load the model. The model is 
small and can be saved to the local file system, so is there any way to use it 
without the SparkContext? It looks like creating a SparkContext is an expensive 
step that internally starts a Jetty server, and I do not want to start one more 
web server inside a web application.

A solution that I received (pasted below) was to export the model into a 
generic format such as PMML. I haven't tried it, and I am hoping to find a way 
to use the model without adding a lot more dependencies and code to the project.


On Oct 30, 2015, at 2:11 PM, Stefano Baghino 
<stefano.bagh...@radicalbit.io<mailto:stefano.bagh...@radicalbit.io>> wrote:
One possibility would be to export the model as PMML (Predictive Model Markup 
Language, an XML-based standard to describe predictive models) and then use it 
in your web app (using something like JPMML<https://github.com/jpmml>, for 
example). You can directly export (some) models (including linear regression) 
since Spark 1.4: https://databricks.com/blog/2015/07/02/pmml-support-in-spark-mllib.html

For more info on PMML support on MLlib (including model support): 
https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
For more info on the PMML standard: 
http://dmg.org/pmml/v4-2-1/GeneralStructure.html
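(To make the PMML route concrete: the exported XML carries the fitted coefficients, so any consumer that can parse it can score without Spark. A toy sketch in Python; the PMML document below is hand-written and far smaller than what Spark's exporter actually produces, and a production setup would use a real evaluator such as JPMML rather than ad-hoc parsing.)

```python
import xml.etree.ElementTree as ET

# Hand-written toy PMML regression model, illustrative only.
PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="0.1">
      <NumericPredictor name="x1" coefficient="0.5"/>
      <NumericPredictor name="x2" coefficient="-0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}
table = ET.fromstring(PMML).find(".//p:RegressionTable", NS)
intercept = float(table.get("intercept"))
coefs = {p.get("name"): float(p.get("coefficient"))
         for p in table.findall("p:NumericPredictor", NS)}

def predict(features):
    # Plain linear scoring from the parsed PMML coefficients.
    return intercept + sum(coefs[name] * value
                           for name, value in features.items())

print(round(predict({"x1": 2.0, "x2": 4.0}), 2))  # -> 0.1
```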


Thanks
Viju





From: Andy Davidson [mailto:a...@santacruzintegration.com]
Sent: Tuesday, November 10, 2015 1:32 PM
To: user @spark
Subject: thought experiment: use spark ML to real time prediction

Let's say I have used Spark ML to train a linear model. I know I can save and 
load the model to disk. I am not sure how I can use the model in a real time 
environment. For example, I do not think I can return a "prediction" to the 
client using Spark Streaming easily. Also, for some applications the extra 
latency created by the batch process might not be acceptable.

If I was not using Spark, I would re-implement the model I trained in my batch 
environment in a language like Java and implement a REST service that uses the 
model to create a prediction and return the prediction to the client. Many 
models make predictions using linear algebra. Implementing predictions is 
relatively easy if you have a good vectorized LA package. Is there a way to use 
a model I trained using Spark ML outside of Spark?

As a motivating example: even if it's possible to return data to the client 
using Spark Streaming, I think the mini-batch latency would not be acceptable 
for a high-frequency stock trading system.

Kind regards

Andy

P.S. The examples I have seen so far use Spark Streaming to "preprocess" 
predictions. For example, a recommender system might use what current users are 
watching to calculate "trending recommendations". These are stored on disk and 
served up to users when they use the "movie guide". If a recommendation was a 
couple of minutes old it would not affect the end user's experience.
