Hi,

It'd be great if there could be any sharing of the offline discussion. Thanks!

Holden Karau wrote:

We're by the registration sign; going to start walking over at 4:05.

On Wed, Jun 6, 2018 at 2:43 PM, Maximiliano Felice <maximilianofelice@...> wrote:

Hi!

Do we meet at the entrance?

See you

On Tue, Jun 5, 2018 at 3:07 PM, Nick Pentreath <nick.pentreath@...> wrote:

I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, Jun 3, 2018 at 00:24, Holden Karau <holden@...> wrote:

On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <maximilianofelice@...> wrote:

> Hi!
>
> We're already in San Francisco waiting for the summit. We even think
> that we spotted @holdenk this afternoon.

Unless you happened to be walking by my garage, probably not super likely; I spent the day working on scooters/motorcycles (my style is a little less unique in SF :)). Also, if you see me, feel free to say hi, unless I look like I haven't had my first coffee of the day. I love chatting with folks IRL :)

> @chris, we're really interested in the Meetup you're hosting. My team
> will probably join it from the beginning if you have room for us, and
> I'll join later after discussing the topics on this thread. I'll send
> you an email regarding this request.
>
> Thanks

On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <sxk1969@...> wrote:

@Chris: this sounds fantastic, please send summary notes for Seattle folks.

@Felix: I work in downtown Seattle, and I'm wondering if we should host a tech meetup around model serving in Spark at my work or somewhere else close by. Thoughts?
I'm actually in the midst of building microservices to manage models, and when I say models I mean much more than machine learning models (think OR and process models as well).

Regards

Sent from my iPhone

On May 31, 2018, at 10:32 PM, Chris Fregly <chris@...> wrote:

Hey everyone!

@Felix: thanks for putting this together. I sent some of you a quick calendar event, mostly for me, so I don't forget! :)

Coincidentally, this is the focus of the Advanced Spark and TensorFlow Meetup at 5:30pm on June 6th (same night) here in SF!

Everybody is welcome to come. Here's the link to the meetup, which includes the signup link:
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/

We have an awesome lineup of speakers covering a lot of deep, technical ground.

For those who can't attend in person, we'll be broadcasting live, and posting the recording afterward.

All details are in the meetup link above.

@holden/felix/nick/joseph/maximiliano/saikat/leif: you're more than welcome to give a talk. I can move things around to make room.

@joseph: I'd personally like an update on the direction of the Databricks proprietary ML Serving export format, which is similar to PMML but not a standard in any way.

Also, the Databricks ML Serving Runtime is only available to Databricks customers. This seems in conflict with the community efforts described here. Can you comment on behalf of Databricks?

Look forward to your response, Joseph.
See you all soon!

--

Chris Fregly
Founder @ PipelineAI (https://pipeline.ai/) (100,000 Users)
Organizer @ Advanced Spark and TensorFlow Meetup (https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/) (85,000 Global Members)

San Francisco - Chicago - Austin - Washington DC - London - Dusseldorf

Try our PipelineAI Community Edition with GPUs and TPUs! (http://community.pipeline.ai/)

On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheung_m@...> wrote:

Hi!

Thank you! Let's meet then:

June 6, 4pm
Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of the conference area, so it should be accessible to all). We will meet and decide where to go.

(I would not send an invite because that would be too much noise for dev@.)

To paraphrase Joseph, we will use this to kick off the discussion, post notes afterward, and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;)

_____________________________
From: Saikat Kanjilal <sxk1969@...>
Sent: Tuesday, May 29, 2018 11:46 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Maximiliano Felice <maximilianofelice@...>
Cc: Felix Cheung <felixcheung_m@...>, Holden Karau <holden@...>, Joseph Bradley <joseph@...>, Leif Walsh <leif.walsh@...>, dev <[email protected]>

Would love to join, but I'm in Seattle. Thoughts on how to make this work?
Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice <maximilianofelice@...> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which place Holden is talking about.

2018-05-29 14:27 GMT-03:00, Felix Cheung <felixcheung_m@...> wrote:

You had me at Blue Bottle!

_____________________________
From: Holden Karau <holden@...>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <felixcheung_m@...>
Cc: Saikat Kanjilal <sxk1969@...>, Maximiliano Felice <maximilianofelice@...>, Joseph Bradley <joseph@...>, Leif Walsh <leif.walsh@...>, dev <[email protected]>

I'm down for that. We could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (if the weather holds, we can have our design meeting outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@...> wrote:

Bump.

------------------------------
From: Felix Cheung <felixcheung_m@...>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev
Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 at 4pm at (or near) the Summit?
(I propose we meet at the venue entrance so we can accommodate people who might not be in the conference.)

------------------------------
From: Saikat Kanjilal <sxk1969@...>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I'm in the same exact boat as Maximiliano and have use cases as well for model serving, and I would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofelice@...> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.walsh@...> wrote:

I'm with you on JSON being more readable than Parquet, but we've had success using pyarrow's Parquet reader and have been quite happy with it so far. If your target is Python (and probably, if not now then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52, Joseph Bradley <joseph@...> wrote:

Regarding model reading and writing, I'll give quick thoughts here:

* Our approach was to use the same format but write JSON instead of Parquet. It's easier to parse JSON without Spark, and using the same format simplifies the architecture. Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame readers/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best. If people are around at the Spark Summit, that could be a good time to meet up and then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheung_m@...> wrote:

Specifically, I'd like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What's the next step? Would folks be interested in getting together to discuss and get some feedback?

_____________________________
From: Felix Cheung <felixcheung_m@...>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <holden@...>, Joseph Bradley <joseph@...>
Cc: dev <[email protected]>

Huge +1 on this!
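[Editor's note: to make the "easier to parse without Spark" point concrete, here is a minimal sketch. Spark ML's built-in persistence writes a model's metadata as a single line of JSON (under metadata/part-00000, alongside the Parquet data/ directory), so that half can already be read back with nothing but a JSON parser. The metadata line below is illustrative, shaped like what DefaultParamsWriter emits; the field values and the read_metadata helper are made up for this example, not taken from a real model dump.]

```python
import json

# A metadata line shaped like what Spark ML's DefaultParamsWriter emits.
# All values here are illustrative, not from a real saved model.
metadata_line = json.dumps({
    "class": "org.apache.spark.ml.classification.LogisticRegressionModel",
    "timestamp": 1526000000000,
    "sparkVersion": "2.3.0",
    "uid": "logreg_4f2a",
    "paramMap": {"regParam": 0.01, "maxIter": 100, "featuresCol": "features"},
})

def read_metadata(line):
    """Parse one Spark ML metadata JSON line with no SparkContext at all."""
    meta = json.loads(line)
    return meta["class"], meta["uid"], meta["paramMap"]

cls, uid, params = read_metadata(metadata_line)
print(cls.rsplit(".", 1)[-1])  # prints "LogisticRegressionModel"
print(params["regParam"])      # prints 0.01
```

The model coefficients themselves live in the Parquet data/ directory, which is where a Spark-free Parquet reader (e.g. pyarrow, as Leif suggests above) would come in.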
------------------------------
From: holden.karau@... on behalf of Holden Karau <holden@...>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?

On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@...> wrote:

> Thanks for bringing this up Holden! I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local
> versions of MLlib models which could be deployed without the big Spark
> JARs and without a SparkContext or SparkSession. There are related
> commercial offerings like this :) but the overhead of maintaining those
> offerings is pretty high. Building good APIs within MLlib to avoid
> copying logic across libraries will be well worth it.
>
> We've talked about this need at Databricks and have also been syncing
> with the creators of MLeap. It'd be great to get this functionality into
> Spark itself. Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods
> taking a Row to the current Models. Instead, it would be ideal to have
> local, lightweight versions of models in mllib-local, outside of the
> main mllib package (for easier deployment with smaller and fewer
> dependencies).
> * Supporting Pipelines is important. For this, it would be ideal to
> utilize elements of Spark SQL, particularly Rows and Types, which could
> be moved into a local sql package.
> * This architecture may require some awkward APIs currently, to have
> model prediction logic in mllib-local, local model classes in
> mllib-local, and regular (DataFrame-friendly) model classes in mllib. We
> might find it helpful to break some DeveloperApis in Spark 3.0 to
> facilitate this architecture while making it feasible for 3rd-party
> developers to extend MLlib APIs (especially in Java).

I agree this could be interesting, and it feeds into the other discussion around when (or if) we should be considering Spark 3.0. I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs, but I could be wrong on that point.

> * It could also be worth discussing local DataFrames. They might not be
> as important as per-Row transformations, but they would be helpful for
> batching for higher throughput.

That could be interesting as well.

> I'll be interested to hear others' thoughts too!
>
> Joseph
>
> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@...> wrote:
>
>> Hi y'all,
>>
>> With the renewed interest in ML in Apache Spark, now seems like as good
>> a time as any to revisit the online serving situation in Spark ML. DB
>> and others have done some excellent work moving a lot of the necessary
>> tools into a local linear algebra package that doesn't depend on having
>> a SparkContext.
>>
>> There are a few different commercial and non-commercial solutions
>> around this, but currently our individual transform/predict methods are
>> private, so they either need to copy or re-implement them (or put
>> themselves in org.apache.spark) to access them. How would folks feel
>> about adding a new trait for ML pipeline stages to expose transformation
>> of single-element inputs (or local collections) that could be optionally
>> implemented by stages which support this? That way we can have less
>> copy-and-paste code possibly getting out of sync with our model
>> training.
>>
>> I think continuing to have online serving grow in different projects is
>> probably the right path forward (folks have different needs), but I'd
>> love to see us make it simpler for other projects to build reliable
>> serving tools.
>>
>> I realize this maybe puts some of the folks in an awkward position with
>> their own commercial offerings, but hopefully if we make it easier for
>> everyone, the commercial vendors can benefit as well.
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
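[Editor's note: the optional trait Holden proposes would, in Spark itself, be a Scala mixin on pipeline stages. As a rough sketch of the shape of the idea, here is a Python analogue; every name in it (LocalTransformer, transform_local, LocalStandardScaler) is invented for illustration and is not part of any Spark API.]

```python
from abc import ABC, abstractmethod

# Hypothetical mixin mirroring the proposed trait: stages that support it
# expose a single-element transform that needs no SparkContext.
class LocalTransformer(ABC):
    @abstractmethod
    def transform_local(self, row: dict) -> dict:
        """Transform one input row locally (no Spark required)."""

    def transform_local_batch(self, rows):
        # Default for local collections: just loop over single rows.
        return [self.transform_local(r) for r in rows]

# A toy stage implementing the mixin: standardize one numeric feature
# using statistics captured at training time.
class LocalStandardScaler(LocalTransformer):
    def __init__(self, mean, std, input_col="x", output_col="x_scaled"):
        self.mean, self.std = mean, std
        self.input_col, self.output_col = input_col, output_col

    def transform_local(self, row):
        out = dict(row)
        out[self.output_col] = (row[self.input_col] - self.mean) / self.std
        return out

scaler = LocalStandardScaler(mean=10.0, std=2.0)
print(scaler.transform_local({"x": 14.0}))  # {'x': 14.0, 'x_scaled': 2.0}
```

Because the serving-time logic lives on the stage itself, projects building serving tools would call the same code the training path uses, which is the "less copy and paste getting out of sync" point above.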
--
Twitter: https://twitter.com/holdenkarau

--
Cheers,
Leif

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
