Re: Revisiting Online serving of Spark models?

Nick Pentreath Tue, 05 Jun 2018 15:08:11 -0700

I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, 3 Jun 2018 at 00:24 Holden Karau <[email protected]> wrote:


> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
> [email protected]> wrote:
>
>> Hi!
>>
>> We're already in San Francisco waiting for the summit. We even think that
>> we spotted @holdenk this afternoon.
>>
> Unless you happened to be walking by my garage, probably not super likely;
> I spent the day working on scooters/motorcycles (my style is a little less
> unique in SF :)). Also, if you see me, feel free to say hi unless I look like
> I haven't had my first coffee of the day. I love chatting with folks IRL :)
>
>>
>> @chris, we're really interested in the Meetup you're hosting. My team
>> will probably join it from the beginning if you have room for us, and I'll
>> join it later after discussing the topics on this thread. I'll send you an
>> email regarding this request.
>>
>> Thanks
>>
>> On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <[email protected]>
>> wrote:
>>
>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>> folks
>>>
>>> @Felix I work in downtown Seattle, am wondering if we should do a tech
>>> meetup around model serving in Spark at my work or elsewhere close,
>>> thoughts?  I’m actually in the midst of building microservices to manage
>>> models and when I say models I mean much more than machine learning models
>>> (think OR, process models as well)
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 31, 2018, at 10:32 PM, Chris Fregly <[email protected]> wrote:
>>>
>>> Hey everyone!
>>>
>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>> calendar event - mostly for me, so i don’t forget!  :)
>>>
>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>> TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>> @5:30pm on June 6th (same night) here in SF!
>>>
>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>> includes the signup link:
>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>>
>>> We have an awesome lineup of speakers covering a lot of deep, technical
>>> ground.
>>>
>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>> posting the recording afterward.
>>>
>>> All details are in the meetup link above…
>>>
>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>> welcome to give a talk. I can move things around to make room.
>>>
>>> @joseph:  I’d personally like an update on the direction of the
>>> Databricks proprietary ML Serving export format which is similar to PMML
>>> but not a standard in any way.
>>>
>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>> customers.  This seems in conflict with the community efforts described
>>> here.  Can you comment on behalf of Databricks?
>>>
>>> Look forward to your response, joseph.
>>>
>>> See you all soon!
>>>
>>> —
>>>
>>>
>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>>> Users)
>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>>> Global Members)
>>>
>>>
>>>
>>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf
>>> *
>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>> <http://community.pipeline.ai/>*
>>>
>>>
>>> On May 30, 2018, at 9:32 AM, Felix Cheung <[email protected]>
>>> wrote:
>>>
>>> Hi!
>>>
>>> Thank you! Let’s meet then
>>>
>>> June 6 4pm
>>>
>>> Moscone West Convention Center
>>> 800 Howard Street, San Francisco, CA 94103
>>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>>>
>>> Ground floor (outside of conference area - should be available for all)
>>> - we will meet and decide where to go
>>>
>>> (Would not send invite because that would be too much noise for dev@)
>>>
>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>> post notes after and follow up online. As for Seattle, I would be very
>>> interested to meet in person later and discuss ;)
>>>
>>>
>>> _____________________________
>>> From: Saikat Kanjilal <[email protected]>
>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Maximiliano Felice <[email protected]>
>>> Cc: Felix Cheung <[email protected]>, Holden Karau <
>>> [email protected]>, Joseph Bradley <[email protected]>, Leif
>>> Walsh <[email protected]>, dev <[email protected]>
>>>
>>>
>>> Would love to join but am in Seattle, thoughts on how to make this work?
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>> [email protected]> wrote:
>>>
>>> Big +1 to a meeting with fresh air.
>>>
>>> Could anyone send the invites? I don't really know which is the place
>>> Holden is talking about.
>>>
>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <[email protected]>:
>>>
>>>> You had me at blue bottle!
>>>>
>>>> _____________________________
>>>> From: Holden Karau <[email protected]>
>>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Felix Cheung <[email protected]>
>>>> Cc: Saikat Kanjilal <[email protected]>, Maximiliano Felice <
>>>> [email protected]>, Joseph Bradley <[email protected]>,
>>>> Leif Walsh <[email protected]>, dev <[email protected]>
>>>>
>>>>
>>>>
>>>> I'm down for that, we could all go for a walk maybe to the Mint Plaza
>>>> Blue Bottle and grab coffee (if the weather holds have our design meeting
>>>> outside :p)?
>>>>
>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <
>>>> [email protected]> wrote:
>>>>
>>>>> Bump.
>>>>>
>>>>> ------------------------------
>>>>> *From:* Felix Cheung <[email protected]>
>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>>
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>> Hi! How about we meet the community and discuss on June 6 4pm at
>>>>> (near) the Summit?
>>>>>
>>>>> (I propose we meet at the venue entrance so we can accommodate
>>>>> people who might not be at the conference)
>>>>>
>>>>> ------------------------------
>>>>> *From:* Saikat Kanjilal <[email protected]>
>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>>> *To:* Maximiliano Felice
>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>> I’m in the same exact boat as Maximiliano and have use cases as well
>>>>> for model serving and would love to join this discussion.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>>>> [email protected]> wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> I don't usually write a lot on this list but I keep up to date with
>>>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>>>> attention, as we're currently facing this issue at work. I'm attending
>>>>> the summit and was wondering if it would be possible for me to join
>>>>> that meeting. I might be able to share some helpful use cases and ideas.
>>>>>
>>>>> Thanks,
>>>>> Maximiliano Felice
>>>>>
>>>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I’m with you on JSON being more readable than Parquet, but we’ve had
>>>>>> success using pyarrow’s Parquet reader and have been quite happy with it
>>>>>> so far. If your target is Python (and probably, if not now, then soon, R),
>>>>>> you should look into it.
>>>>>>
>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>>> * Our approach was to use the same format but write JSON instead of
>>>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>>>> format simplifies architecture.  Plus, some people want to check files
>>>>>>> into version control, and JSON is nice for that.
>>>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe,
>>>>>>> eventually, handle Parquet in the online serving setting).
>>>>>>>
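A minimal sketch of the JSON idea in the bullets above, in plain Python with no Spark dependency. The function names `save_model` and `load_model`, the `format` argument, and the file layout are hypothetical illustrations only; they are not the MLlib reader/writer API.

```python
import json
import os
import tempfile


def save_model(model: dict, path: str, format: str = "json") -> None:
    """Write model parameters to `path`. Only JSON is sketched here."""
    if format != "json":
        raise ValueError(f"unsupported format: {format}")
    with open(path, "w", encoding="utf-8") as f:
        # sort_keys and indent keep diffs small if the file is checked
        # into version control.
        json.dump(model, f, sort_keys=True, indent=2)


def load_model(path: str, format: str = "json") -> dict:
    """Read model parameters back without any Spark classes."""
    if format != "json":
        raise ValueError(f"unsupported format: {format}")
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def round_trip_demo() -> dict:
    # Illustrative parameters, not taken from a real trained model.
    model = {
        "class": "LogisticRegressionModel",
        "coefficients": [0.5, -1.25],
        "intercept": 0.1,
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        save_model(model, path)
        return load_model(path)
```

The `format` argument mirrors the suggestion of format parameters on the reader/writer; a Parquet branch could be added later behind the same signature.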
>>>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>>>> people are around at the Spark Summit, that could be a good time to
>>>>>>> meet up & then post notes back to the dev list.
>>>>>>>
>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Specifically I’d like to bring part of the discussion to Model and
>>>>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>>>
>>>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>>>> to discuss/get some feedback?
>>>>>>>>
>>>>>>>>
>>>>>>>> _____________________________
>>>>>>>> From: Felix Cheung <[email protected]>
>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>>> To: Holden Karau <[email protected]>, Joseph Bradley <
>>>>>>>> [email protected]>
>>>>>>>> Cc: dev <[email protected]>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Huge +1 on this!
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>> *From:*[email protected] <[email protected]> on behalf
>>>>>>>> of Holden Karau <[email protected]>
>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>>> *To:* Joseph Bradley
>>>>>>>> *Cc:* dev
>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>>>> Spark.
>>>>>>>>
>>>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>>>> versions of MLlib models which could be deployed without the big
>>>>>>>>> Spark JARs and without a SparkContext or SparkSession.  There are
>>>>>>>>> related commercial offerings like this : ) but the overhead of
>>>>>>>>> maintaining those offerings is pretty high.  Building good APIs within
>>>>>>>>> MLlib to avoid copying logic across libraries will be well worth it.
>>>>>>>>>
>>>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
>>>>>>>>> methods taking a Row to the current Models.  Instead, it would be
>>>>>>>>> ideal to have local, lightweight versions of models in mllib-local,
>>>>>>>>> outside of the main mllib package (for easier deployment with smaller
>>>>>>>>> & fewer dependencies).
>>>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal
>>>>>>>>> to utilize elements of Spark SQL, particularly Rows and Types, which
>>>>>>>>> could be moved into a local sql package.
>>>>>>>>> * This architecture may require some awkward APIs currently to
>>>>>>>>> have model prediction logic in mllib-local, local model classes in
>>>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.
>>>>>>>>> We might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>>>>>
>>>>>>>> I agree this could be interesting, and feed into the other
>>>>>>>> discussion around when (or if) we should be considering Spark 3.0.
>>>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>>>> mix in to avoid breaking the current APIs but I could be wrong on that
>>>>>>>> point.
>>>>>>>>
>>>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>>>> not be as important as per-Row transformations, but they would be
>>>>>>>>> helpful for batching for higher throughput.
>>>>>>>>>
>>>>>>>> That could be interesting as well.
>>>>>>>>
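To make the per-Row versus local DataFrame distinction above concrete, here is a small Python sketch. The `Row` and `LocalFrame` types, the weights, and both function names are assumptions made up for illustration; a real implementation would get its throughput gain from a vectorized backend rather than from plain Python loops.

```python
from typing import Dict, List

Row = Dict[str, float]
# Hypothetical column-oriented "local DataFrame": column name -> values.
LocalFrame = Dict[str, List[float]]

# Illustrative linear model parameters (not from a real trained model).
WEIGHTS = {"x1": 2.0, "x2": -1.0}
INTERCEPT = 0.5


def predict_row(row: Row) -> float:
    """Per-row prediction: one call per request."""
    return INTERCEPT + sum(w * row[name] for name, w in WEIGHTS.items())


def predict_frame(frame: LocalFrame) -> List[float]:
    """Batch prediction: walk each column once for all rows."""
    n = len(next(iter(frame.values())))
    out = [INTERCEPT] * n
    for name, w in WEIGHTS.items():
        col = frame[name]
        for i in range(n):
            out[i] += w * col[i]
    return out
```

Both entry points share the same parameters, so a serving layer could pick the per-row path for latency and the batch path for throughput.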
>>>>>>>>>
>>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>>
>>>>>>>>> Joseph
>>>>>>>>>
>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi y'all,
>>>>>>>>>>
>>>>>>>>>> With the renewed interest in ML in Apache Spark now seems like as
>>>>>>>>>> good a time as any to revisit the online serving situation in Spark
>>>>>>>>>> ML. DB & others have done some excellent work moving a lot of the
>>>>>>>>>> necessary tools into a local linear algebra package that doesn't
>>>>>>>>>> depend on having a SparkContext.
>>>>>>>>>>
>>>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>>>> around this, but currently our individual transform/predict methods
>>>>>>>>>> are private, so they either need to copy or re-implement them (or put
>>>>>>>>>> themselves in org.apache.spark to access them). How would folks feel
>>>>>>>>>> about adding a new trait for ML pipeline stages that exposes
>>>>>>>>>> transformation of single element inputs (or local collections) and
>>>>>>>>>> that could be optionally implemented by stages which support this?
>>>>>>>>>> That way we can have less copy and paste code possibly getting out of
>>>>>>>>>> sync with our model training.
>>>>>>>>>>
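One way to picture the optional trait proposed above, sketched in Python rather than Scala. A `typing.Protocol` stands in for the trait here; `LocalTransformer`, `transform_one`, `Scaler`, `ClusterOnlyStage`, and `serve` are all hypothetical names invented for this sketch and do not exist in Spark.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable

Row = Dict[str, Any]


@runtime_checkable
class LocalTransformer(Protocol):
    """Hypothetical optional interface: transform one row, no SparkContext."""

    def transform_one(self, row: Row) -> Row: ...


class Scaler:
    """A stage that opts in by providing transform_one."""

    def __init__(self, column: str, factor: float) -> None:
        self.column = column
        self.factor = factor

    def transform_one(self, row: Row) -> Row:
        out = dict(row)  # copy so the caller's row is not mutated
        out[self.column] = row[self.column] * self.factor
        return out


class ClusterOnlyStage:
    """A stage that does not opt in to local transformation."""


def serve(stages: List[object], row: Row) -> Row:
    """Run one row through the stages, failing fast on unsupported ones."""
    for stage in stages:
        if not isinstance(stage, LocalTransformer):
            raise TypeError(
                f"{type(stage).__name__} does not support local transformation"
            )
        row = stage.transform_one(row)
    return row
```

Because the interface is optional, existing stages stay unchanged, and a serving tool can check up front whether a whole pipeline supports single-element transformation.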
>>>>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>>>>> projects is probably the right path forward (folks have different
>>>>>>>>>> needs), but I'd love to see us make it simpler for other projects to
>>>>>>>>>> build reliable serving tools.
>>>>>>>>>>
>>>>>>>>>> I realize this maybe puts some of the folks in an awkward
>>>>>>>>>> position with their own commercial offerings, but hopefully if we
>>>>>>>>>> make it easier for everyone the commercial vendors can benefit as well.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Holden :)
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Joseph Bradley
>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>> Databricks, Inc.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Joseph Bradley
>>>>>>> Software Engineer - Machine Learning
>>>>>>> Databricks, Inc.
>>>>>>>
>>>>>> --
>>>>>> --
>>>>>> Cheers,
>>>>>> Leif
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
