Re: Revisiting Online serving of Spark models?

Denny Lee Wed, 30 May 2018 10:30:39 -0700

I most likely will not be able to join SF next week but definitely up for a
session after Summit in Seattle to dive further into this, eh?!


On Wed, May 30, 2018 at 9:32 AM Felix Cheung <[email protected]>
wrote:

> Hi!
>
> Thank you! Let’s meet then
>
> June 6 4pm
>
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>
> Ground floor (outside of conference area - should be available for all) -
> we will meet and decide where to go
>
> (Would not send invite because that would be too much noise for dev@)
>
> To paraphrase Joseph, we will use this to kick off the discussion and
> post notes after and follow up online. As for Seattle, I would be very
> interested to meet in person later and discuss ;)
>
>
> _____________________________
> From: Saikat Kanjilal <[email protected]>
> Sent: Tuesday, May 29, 2018 11:46 AM
>
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice <[email protected]>
> Cc: Felix Cheung <[email protected]>, Holden Karau <
> [email protected]>, Joseph Bradley <[email protected]>, Leif Walsh
> <[email protected]>, dev <[email protected]>
>
>
>
> Would love to join but am in Seattle, thoughts on how to make this work?
>
> Regards
>
> Sent from my iPhone
>
> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
> [email protected]> wrote:
>
> Big +1 to a meeting with fresh air.
>
> Could anyone send the invites? I don't really know which is the place
> Holden is talking about.
>
> 2018-05-29 14:27 GMT-03:00 Felix Cheung <[email protected]>:
>
>> You had me at blue bottle!
>>
>> _____________________________
>> From: Holden Karau <[email protected]>
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung <[email protected]>
>> Cc: Saikat Kanjilal <[email protected]>, Maximiliano Felice <
>> [email protected]>, Joseph Bradley <[email protected]>,
>> Leif Walsh <[email protected]>, dev <[email protected]>
>>
>>
>>
>> I'm down for that, we could all go for a walk maybe to the Mint Plaza
>> Blue Bottle and grab coffee (if the weather holds have our design meeting
>> outside :p)?
>>
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <[email protected]>
>> wrote:
>>
>>> Bump.
>>>
>>> ------------------------------
>>> *From:* Felix Cheung <[email protected]>
>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> Hi! How about we meet the community and discuss on June 6 4pm at (near)
>>> the Summit?
>>>
>>> (I propose we meet at the venue entrance so we can accommodate people
>>> who might not be attending the conference)
>>>
>>> ------------------------------
>>> *From:* Saikat Kanjilal <[email protected]>
>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>> *To:* Maximiliano Felice
>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> I’m in the same exact boat as Maximiliano and have use cases as well for
>>> model serving and would love to join this discussion.
>>>
>>> Sent from my iPhone
>>>
>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>> [email protected]> wrote:
>>>
>>> Hi!
>>>
>>> I don't usually write a lot on this list, but I keep up to date with
>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>> attention, as we're currently facing this issue at work. I'm attending
>>> the summit and was wondering if it would be possible for me to join that
>>> meeting. I might be able to share some helpful use cases and ideas.
>>>
>>> Thanks,
>>> Maximiliano Felice
>>>
>>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <[email protected]>
>>> escribió:
>>>
>>>> I’m with you on json being more readable than parquet, but we’ve had
>>>> success using pyarrow’s parquet reader and have been quite happy with it so
>>>> far. If your target is python (and probably if not now, then soon, R), you
>>>> should look into it.
>>>>
>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <[email protected]>
>>>> wrote:
>>>>
>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>> * Our approach was to use the same format but write JSON instead of
>>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>>> format simplifies the architecture.  Plus, some people want to check
>>>>> files into version control, and JSON is nice for that.
>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe,
>>>>> eventually, handle Parquet in the online serving setting).
>>>>>
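To make the JSON idea above concrete, here is a minimal sketch in plain
Python, assuming a hypothetical linear model whose only parameters are a
coefficient list and an intercept. The names LocalLinearModel, save_model,
load_model and the fmt parameter are made up for illustration; they are not
Spark APIs, and the on-disk layout is not the layout Spark ML actually writes.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class LocalLinearModel:
    """Hypothetical stand-in for the parameters of a trained model."""
    coefficients: List[float]
    intercept: float

    def predict(self, features: List[float]) -> float:
        # Plain dot product plus intercept; no Spark needed at serving time.
        return sum(c * x for c, x in zip(self.coefficients, features)) + self.intercept


def save_model(model: LocalLinearModel, path: str, fmt: str = "json") -> None:
    # fmt mirrors the "format parameter" idea; only JSON is sketched here.
    if fmt != "json":
        raise ValueError(f"unsupported format: {fmt}")
    with open(path, "w", encoding="utf-8") as f:
        # sort_keys and indent keep the file stable and diff-friendly
        # for version control.
        json.dump(asdict(model), f, indent=2, sort_keys=True)


def load_model(path: str, fmt: str = "json") -> LocalLinearModel:
    if fmt != "json":
        raise ValueError(f"unsupported format: {fmt}")
    with open(path, encoding="utf-8") as f:
        return LocalLinearModel(**json.load(f))
```

A serving process would call load_model once at startup and then call
predict per request. Adding another format would mean adding another branch
on fmt, without changing the callers.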
>>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>>> people are around at the Spark Summit, that could be a good time to meet
>>>>> up & then post notes back to the dev list.
>>>>>
>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Specifically I’d like to bring part of the discussion to Model and
>>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>
>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>> to discuss/get some feedback?
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Felix Cheung <[email protected]>
>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>> To: Holden Karau <[email protected]>, Joseph Bradley <
>>>>>> [email protected]>
>>>>>> Cc: dev <[email protected]>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Huge +1 on this!
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:*[email protected] <[email protected]> on behalf of
>>>>>> Holden Karau <[email protected]>
>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>> *To:* Joseph Bradley
>>>>>> *Cc:* dev
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>>>
>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>> Spark.
>>>>>>
>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>> versions of MLlib models which could be deployed without the big Spark
>>>>>>> JARs and without a SparkContext or SparkSession.  There are related
>>>>>>> commercial offerings like this : ) but the overhead of maintaining those
>>>>>>> offerings is pretty high.  Building good APIs within MLlib to avoid
>>>>>>> copying logic across libraries will be well worth it.
>>>>>>>
>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>>>> local, lightweight versions of models in mllib-local, outside of the
>>>>>>> main mllib package (for easier deployment with smaller & fewer
>>>>>>> dependencies).
>>>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could
>>>>>>> be moved into a local sql package.
>>>>>>> * This architecture may currently require some awkward APIs in order to
>>>>>>> have model prediction logic in mllib-local, local model classes in
>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.
>>>>>>> We might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>>>
>>>>>> I agree this could be interesting, and feed into the other discussion
>>>>>> around when (or if) we should be considering Spark 3.0.
>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>> mix in to avoid breaking the current APIs, but I could be wrong on that
>>>>>> point.
>>>>>>
>>>>>>> * It could also be worth discussing local DataFrames.  They might
>>>>>>> not be as important as per-Row transformations, but they would be
>>>>>>> helpful for batching for higher throughput.
>>>>>>>
>>>>>> That could be interesting as well.
>>>>>>
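The batching point above can be sketched roughly as follows, in plain
Python. This is only an illustration under stated assumptions: Row is a
plain dict, the model is a hypothetical linear scorer, and the helper names
batches and score_batch are invented here. A real local DataFrame would use
a columnar layout instead of a list of dicts.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption: a "row" is just a mapping from column name to numeric value.
Row = Dict[str, float]


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Group an incoming stream of rows into lists of at most `size` rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def score_batch(batch: List[Row], weights: Row, intercept: float) -> List[float]:
    """Score a whole batch in one call (hypothetical linear model)."""
    return [
        sum(weights[name] * row[name] for name in weights) + intercept
        for row in batch
    ]
```

The per-row path is the special case size=1. Larger batches let a serving
layer amortize per-call overhead, at the cost of some added latency while a
batch fills up.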
>>>>>>>
>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>
>>>>>>> Joseph
>>>>>>>
>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi y'all,
>>>>>>>>
>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>> good a time as any to revisit the online serving situation in Spark
>>>>>>>> ML. DB & others have done some excellent work moving a lot of the
>>>>>>>> necessary tools into a local linear algebra package that doesn't
>>>>>>>> depend on having a SparkContext.
>>>>>>>>
>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>>> private, so they either need to copy or re-implement them (or put
>>>>>>>> themselves in org.apache.spark) to access them. How would folks feel
>>>>>>>> about adding a new trait for ML pipeline stages to expose
>>>>>>>> transformation of single-element inputs (or local collections), which
>>>>>>>> could be optionally implemented by stages that support this? That way
>>>>>>>> we can have less copy-and-paste code possibly getting out of sync with
>>>>>>>> our model training.
>>>>>>>>
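One way to picture the optional trait proposed above is the following
minimal sketch. It is written in Python for brevity (in Spark itself this
would be a Scala trait), and every name in it (LocalTransformer,
transform_row, Scaler, DistributedOnlyStage, transform_local) is
hypothetical and not an existing Spark API.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable

# Assumption: a "row" is a plain dict from column name to value.
Row = Dict[str, Any]


@runtime_checkable
class LocalTransformer(Protocol):
    """Hypothetical optional interface: transform one row without a cluster."""

    def transform_row(self, row: Row) -> Row: ...


class Scaler:
    """Example stage that opts in to single-row transformation."""

    def __init__(self, column: str, factor: float) -> None:
        self.column = column
        self.factor = factor

    def transform_row(self, row: Row) -> Row:
        out = dict(row)  # copy so the caller's row is not mutated
        out[self.column] = row[self.column] * self.factor
        return out


class DistributedOnlyStage:
    """Example stage that does not implement the optional interface."""


def transform_local(stages: List[object], row: Row) -> Row:
    """Run a row through every stage, failing fast on unsupported stages."""
    for stage in stages:
        if not isinstance(stage, LocalTransformer):
            raise TypeError(f"{type(stage).__name__} does not support local transform")
        row = stage.transform_row(row)
    return row
```

Because the interface is optional, stages that cannot run locally are left
unchanged, and a serving tool can check up front whether a whole pipeline
supports single-row transformation.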
>>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>>> projects is probably the right path forward (folks have different
>>>>>>>> needs), but I'd love to see us make it simpler for other projects to
>>>>>>>> build reliable serving tools.
>>>>>>>>
>>>>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>>>>> with their own commercial offerings, but hopefully if we make it
>>>>>>>> easier for everyone the commercial vendors can benefit as well.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Holden :)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Joseph Bradley
>>>>>>>
>>>>>>> Software Engineer - Machine Learning
>>>>>>>
>>>>>>> Databricks, Inc.
>>>>>>>
>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Joseph Bradley
>>>>>
>>>>> Software Engineer - Machine Learning
>>>>>
>>>>> Databricks, Inc.
>>>>>
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>
>>>> --
>>>> Cheers,
>>>> Leif
>>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>>
>>
>
>
>
