I most likely will not be able to join in SF next week, but I'm definitely up for a session in Seattle after the Summit to dive further into this, eh?!
On Wed, May 30, 2018 at 9:32 AM Felix Cheung <[email protected]> wrote:

> Hi!
>
> Thank you! Let’s meet then
>
> June 6 4pm
>
> Moscone West Convention Center
> 800 Howard Street, San Francisco, CA 94103
> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>
> Ground floor (outside of conference area - should be available for all) -
> we will meet and decide where to go
>
> (Would not send invite because that would be too much noise for dev@)
>
> To paraphrase Joseph, we will use this to kick off the discussion and
> post notes after and follow up online. As for Seattle, I would be very
> interested to meet in person later and discuss ;)
>
> _____________________________
> From: Saikat Kanjilal <[email protected]>
> Sent: Tuesday, May 29, 2018 11:46 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Maximiliano Felice <[email protected]>
> Cc: Felix Cheung <[email protected]>, Holden Karau <[email protected]>,
> Joseph Bradley <[email protected]>, Leif Walsh <[email protected]>,
> dev <[email protected]>
>
> Would love to join but am in Seattle, thoughts on how to make this work?
>
> Regards
>
> Sent from my iPhone
>
> On May 29, 2018, at 10:35 AM, Maximiliano Felice <[email protected]> wrote:
>
> Big +1 to a meeting with fresh air.
>
> Could anyone send the invites? I don't really know which place
> Holden is talking about.
>
> 2018-05-29 14:27 GMT-03:00 Felix Cheung <[email protected]>:
>
>> You had me at blue bottle!
>>
>> _____________________________
>> From: Holden Karau <[email protected]>
>> Sent: Tuesday, May 29, 2018 9:47 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Felix Cheung <[email protected]>
>> Cc: Saikat Kanjilal <[email protected]>, Maximiliano Felice <[email protected]>,
>> Joseph Bradley <[email protected]>, Leif Walsh <[email protected]>,
>> dev <[email protected]>
>>
>> I'm down for that, we could all go for a walk maybe to the Mint Plaza
>> Blue Bottle and grab coffee (if the weather holds have our design meeting
>> outside :p)?
>>
>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <[email protected]>
>> wrote:
>>
>>> Bump.
>>>
>>> ------------------------------
>>> *From:* Felix Cheung <[email protected]>
>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>> *Cc:* Leif Walsh; Holden Karau; dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> Hi! How about we meet the community and discuss on June 6 4pm at (near)
>>> the Summit?
>>>
>>> (I propose we meet at the venue entrance so we could accommodate people
>>> who might not be in the conference)
>>>
>>> ------------------------------
>>> *From:* Saikat Kanjilal <[email protected]>
>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>> *To:* Maximiliano Felice
>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> I’m in the same exact boat as Maximiliano and have use cases as well for
>>> model serving and would love to join this discussion.
>>>
>>> Sent from my iPhone
>>>
>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <[email protected]> wrote:
>>>
>>> Hi!
>>>
>>> I don't usually write a lot on this list but I keep up to date with
>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>> attention, as we're currently facing this issue at work. I'm attending
>>> the Summit and was wondering if it would be possible for me to join that
>>> meeting. I might be able to share some helpful use cases and ideas.
>>>
>>> Thanks,
>>> Maximiliano Felice
>>>
>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <[email protected]>
>>> wrote:
>>>
>>>> I’m with you on JSON being more readable than Parquet, but we’ve had
>>>> success using pyarrow’s Parquet reader and have been quite happy with it
>>>> so far. If your target is Python (and probably if not now, then soon, R),
>>>> you should look into it.
>>>>
>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <[email protected]>
>>>> wrote:
>>>>
>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>> * Our approach was to use the same format but write JSON instead of
>>>>> Parquet. It's easier to parse JSON without Spark, and using the same
>>>>> format simplifies architecture. Plus, some people want to check files
>>>>> into version control, and JSON is nice for that.
>>>>> * The reader/writer APIs could be extended to take format parameters
>>>>> (just like DataFrame reader/writers) to handle JSON (and maybe,
>>>>> eventually, handle Parquet in the online serving setting).
>>>>>
>>>>> This would be a big project, so proposing a SPIP might be best. If
>>>>> people are around at the Spark Summit, that could be a good time to
>>>>> meet up & then post notes back to the dev list.
>>>>>
>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <[email protected]> wrote:
>>>>>
>>>>>> Specifically I’d like to bring part of the discussion to Model and
>>>>>> PipelineModel, and various ModelReader and SharedReadWrite
>>>>>> implementations that rely on SparkContext. This is a big blocker on
>>>>>> reusing trained models outside of Spark for online serving.
>>>>>>
>>>>>> What’s the next step? Would folks be interested in getting together
>>>>>> to discuss/get some feedback?
>>>>>>
>>>>>> _____________________________
>>>>>> From: Felix Cheung <[email protected]>
>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>> To: Holden Karau <[email protected]>, Joseph Bradley <[email protected]>
>>>>>> Cc: dev <[email protected]>
>>>>>>
>>>>>> Huge +1 on this!
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* [email protected] <[email protected]> on behalf of
>>>>>> Holden Karau <[email protected]>
>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>> *To:* Joseph Bradley
>>>>>> *Cc:* dev
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up Holden! I'm a strong supporter of this.
>>>>>>>
>>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>>> Spark.
>>>>>>
>>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>>> versions of MLlib models which could be deployed without the big
>>>>>>> Spark JARs and without a SparkContext or SparkSession. There are
>>>>>>> related commercial offerings like this : ) but the overhead of
>>>>>>> maintaining those offerings is pretty high. Building good APIs within
>>>>>>> MLlib to avoid copying logic across libraries will be well worth it.
>>>>>>>
>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>> syncing with the creators of MLeap. It'd be great to get this
>>>>>>> functionality into Spark itself. Some thoughts:
>>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>>> taking a Row to the current Models. Instead, it would be ideal to have
>>>>>>> local, lightweight versions of models in mllib-local, outside of the
>>>>>>> main mllib package (for easier deployment with smaller & fewer
>>>>>>> dependencies).
>>>>>>> * Supporting Pipelines is important. For this, it would be ideal to
>>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which
>>>>>>> could be moved into a local sql package.
>>>>>>> * This architecture may require some awkward APIs currently to have
>>>>>>> model prediction logic in mllib-local, local model classes in
>>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.
>>>>>>> We might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>>> developers to extend MLlib APIs (especially in Java).
>>>>>>>
>>>>>> I agree this could be interesting, and feed into the other discussion
>>>>>> around when (or if) we should be considering Spark 3.0.
>>>>>> I _think_ we could probably do it with optional traits people could
>>>>>> mix in to avoid breaking the current APIs but I could be wrong on that
>>>>>> point.
>>>>>>
>>>>>>> * It could also be worth discussing local DataFrames. They might
>>>>>>> not be as important as per-Row transformations, but they would be
>>>>>>> helpful for batching for higher throughput.
>>>>>>>
>>>>>> That could be interesting as well.
>>>>>>
>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>
>>>>>>> Joseph
>>>>>>>
>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi y'all,
>>>>>>>>
>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>>>> good a time as any to revisit the online serving situation in Spark
>>>>>>>> ML. DB & others have done some excellent work moving a lot of the
>>>>>>>> necessary tools into a local linear algebra package that doesn't
>>>>>>>> depend on having a SparkContext.
>>>>>>>>
>>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>>> around this, but currently our individual transform/predict methods
>>>>>>>> are private so they either need to copy or re-implement (or put
>>>>>>>> themselves in org.apache.spark) to access them.
>>>>>>>> How would folks feel about adding a new trait for ML pipeline
>>>>>>>> stages to expose transformation of single-element inputs (or local
>>>>>>>> collections) that could be optionally implemented by stages which
>>>>>>>> support this? That way we can have less copy and paste code possibly
>>>>>>>> getting out of sync with our model training.
>>>>>>>>
>>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>>> projects is probably the right path forward (folks have different
>>>>>>>> needs), but I'd love to see us make it simpler for other projects to
>>>>>>>> build reliable serving tools.
>>>>>>>>
>>>>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>>>>> with their own commercial offerings, but hopefully if we make it
>>>>>>>> easier for everyone the commercial vendors can benefit as well.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Holden :)
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>
>>>>>>> --
>>>>>>> Joseph Bradley
>>>>>>> Software Engineer - Machine Learning
>>>>>>> Databricks, Inc.
>>>>>>> <http://databricks.com/>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>> --
>>>>> Joseph Bradley
>>>>> Software Engineer - Machine Learning
>>>>> Databricks, Inc.
>>>>> <http://databricks.com/>
>>>>
>>>> --
>>>> Cheers,
>>>> Leif
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
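
For readers following the thread, here is a minimal sketch of the two ideas discussed above: an optional interface that pipeline stages can implement to transform a single row without a SparkContext, and JSON persistence that can be parsed without Spark. It is written in Python purely for illustration (the proposal itself concerns Scala traits in Spark ML), and every name in it (`LocalTransformer`, `transform_row`, `LocalScaler`, `LocalLinearModel`, `LocalPipeline`, `to_json`, `from_json`) is hypothetical and not part of any Spark API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Protocol, runtime_checkable

Row = Dict[str, float]


@runtime_checkable
class LocalTransformer(Protocol):
    """Hypothetical optional interface: stages that support local serving
    implement transform_row; stages that do not simply omit it."""

    def transform_row(self, row: Row) -> Row: ...


@dataclass
class LocalScaler:
    """Hypothetical stage: standardizes one column using stored statistics."""
    input_col: str
    output_col: str
    mean: float
    std: float

    def transform_row(self, row: Row) -> Row:
        out = dict(row)  # copy so the caller's row is not mutated
        out[self.output_col] = (row[self.input_col] - self.mean) / self.std
        return out


@dataclass
class LocalLinearModel:
    """Hypothetical stage: linear prediction from already-fitted parameters."""
    feature_cols: List[str]
    coefficients: List[float]
    intercept: float
    prediction_col: str = "prediction"

    def transform_row(self, row: Row) -> Row:
        out = dict(row)
        out[self.prediction_col] = self.intercept + sum(
            c * row[f] for c, f in zip(self.coefficients, self.feature_cols)
        )
        return out

    def to_json(self) -> str:
        # Plain JSON so the parameters can be read without Spark.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "LocalLinearModel":
        return cls(**json.loads(text))


class LocalPipeline:
    """Chains stages, rejecting any stage that lacks the optional interface."""

    def __init__(self, stages: List[object]) -> None:
        unsupported = [s for s in stages if not isinstance(s, LocalTransformer)]
        if unsupported:
            raise TypeError(f"stages without local support: {unsupported!r}")
        self.stages = list(stages)

    def transform_row(self, row: Row) -> Row:
        for stage in self.stages:
            row = stage.transform_row(row)
        return row

    def transform_rows(self, rows: List[Row]) -> List[Row]:
        # Local-collection variant, e.g. for small serving batches.
        return [self.transform_row(r) for r in rows]


if __name__ == "__main__":
    pipeline = LocalPipeline([
        LocalScaler("x", "x_scaled", mean=2.0, std=2.0),
        LocalLinearModel(["x_scaled"], [3.0], intercept=1.0),
    ])
    print(pipeline.transform_row({"x": 4.0}))
    # {'x': 4.0, 'x_scaled': 1.0, 'prediction': 4.0}
```

The point of keeping the interface optional is the one raised in the thread: existing stages keep working unchanged, and only stages that can serve locally opt in, so a serving project can check support up front instead of copying or re-implementing prediction logic.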
