Hi,

It'd be great if there could be any sharing of the offline discussion. Thanks!

Holden Karau wrote:

We're by the registration sign; going to start walking over at 4:05.

On Wed, Jun 6, 2018 at 2:43 PM, Maximiliano Felice <maximilianofelice@...> wrote:

Hi!

Do we meet at the entrance?

See you

On Tue, Jun 5, 2018 at 3:07 PM, Nick Pentreath <nick.pentreath@...> wrote:

I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, Jun 3, 2018 at 00:24, Holden Karau <holden@...> wrote:

On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <maximilianofelice@...> wrote:

> Hi!
>
> We're already in San Francisco waiting for the summit. We even think
> that we spotted @holdenk this afternoon.

Unless you happened to be walking by my garage, probably not super likely; I spent the day working on scooters/motorcycles (my style is a little less unique in SF :)). Also, if you see me, feel free to say hi, unless I look like I haven't had my first coffee of the day. I love chatting with folks IRL :)

> @chris, we're really interested in the Meetup you're hosting. My team
> will probably join it from the beginning if you have room for us, and
> I'll join later after discussing the topics on this thread. I'll send
> you an email regarding this request.
>
> Thanks

On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <sxk1969@...> wrote:

@Chris: this sounds fantastic, please send summary notes for Seattle folks.

@Felix: I work in downtown Seattle, and I'm wondering if we should host a tech meetup around model serving in Spark at my work or somewhere else close by. Thoughts?
I'm actually in the midst of building microservices to manage models, and when I say models I mean much more than machine learning models (think OR and process models as well).

Regards

Sent from my iPhone

On May 31, 2018, at 10:32 PM, Chris Fregly <chris@...> wrote:

Hey everyone!

@Felix: thanks for putting this together. I sent some of you a quick calendar event, mostly for me, so I don't forget! :)

Coincidentally, this is the focus of the Advanced Spark and TensorFlow Meetup at 5:30pm on June 6th (same night) here in SF!

Everybody is welcome to come. Here's the link to the meetup, which includes the signup link:
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/

We have an awesome lineup of speakers covering a lot of deep, technical ground.

For those who can't attend in person, we'll be broadcasting live, and posting the recording afterward.

All details are in the meetup link above.

@holden/felix/nick/joseph/maximiliano/saikat/leif: you're more than welcome to give a talk. I can move things around to make room.

@joseph: I'd personally like an update on the direction of the Databricks proprietary ML Serving export format, which is similar to PMML but not a standard in any way.

Also, the Databricks ML Serving Runtime is only available to Databricks customers. This seems in conflict with the community efforts described here. Can you comment on behalf of Databricks?

Look forward to your response, Joseph.
See you all soon!

--

Chris Fregly
Founder @ PipelineAI (https://pipeline.ai/) (100,000 Users)
Organizer @ Advanced Spark and TensorFlow Meetup (https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/) (85,000 Global Members)

San Francisco - Chicago - Austin - Washington DC - London - Dusseldorf

Try our PipelineAI Community Edition with GPUs and TPUs! (http://community.pipeline.ai/)

On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheung_m@...> wrote:

Hi!

Thank you! Let's meet then:

June 6, 4pm
Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of the conference area, so it should be accessible to all). We will meet and decide where to go.

(I would not send an invite because that would be too much noise for dev@.)

To paraphrase Joseph, we will use this to kick off the discussion, post notes afterward, and follow up online. As for Seattle, I would be very interested to meet in person later and discuss ;)

_____________________________
From: Saikat Kanjilal <sxk1969@...>
Sent: Tuesday, May 29, 2018 11:46 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Maximiliano Felice <maximilianofelice@...>
Cc: Felix Cheung <felixcheung_m@...>, Holden Karau <holden@...>, Joseph Bradley <joseph@...>, Leif Walsh <leif.walsh@...>, dev <[email protected]>

Would love to join, but I'm in Seattle. Thoughts on how to make this work?
Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice <maximilianofelice@...> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which place Holden is talking about.

2018-05-29 14:27 GMT-03:00, Felix Cheung <felixcheung_m@...> wrote:

You had me at Blue Bottle!

_____________________________
From: Holden Karau <holden@...>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung <felixcheung_m@...>
Cc: Saikat Kanjilal <sxk1969@...>, Maximiliano Felice <maximilianofelice@...>, Joseph Bradley <joseph@...>, Leif Walsh <leif.walsh@...>, dev <[email protected]>

I'm down for that. We could all go for a walk, maybe to the Mint Plaza Blue Bottle, and grab coffee (if the weather holds, we can have our design meeting outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@...> wrote:

Bump.

------------------------------
From: Felix Cheung <felixcheung_m@...>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev
Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 at 4pm at (or near) the Summit?
(I propose we meet at the venue entrance so we can accommodate people who might not be in the conference.)

------------------------------
From: Saikat Kanjilal <sxk1969@...>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I'm in the same exact boat as Maximiliano and have use cases as well for model serving, and I would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice <maximilianofelice@...> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that meeting. I might be able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.walsh@...> wrote:

I'm with you on JSON being more readable than Parquet, but we've had success using pyarrow's Parquet reader and have been quite happy with it so far. If your target is Python (and probably, if not now then soon, R), you should look into it.

On Mon, May 21, 2018 at 16:52, Joseph Bradley <joseph@...> wrote:

Regarding model reading and writing, I'll give quick thoughts here:

* Our approach was to use the same format but write JSON instead of Parquet. It's easier to parse JSON without Spark, and using the same format simplifies the architecture. Plus, some people want to check files into version control, and JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like DataFrame readers/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).

This would be a big project, so proposing a SPIP might be best. If people are around at the Spark Summit, that could be a good time to meet up and then post notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheung_m@...> wrote:

Specifically, I'd like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.

What's the next step? Would folks be interested in getting together to discuss and get some feedback?

_____________________________
From: Felix Cheung <felixcheung_m@...>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <holden@...>, Joseph Bradley <joseph@...>
Cc: dev <[email protected]>

Huge +1 on this!
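[Editor's note: to make the "easier to parse without Spark" point concrete, here is a minimal sketch. Spark ML's built-in persistence writes a model's metadata as a single line of JSON (under metadata/part-00000, alongside the Parquet data/ directory), so that half can already be read back with nothing but a JSON parser. The metadata line below is illustrative, shaped like what DefaultParamsWriter emits; the field values and the read_metadata helper are made up for this example, not taken from a real model dump.]

```python
import json

# A metadata line shaped like what Spark ML's DefaultParamsWriter emits.
# All values here are illustrative, not from a real saved model.
metadata_line = json.dumps({
    "class": "org.apache.spark.ml.classification.LogisticRegressionModel",
    "timestamp": 1526000000000,
    "sparkVersion": "2.3.0",
    "uid": "logreg_4f2a",
    "paramMap": {"regParam": 0.01, "maxIter": 100, "featuresCol": "features"},
})

def read_metadata(line):
    """Parse one Spark ML metadata JSON line with no SparkContext at all."""
    meta = json.loads(line)
    return meta["class"], meta["uid"], meta["paramMap"]

cls, uid, params = read_metadata(metadata_line)
print(cls.rsplit(".", 1)[-1])  # prints "LogisticRegressionModel"
print(params["regParam"])      # prints 0.01
```

The model coefficients themselves live in the Parquet data/ directory, which is where a Spark-free Parquet reader (e.g. pyarrow, as Leif suggests above) would come in.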
------------------------------
From: holden.karau@... on behalf of Holden Karau <holden@...>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?

On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@...> wrote:

> Thanks for bringing this up Holden! I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local
> versions of MLlib models which could be deployed without the big Spark
> JARs and without a SparkContext or SparkSession. There are related
> commercial offerings like this :) but the overhead of maintaining those
> offerings is pretty high. Building good APIs within MLlib to avoid
> copying logic across libraries will be well worth it.
>
> We've talked about this need at Databricks and have also been syncing
> with the creators of MLeap. It'd be great to get this functionality into
> Spark itself. Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods
> taking a Row to the current Models. Instead, it would be ideal to have
> local, lightweight versions of models in mllib-local, outside of the
> main mllib package (for easier deployment with smaller and fewer
> dependencies).
> * Supporting Pipelines is important. For this, it would be ideal to
> utilize elements of Spark SQL, particularly Rows and Types, which could
> be moved into a local sql package.
> * This architecture may require some awkward APIs currently, to have
> model prediction logic in mllib-local, local model classes in
> mllib-local, and regular (DataFrame-friendly) model classes in mllib. We
> might find it helpful to break some DeveloperApis in Spark 3.0 to
> facilitate this architecture while making it feasible for 3rd-party
> developers to extend MLlib APIs (especially in Java).

I agree this could be interesting, and it feeds into the other discussion around when (or if) we should be considering Spark 3.0. I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs, but I could be wrong on that point.

> * It could also be worth discussing local DataFrames. They might not be
> as important as per-Row transformations, but they would be helpful for
> batching for higher throughput.

That could be interesting as well.

> I'll be interested to hear others' thoughts too!
>
> Joseph
>
> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@...> wrote:
>
>> Hi y'all,
>>
>> With the renewed interest in ML in Apache Spark, now seems like as good
>> a time as any to revisit the online serving situation in Spark ML. DB
>> and others have done some excellent work moving a lot of the necessary
>> tools into a local linear algebra package that doesn't depend on having
>> a SparkContext.
>>
>> There are a few different commercial and non-commercial solutions
>> around this, but currently our individual transform/predict methods are
>> private, so they either need to copy or re-implement them (or put
>> themselves in org.apache.spark) to access them. How would folks feel
>> about adding a new trait for ML pipeline stages to expose transformation
>> of single-element inputs (or local collections) that could be optionally
>> implemented by stages which support this? That way we can have less
>> copy-and-paste code possibly getting out of sync with our model
>> training.
>>
>> I think continuing to have online serving grow in different projects is
>> probably the right path forward (folks have different needs), but I'd
>> love to see us make it simpler for other projects to build reliable
>> serving tools.
>>
>> I realize this maybe puts some of the folks in an awkward position with
>> their own commercial offerings, but hopefully if we make it easier for
>> everyone, the commercial vendors can benefit as well.
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
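[Editor's note: the optional trait Holden proposes would, in Spark itself, be a Scala mixin on pipeline stages. As a rough sketch of the shape of the idea, here is a Python analogue; every name in it (LocalTransformer, transform_local, LocalStandardScaler) is invented for illustration and is not part of any Spark API.]

```python
from abc import ABC, abstractmethod

# Hypothetical mixin mirroring the proposed trait: stages that support it
# expose a single-element transform that needs no SparkContext.
class LocalTransformer(ABC):
    @abstractmethod
    def transform_local(self, row: dict) -> dict:
        """Transform one input row locally (no Spark required)."""

    def transform_local_batch(self, rows):
        # Default for local collections: just loop over single rows.
        return [self.transform_local(r) for r in rows]

# A toy stage implementing the mixin: standardize one numeric feature
# using statistics captured at training time.
class LocalStandardScaler(LocalTransformer):
    def __init__(self, mean, std, input_col="x", output_col="x_scaled"):
        self.mean, self.std = mean, std
        self.input_col, self.output_col = input_col, output_col

    def transform_local(self, row):
        out = dict(row)
        out[self.output_col] = (row[self.input_col] - self.mean) / self.std
        return out

scaler = LocalStandardScaler(mean=10.0, std=2.0)
print(scaler.transform_local({"x": 14.0}))  # {'x': 14.0, 'x_scaled': 2.0}
```

Because the serving-time logic lives on the stage itself, projects building serving tools would call the same code the training path uses, which is the "less copy and paste getting out of sync" point above.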
--
Twitter: https://twitter.com/holdenkarau

--
Cheers,
Leif

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
