(Oh also the write API has already been extended to take formats).
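As a rough, hypothetical sketch of the reader/writer format-parameter idea discussed in the thread below, the writer side could look something like this; the `FormatCapableWriter` name and the "json" value are assumptions for illustration, not existing Spark APIs:

```scala
import org.apache.spark.ml.util.MLWriter

// Hypothetical sketch: a format parameter on an MLWriter, analogous to
// DataFrameWriter.format(...). Only MLWriter itself is an existing Spark API.
abstract class FormatCapableWriter extends MLWriter {
  private var source: String = "parquet" // current default persistence format

  /** Select an output format, e.g. "parquet" or "json" (hypothetical values). */
  def format(src: String): this.type = {
    source = src
    this
  }

  /** Concrete writers would dispatch on `selectedFormat` when saving metadata/params/data. */
  protected def selectedFormat: String = source
}
```

A JSON-backed implementation of `saveImpl` would then let models be written in a form that can be parsed without Spark, while the default Parquet path stays unchanged.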
On Mon, May 21, 2018 at 2:51 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> I like that idea. I'll be around Spark Summit.
>
> On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <jos...@databricks.com> wrote:
>
>> Regarding model reading and writing, I'll give quick thoughts here:
>> * Our approach was to use the same format but write JSON instead of Parquet. It's easier to parse JSON without Spark, and using the same format simplifies the architecture. Plus, some people want to check files into version control, and JSON is nice for that.
>> * The reader/writer APIs could be extended to take format parameters (just like DataFrame readers/writers) to handle JSON (and maybe, eventually, handle Parquet in the online serving setting).
>>
>> This would be a big project, so proposing a SPIP might be best. If people are around at the Spark Summit, that could be a good time to meet up & then post notes back to the dev list.
>>
>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>
>>> Specifically I'd like to bring part of the discussion to Model and PipelineModel, and the various ModelReader and SharedReadWrite implementations that rely on SparkContext. This is a big blocker on reusing trained models outside of Spark for online serving.
>>>
>>> What's the next step? Would folks be interested in getting together to discuss/get some feedback?
>>>
>>> _____________________________
>>> From: Felix Cheung <felixcheun...@hotmail.com>
>>> Sent: Thursday, May 10, 2018 10:10 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <jos...@databricks.com>
>>> Cc: dev <dev@spark.apache.org>
>>>
>>> Huge +1 on this!
>>>
>>> ------------------------------
>>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau <hol...@pigscanfly.ca>
>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>> *To:* Joseph Bradley
>>> *Cc:* dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com> wrote:
>>>
>>>> Thanks for bringing this up Holden! I'm a strong supporter of this.
>>>>
>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>>
>>>> This was one of the original goals for mllib-local: to have local versions of MLlib models which could be deployed without the big Spark JARs and without a SparkContext or SparkSession. There are related commercial offerings like this : ) but the overhead of maintaining those offerings is pretty high. Building good APIs within MLlib to avoid copying logic across libraries will be well worth it.
>>>>
>>>> We've talked about this need at Databricks and have also been syncing with the creators of MLeap. It'd be great to get this functionality into Spark itself. Some thoughts:
>>>> * It'd be valuable to have this go beyond adding transform() methods taking a Row to the current Models. Instead, it would be ideal to have local, lightweight versions of models in mllib-local, outside of the main mllib package (for easier deployment with smaller & fewer dependencies).
>>>> * Supporting Pipelines is important. For this, it would be ideal to utilize elements of Spark SQL, particularly Rows and Types, which could be moved into a local sql package.
>>>> * This architecture may currently require some awkward APIs in order to have model prediction logic in mllib-local, local model classes in mllib-local, and regular (DataFrame-friendly) model classes in mllib. We might find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this architecture while making it feasible for 3rd-party developers to extend MLlib APIs (especially in Java).
>>>>
>>> I agree this could be interesting, and it feeds into the other discussion around when (or if) we should be considering Spark 3.0.
>>> I _think_ we could probably do it with optional traits people could mix in to avoid breaking the current APIs, but I could be wrong on that point.
>>>
>>>> * It could also be worth discussing local DataFrames. They might not be as important as per-Row transformations, but they would be helpful for batching for higher throughput.
>>>>
>>> That could be interesting as well.
>>>
>>>> I'll be interested to hear others' thoughts too!
>>>>
>>>> Joseph
>>>>
>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>
>>>>> Hi y'all,
>>>>>
>>>>> With the renewed interest in ML in Apache Spark, now seems like as good a time as any to revisit the online serving situation in Spark ML. DB & others have done some excellent work moving a lot of the necessary tools into a local linear algebra package that doesn't depend on having a SparkContext.
>>>>>
>>>>> There are a few different commercial and non-commercial solutions around this, but currently our individual transform/predict methods are private, so those solutions either need to copy or re-implement them (or put themselves in org.apache.spark) to access them. How would folks feel about adding a new trait for ML pipeline stages to expose transformation of single-element inputs (or local collections), which could be optionally implemented by stages that support this? That way we'd have less copy-and-paste code that could get out of sync with our model training.
>>>>>
>>>>> I think continuing to have online serving grow in different projects is probably the right path forward (folks have different needs), but I'd love to see us make it simpler for other projects to build reliable serving tools.
>>>>>
>>>>> I realize this may put some folks in an awkward position with their own commercial offerings, but hopefully if we make it easier for everyone, the commercial vendors can benefit as well.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>>
>>>> --
>>>> Joseph Bradley
>>>> Software Engineer - Machine Learning
>>>> Databricks, Inc.
>>>> http://databricks.com
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>
>>
>> --
>> Joseph Bradley
>> Software Engineer - Machine Learning
>> Databricks, Inc.
>> http://databricks.com
>
> --
> Twitter: https://twitter.com/holdenkarau

--
Twitter: https://twitter.com/holdenkarau
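To make the optional mix-in trait proposed above concrete, a minimal sketch of what it might look like follows; the `LocalTransformer` name and its methods are hypothetical, and only `Transformer` and `Row` are existing Spark APIs:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.Row

// Hypothetical sketch of an optional trait that pipeline stages could mix in
// to expose single-element (and local-collection) transformation without a
// SparkSession. Stages that cannot support local serving simply don't mix it in,
// so the existing public APIs stay untouched.
trait LocalTransformer { self: Transformer =>

  /** Transform a single input row locally, with no SparkContext required. */
  def transformRow(row: Row): Row

  /** Transform a local collection, defaulting to per-row transformation. */
  def transformLocal(rows: Seq[Row]): Seq[Row] = rows.map(transformRow)
}
```

A serving tool could then match on `LocalTransformer` for each stage of a fitted `PipelineModel` and chain `transformRow` calls, rather than copying or re-implementing the private transform/predict logic.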