On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com> wrote:
> Thanks for bringing this up Holden! I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local versions
> of MLlib models which could be deployed without the big Spark JARs and
> without a SparkContext or SparkSession. There are related commercial
> offerings like this :) but the overhead of maintaining those offerings is
> pretty high. Building good APIs within MLlib to avoid copying logic across
> libraries will be well worth it.
>
> We've talked about this need at Databricks and have also been syncing with
> the creators of MLeap. It'd be great to get this functionality into Spark
> itself. Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods
> taking a Row to the current Models. Instead, it would be ideal to have
> local, lightweight versions of models in mllib-local, outside of the main
> mllib package (for easier deployment with smaller & fewer dependencies).
> * Supporting Pipelines is important. For this, it would be ideal to
> utilize elements of Spark SQL, particularly Rows and Types, which could be
> moved into a local sql package.
> * This architecture may currently require some awkward APIs: model
> prediction logic in mllib-local, local model classes in mllib-local, and
> regular (DataFrame-friendly) model classes in mllib. We might find it
> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
> architecture while making it feasible for 3rd-party developers to extend
> MLlib APIs (especially in Java).

I agree this could be interesting, and it feeds into the other discussion
around when (or if) we should be considering Spark 3.0. I _think_ we could
probably do it with optional traits people could mix in to avoid breaking
the current APIs, but I could be wrong on that point.

> * It could also be worth discussing local DataFrames. They might not be
> as important as per-Row transformations, but they would be helpful for
> batching for higher throughput.

That could be interesting as well.

> I'll be interested to hear others' thoughts too!
>
> Joseph
>
> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Hi y'all,
>>
>> With the renewed interest in ML in Apache Spark, now seems like as good a
>> time as any to revisit the online serving situation in Spark ML. DB and
>> others have done some excellent work moving a lot of the necessary tools
>> into a local linear algebra package that doesn't depend on having a
>> SparkContext.
>>
>> There are a few different commercial and non-commercial solutions around
>> this, but currently our individual transform/predict methods are private,
>> so they either need to copy or re-implement (or put themselves in
>> org.apache.spark) to access them. How would folks feel about adding a new
>> trait for ML pipeline stages to expose transformation of single-element
>> inputs (or local collections), optionally implemented by stages which
>> support it? That way we'd have less copy-and-paste code that could get
>> out of sync with our model training.
>>
>> I think continuing to have online serving grow in different projects is
>> probably the right path forward (folks have different needs), but I'd love
>> to see us make it simpler for other projects to build reliable serving
>> tools.
>> I realize this may put some of the folks with their own commercial
>> offerings in an awkward position, but hopefully if we make it easier for
>> everyone, the commercial vendors can benefit as well.
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>
> --
> Joseph Bradley
> Software Engineer - Machine Learning
> Databricks, Inc.
> http://databricks.com

--
Twitter: https://twitter.com/holdenkarau
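[Editor's note: to make the trait proposal in the thread concrete, below is a minimal sketch of what such an optional mix-in could look like. It is an illustration under stated assumptions, not an actual Spark API: LocalTransformer, transformOne, transformLocal, and LocalLinearModel are hypothetical names invented here; only Vector/Vectors from org.apache.spark.ml.linalg (shipped in the mllib-local artifact) are real.]

// Hypothetical sketch only -- not an actual Spark API. Assumes the
// spark-mllib-local artifact on the classpath for Vector/Vectors.
import org.apache.spark.ml.linalg.{Vector, Vectors}

// An optional mix-in a pipeline stage could implement to expose
// single-element prediction without a SparkContext or SparkSession.
trait LocalTransformer[I, O] {
  /** Transform one input element locally. */
  def transformOne(input: I): O

  /** Convenience: transform a small local collection. */
  def transformLocal(inputs: Seq[I]): Seq[O] = inputs.map(transformOne)
}

// A toy linear model mixing in the trait. A real Model would keep its
// existing DataFrame-based transform() for batch scoring alongside this.
class LocalLinearModel(val coefficients: Vector, val intercept: Double)
    extends LocalTransformer[Vector, Double] {

  override def transformOne(features: Vector): Double = {
    require(features.size == coefficients.size, "dimension mismatch")
    // Plain dot product plus intercept -- the same math the cluster-side
    // predict would use, just without any Spark session involved.
    var acc = intercept
    var i = 0
    while (i < features.size) {
      acc += coefficients(i) * features(i)
      i += 1
    }
    acc
  }
}

object ServingExample extends App {
  val model = new LocalLinearModel(Vectors.dense(0.5, -1.2, 3.0), intercept = 0.1)
  println(model.transformOne(Vectors.dense(1.0, 2.0, 0.5)))        // single element
  println(model.transformLocal(Seq(Vectors.dense(0.0, 1.0, 2.0)))) // local collection
}

Because a trait like this is purely additive, existing stages could mix it in without breaking current APIs, which is the point about optional traits made above; full Pipeline support would additionally need local Row and schema types, per the local sql package idea in the thread.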