On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com>
wrote:

> Thanks for bringing this up, Holden!  I'm a strong supporter of this.
>
> Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local versions
> of MLlib models which could be deployed without the big Spark JARs and
> without a SparkContext or SparkSession.  There are related commercial
> offerings like this : ) but the overhead of maintaining those offerings is
> pretty high.  Building good APIs within MLlib to avoid copying logic across
> libraries will be well worth it.
>
> We've talked about this need at Databricks and have also been syncing with
> the creators of MLeap.  It'd be great to get this functionality into Spark
> itself.  Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods
> taking a Row to the current Models.  Instead, it would be ideal to have
> local, lightweight versions of models in mllib-local, outside of the main
> mllib package (for easier deployment with smaller & fewer dependencies).
> * Supporting Pipelines is important.  For this, it would be ideal to
> utilize elements of Spark SQL, particularly Rows and Types, which could be
> moved into a local sql package.
> * With the current APIs, this architecture may be somewhat awkward: model
> prediction logic and local model classes would live in mllib-local, while
> the regular (DataFrame-friendly) model classes stay in mllib.  We might
> find it helpful to break some DeveloperApis in Spark 3.0 to facilitate this
> architecture while making it feasible for third-party developers to extend
> MLlib APIs (especially in Java).
>
I agree this could be interesting, and it could feed into the other
discussion around when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in
to avoid breaking the current APIs, but I could be wrong on that point.
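To make that concrete, here's a rough sketch of what such an opt-in trait
might look like (all names here are illustrative, not an actual Spark API):

    import org.apache.spark.sql.Row

    // Hypothetical opt-in trait: a pipeline stage that can transform a
    // single Row locally, with no SparkContext or SparkSession required.
    trait LocalTransformer {
      // Transform one input row; stages that can't support local
      // transformation simply wouldn't mix this in.
      def transformRow(row: Row): Row

      // Batched variant over a plain local collection, with a default
      // implementation in terms of the single-row method.
      def transformLocal(rows: Seq[Row]): Seq[Row] = rows.map(transformRow)
    }

Existing Models that support it could extend both Transformer and this
trait, so the current DataFrame-based API stays untouched.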

> * It could also be worth discussing local DataFrames.  They might not be
> as important as per-Row transformations, but they would be helpful for
> batching for higher throughput.
>
That could be interesting as well.
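As a sketch of the batching angle (building on the hypothetical
LocalTransformer trait above, so again purely illustrative):

    // A serving layer could micro-batch incoming single-row requests for
    // higher throughput without ever creating a SparkSession. The batch
    // size of 64 is arbitrary.
    def serve(model: LocalTransformer, requests: Iterator[Row]): Iterator[Row] =
      requests.grouped(64).flatMap(model.transformLocal)

A local DataFrame abstraction could then just be a thin, schema-aware
wrapper over local collections like these.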

>
> I'll be interested to hear others' thoughts too!
>
> Joseph
>
> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Hi y'all,
>>
>> With the renewed interest in ML in Apache Spark, now seems like as good a
>> time as any to revisit the online serving situation in Spark ML. DB &
>> others have done some excellent work moving a lot of the necessary tools
>> into a local linear algebra package that doesn't depend on having a
>> SparkContext.
>>
>> There are a few different commercial and non-commercial solutions around
>> this, but currently our individual transform/predict methods are private,
>> so those projects either need to copy the logic, re-implement it, or put
>> themselves in org.apache.spark to access it. How would folks feel about
>> adding a new trait for ML pipeline stages that exposes transformation of
>> single-element inputs (or local collections), optionally implemented by
>> stages which support it? That way there would be less copy-and-paste code
>> that could drift out of sync with our model training.
>>
>> I think continuing to have online serving grow in different projects is
>> probably the right path forward (folks have different needs), but I'd love
>> to see us make it simpler for other projects to build reliable serving
>> tools.
>>
>> I realize this may put some folks in an awkward position with their own
>> commercial offerings, but hopefully if we make it easier for everyone, the
>> commercial vendors can benefit as well.
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> http://databricks.com
>



-- 
Twitter: https://twitter.com/holdenkarau
