(Oh, also: the write API has already been extended to take formats.)
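
For anyone skimming the thread, here is a rough sketch of what a format-taking writer might look like for the JSON case Joseph describes in the quoted message. It is plain Python with no Spark dependency, and `SketchModelWriter`, `load_params`, and the `model.json` file name are made up for illustration; this is not Spark's actual writer API, just the shape of the idea.

```python
import json
import os
import tempfile


class SketchModelWriter:
    """Hypothetical writer for illustration only; not Spark's MLWriter."""

    def __init__(self, params):
        self._params = dict(params)
        self._format = "json"  # assumed default for this sketch

    def format(self, name):
        # Same chaining style as DataFrame writers: set the format, return self.
        self._format = name
        return self

    def save(self, path):
        # Only JSON is handled here; other formats are out of scope for the sketch.
        if self._format != "json":
            raise ValueError(f"unsupported format in this sketch: {self._format}")
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "model.json"), "w", encoding="utf-8") as f:
            # Sorted keys and indentation keep diffs small under version control.
            json.dump(self._params, f, sort_keys=True, indent=2)


def load_params(path):
    """Read the parameters back with nothing but the standard library."""
    with open(os.path.join(path, "model.json"), encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    SketchModelWriter({"intercept": 0.5, "coefficients": [1.0, -2.0]}).format("json").save(out_dir)
    print(load_params(out_dir))
```

The reading side needs only a JSON parser, which is the property that matters for serving outside of Spark.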

On Mon, May 21, 2018 at 2:51 PM Holden Karau <hol...@pigscanfly.ca> wrote:

> I like that idea. I’ll be around Spark Summit.
>
> On Mon, May 21, 2018 at 1:52 PM Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Regarding model reading and writing, I'll give quick thoughts here:
>> * Our approach was to use the same format but write JSON instead of
>> Parquet.  It's easier to parse JSON without Spark, and using the same
>> format simplifies architecture.  Plus, some people want to check files into
>> version control, and JSON is nice for that.
>> * The reader/writer APIs could be extended to take format parameters
>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>> handle Parquet in the online serving setting).
>>
>> This would be a big project, so proposing a SPIP might be best.  If
>> people are around at the Spark Summit, that could be a good time to meet up
>> & then post notes back to the dev list.
>>
>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> Specifically, I’d like to bring part of the discussion to Model and
>>> PipelineModel, and the various ModelReader and SharedReadWrite implementations
>>> that rely on SparkContext. This is a big blocker to reusing trained models
>>> outside of Spark for online serving.
>>>
>>> What’s the next step? Would folks be interested in getting together to
>>> discuss/get some feedback?
>>>
>>>
>>> _____________________________
>>> From: Felix Cheung <felixcheun...@hotmail.com>
>>> Sent: Thursday, May 10, 2018 10:10 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <
>>> jos...@databricks.com>
>>> Cc: dev <dev@spark.apache.org>
>>>
>>>
>>>
>>> Huge +1 on this!
>>>
>>> ------------------------------
>>> *From:* holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of
>>> Holden Karau <hol...@pigscanfly.ca>
>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>> *To:* Joseph Bradley
>>> *Cc:* dev
>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>
>>>
>>>
>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <jos...@databricks.com>
>>> wrote:
>>>
>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>
>>>> Awesome! I'm glad other folks think something like this belongs in
>>> Spark.
>>>
>>>> This was one of the original goals for mllib-local: to have local
>>>> versions of MLlib models which could be deployed without the big Spark JARs
>>>> and without a SparkContext or SparkSession.  There are related commercial
>>>> offerings like this : ) but the overhead of maintaining those offerings is
>>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>>> libraries will be well worth it.
>>>>
>>>> We've talked about this need at Databricks and have also been syncing
>>>> with the creators of MLeap.  It'd be great to get this functionality into
>>>> Spark itself.  Some thoughts:
>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>> moved into a local sql package.
>>>> * This architecture may currently require some awkward APIs in order to
>>>> have model prediction logic in mllib-local, local model classes in mllib-local,
>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>> architecture while making it feasible for 3rd party developers to extend
>>>> MLlib APIs (especially in Java).
>>>>
>>> I agree this could be interesting, and it could feed into the other discussion
>>> around when (or if) we should be considering Spark 3.0.
>>> I _think_ we could probably do it with optional traits people could mix
>>> in to avoid breaking the current APIs, but I could be wrong on that point.
>>>
>>>> * It could also be worth discussing local DataFrames.  They might not
>>>> be as important as per-Row transformations, but they would be helpful for
>>>> batching for higher throughput.
>>>>
>>> That could be interesting as well.
>>>
>>>>
>>>> I'll be interested to hear others' thoughts too!
>>>>
>>>> Joseph
>>>>
>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Hi y'all,
>>>>>
>>>>> With the renewed interest in ML in Apache Spark, now seems like as good
>>>>> a time as any to revisit the online serving situation in Spark ML. DB &
>>>>> others have done some excellent work moving a lot of the necessary
>>>>> tools into a local linear algebra package that doesn't depend on having a
>>>>> SparkContext.
>>>>>
>>>>> There are a few different commercial and non-commercial solutions
>>>>> around this, but currently our individual transform/predict methods are
>>>>> private, so they either need to copy or re-implement them (or put themselves
>>>>> in org.apache.spark) to access them. How would folks feel about adding a new
>>>>> trait for ML pipeline stages that exposes transformation of single-element
>>>>> inputs (or local collections) and that could be optionally implemented
>>>>> by stages which support this? That way we can have less copy-and-paste
>>>>> code possibly getting out of sync with our model training.
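
To make the proposed trait concrete, here is a minimal sketch in plain Python (no Spark involved). `LocalTransformer`, `transform_one`, `transform_many`, `LinearModelStage`, and `LocalPipeline` are hypothetical names, not existing Spark classes, and rows are modeled as plain dicts purely as an assumption. The point is only that stages opt in to single-element transformation and a pipeline chains the stages that do.

```python
from abc import ABC, abstractmethod


class LocalTransformer(ABC):
    """Hypothetical opt-in interface for stages that can run without a SparkContext."""

    @abstractmethod
    def transform_one(self, row: dict) -> dict:
        """Transform a single input element and return a new row."""

    def transform_many(self, rows):
        # Default handling of a local collection: apply the single-element path.
        return [self.transform_one(r) for r in rows]


class LinearModelStage(LocalTransformer):
    """Toy linear model; stands in for any stage that supports local prediction."""

    def __init__(self, coefficients, intercept,
                 features_col="features", prediction_col="prediction"):
        self.coefficients = list(coefficients)
        self.intercept = intercept
        self.features_col = features_col
        self.prediction_col = prediction_col

    def transform_one(self, row):
        features = row[self.features_col]
        dot = sum(c * v for c, v in zip(self.coefficients, features))
        out = dict(row)  # leave the caller's row untouched
        out[self.prediction_col] = self.intercept + dot
        return out


class LocalPipeline(LocalTransformer):
    """Chains stages, rejecting any stage that has not opted in."""

    def __init__(self, stages):
        for stage in stages:
            if not isinstance(stage, LocalTransformer):
                raise TypeError(f"{type(stage).__name__} does not support local transform")
        self.stages = list(stages)

    def transform_one(self, row):
        for stage in self.stages:
            row = stage.transform_one(row)
        return row


if __name__ == "__main__":
    pipeline = LocalPipeline([LinearModelStage([2.0, 3.0], intercept=1.0)])
    print(pipeline.transform_one({"features": [1.0, 1.0]}))
```

Because the interface is optional, stages that cannot support local execution are simply left alone, which is how this could avoid breaking current APIs.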
>>>>>
>>>>> I think continuing to have online serving grow in different projects
>>>>> is probably the right path forward (folks have different needs), but I'd
>>>>> love to see us make it simpler for other projects to build reliable
>>>>> serving tools.
>>>>>
>>>>> I realize this may put some folks in an awkward position
>>>>> with their own commercial offerings, but hopefully if we make it easier
>>>>> for everyone, the commercial vendors can benefit as well.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Joseph Bradley
>>>>
>>>> Software Engineer - Machine Learning
>>>>
>>>> Databricks, Inc.
>>>>
>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
> --
> Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau
