Sorry, I miswrote - I meant the learners part of the framework - the models already exist.
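To make the idea concrete, here is a rough, hypothetical sketch of such a trait plus one generic test (a serialization round-trip) written against it. The names `RegressionModel`, `LinearModel`, and `roundTrip` are illustrative only, not existing MLlib API:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// Hypothetical common abstraction: any regression learner's model
// exposes predict() and is serializable by contract.
trait RegressionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// A trivial linear model, only to show that generic tests can be
// written once against the trait rather than per algorithm.
class LinearModel(val weights: Array[Double], val intercept: Double)
    extends RegressionModel {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
}

// Generic serialization test, applicable to every RegressionModel.
def roundTrip(model: RegressionModel): RegressionModel = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(model)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  in.readObject().asInstanceOf[RegressionModel]
}

val m = new LinearModel(Array(2.0, -1.0), 0.5)
val restored = roundTrip(m)
println(m.predict(Array(1.0, 3.0)))        // 2*1 - 1*3 + 0.5 = -0.5
println(restored.predict(Array(1.0, 3.0))) // same prediction after round-trip
```

With an interface like this, a pipeline can be written once against `RegressionModel`, and every new algorithm only needs to implement the trait to inherit the shared test suite.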
2014-09-12 15:53 GMT+04:00 Christoph Sawade <christoph.saw...@googlemail.com>:

> I totally agree, and we also discovered some drawbacks with the
> classification model implementations that are based on GLMs:
>
> - There is no distinction between predicting scores, classes, and
> calibrated scores (probabilities). For these models it is common to have
> access to all of them, and the prediction function ``predict`` should be
> consistent and stateless. Currently, the score is only available after
> removing the threshold from the model.
> - There is no distinction between multinomial and binomial classification.
> For multinomial problems, it is necessary to handle multiple weight vectors
> and multiple confidences.
> - Models are not serialisable, which makes it hard to use them in practice.
>
> I started a pull request [1] some time ago. I would be happy to continue
> the discussion and clarify the interfaces, too!
>
> Cheers, Christoph
>
> [1] https://github.com/apache/spark/pull/2137/
>
> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
>> Here at Yandex, while implementing gradient boosting in Spark and
>> building our ML tool for internal use, we found the following serious
>> problems in MLlib:
>>
>> - There is no Regression/Classification model abstraction. We were
>> building abstract data processing pipelines that should work with any
>> regression - the exact algorithm is specified outside this code. There
>> is no abstraction that allows me to do that. *(This is the main reason
>> for all the further problems.)*
>> - There is no common practice in MLlib for testing algorithms: every
>> model generates its own random test data. There are no easily
>> extractable test cases applicable to other algorithms, and there are no
>> benchmarks for comparing algorithms. After implementing a new algorithm,
>> it is very hard to understand how it should be tested.
>> - Lack of serialization testing: MLlib algorithms don't contain tests
>> verifying that a model still works after serialization.
>> - When implementing a new algorithm, it's hard to understand what API
>> you should create and which interface to implement.
>>
>> The starting point for solving all these problems is to create common
>> interfaces for the typical algorithms/models - regression,
>> classification, clustering, collaborative filtering.
>>
>> All main tests should be written against these interfaces, so that when
>> a new algorithm is implemented, all it needs to do is pass the already
>> written tests. That would let us keep quality manageable across the
>> whole library.
>>
>> There should be a couple of benchmarks that give a new Spark user a
>> feeling for which algorithm to use.
>>
>> The test set against these abstractions should contain a serialization
>> test. In production, most of the time there is no use for a model that
>> can't be stored.
>>
>> As the first step of this roadmap, I'd like to create a trait
>> RegressionModel, *add* methods to the current algorithms to implement
>> this trait, and create some tests against it. I'm planning to do this
>> next week.
>>
>> The purpose of this letter is to collect any objections to this
>> approach at an early stage: please give any feedback. The second reason
>> is to set a lock on this activity so we don't do the same thing twice:
>> I'll create a pull request by the end of next week, and any parallel
>> development can start from there.
>>
>> --
>>
>> *Sincerely yours,
>> Egor Pakhomov
>> Scala Developer, Yandex*
>

--
*Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex*