Hi Egor,

I posted the design doc for pipelines and parameters on the JIRA, and
now I'm working out some details of ML datasets, which I will post
later this week. Your feedback is welcome!

Best,
Xiangrui

On Mon, Sep 15, 2014 at 12:44 AM, Reynold Xin <r...@databricks.com> wrote:
> Hi Egor,
>
> Thanks for the suggestion. It is definitely our intention and practice to
> post design docs as soon as they are ready and to keep iteration cycles
> short. As a matter of fact, we encourage posting design docs for major
> features before implementation starts, and WIP pull requests for large
> features before they are fully baked.
>
> That said, no, not 100% of a committer's time is spent on a specific ticket.
> There are lots of tickets that stay open for a long time before somebody
> starts actively working on them. So no, it is not true that "all this time
> was active development". Xiangrui should post the design doc as soon as it
> is ready for feedback.
>
>
>
> On Sun, Sep 14, 2014 at 11:26 PM, Egor Pahomov <pahomov.e...@gmail.com>
> wrote:
>>
>> It's good that Databricks is working on this issue! However, the current
>> process of working on it is not very clear to outsiders.
>>
>> The last update on this ticket was August 5. If all this time was active
>> development, I am concerned that, without feedback from the community for
>> such a long time, development can go in the wrong direction.
>> Even if it ends up being one great big patch, introducing the new
>> interfaces to the community early would let us start working on our own
>> pipeline code. It would let us write algorithms in the new paradigm instead
>> of in the absence of any paradigm, as before, and it would let us help you
>> migrate the old code to the new paradigm.
>>
>> My main point: shorter iterations with more transparency.
>>
>> I think it would be a good idea to open a pull request with the code you
>> have so far, even if it doesn't pass tests, just so we can comment on it
>> before it is formulated in a design doc.
>>
>>
>> 2014-09-13 0:00 GMT+04:00 Patrick Wendell <pwend...@gmail.com>:
>>>
>>> We typically post design docs on JIRAs before major work starts. For
>>> instance, I'm pretty sure SPARK-1856 will have a design doc posted
>>> shortly.
>>>
>>> On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson <e...@redhat.com> wrote:
>>> >
>>> > Are interface designs being captured anywhere as documents that the
>>> > community can follow along with as the proposals evolve?
>>> >
>>> > I've worked on other open source projects where design docs were
>>> > published as "living documents" (e.g. on Google Docs or Etherpad; the
>>> > particular mechanism isn't crucial). FWIW, I found that to be a good way
>>> > to work in a community environment.
>>> >
>>> >
>>> > ----- Original Message -----
>>> >> Hi Egor,
>>> >>
>>> >> Thanks for the feedback! We are aware of some of the issues you
>>> >> mentioned and there are JIRAs created for them. Specifically, I'm
>>> >> pushing out the design on pipeline features and algorithm/model
>>> >> parameters this week. We can move our discussion to
>>> >> https://issues.apache.org/jira/browse/SPARK-1856 .
>>> >>
>>> >> It would be nice to have tests against the interfaces, but that
>>> >> definitely needs more discussion before making PRs. For example, we
>>> >> discussed the learning interfaces in Christoph's PR
>>> >> (https://github.com/apache/spark/pull/2137/), but it takes time to
>>> >> reach a consensus, especially on interfaces. Hopefully all of us can
>>> >> benefit from the discussion. The best practice is to break the
>>> >> proposal down into small independent pieces and discuss them on the
>>> >> JIRA before submitting PRs.
>>> >>
>>> >> For performance tests, there is a spark-perf package
>>> >> (https://github.com/databricks/spark-perf) and we added performance
>>> >> tests for MLlib in v1.1. But definitely more work needs to be done.
>>> >>
>>> >> The dev list may not be a good place for discussing the design.
>>> >> Could you create JIRAs for each of the issues you pointed out, so we
>>> >> can track the discussion there? Thanks!
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >>
>>> >> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin <r...@databricks.com>
>>> >> wrote:
>>> >> > Xiangrui can comment more, but I believe he and Joseph are actually
>>> >> > working on standardized interfaces and the pipeline feature for the
>>> >> > 1.2 release.
>>> >> >
>>> >> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov
>>> >> > <pahomov.e...@gmail.com>
>>> >> > wrote:
>>> >> >
>>> >> >> Some architecture suggestions on this matter:
>>> >> >> https://github.com/apache/spark/pull/2371
>>> >> >>
>>> >> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>> >> >>
>>> >> >> > Sorry, I miswrote: I meant the learner part of the framework;
>>> >> >> > models already exist.
>>> >> >> >
>>> >> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>>> >> >> > christoph.saw...@googlemail.com>:
>>> >> >> >
>>> >> >> >> I totally agree, and we also discovered some drawbacks with the
>>> >> >> >> classification model implementations that are based on GLMs (a
>>> >> >> >> rough sketch follows the list):
>>> >> >> >>
>>> >> >> >> - There is no distinction between predicting scores, classes, and
>>> >> >> >> calibrated scores (probabilities). For these models it is common
>>> >> >> >> to have access to all of them, and the prediction function
>>> >> >> >> ``predict`` should be consistent and stateless. Currently, the
>>> >> >> >> score is only available after removing the threshold from the
>>> >> >> >> model.
>>> >> >> >> - There is no distinction between multinomial and binomial
>>> >> >> >> classification. For multinomial problems, it is necessary to
>>> >> >> >> handle multiple weight vectors and multiple confidences.
>>> >> >> >> - Models are not serializable, which makes them hard to use in
>>> >> >> >> practice.
>>> >> >> >>
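>>> >> >> >> To make the first point concrete, here is a minimal sketch of
>>> >> >> >> what a consistent, stateless prediction interface could look
>>> >> >> >> like (all names here are illustrative assumptions, not the
>>> >> >> >> current MLlib API):
>>> >> >> >>
>>> >> >> >>   import org.apache.spark.mllib.linalg.Vector
>>> >> >> >>
>>> >> >> >>   // Sketch only: a hypothetical interface, for discussion.
>>> >> >> >>   trait ClassificationModel extends Serializable {
>>> >> >> >>     // Raw, uncalibrated score per class.
>>> >> >> >>     def predictScores(features: Vector): Array[Double]
>>> >> >> >>     // Calibrated scores (probabilities); should sum to 1.
>>> >> >> >>     def predictProbabilities(features: Vector): Array[Double]
>>> >> >> >>     // Predicted class index; stateless, no mutable threshold.
>>> >> >> >>     def predictClass(features: Vector): Int
>>> >> >> >>   }
>>> >> >> >>
>>> >> >> >> Returning one value per class would also cover the multinomial
>>> >> >> >> case with multiple weight vectors; binomial classification is
>>> >> >> >> just the two-element special case.
>>> >> >> >>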
>>> >> >> >> I started a pull request [1] some time ago. I would be happy to
>>> >> >> >> continue the discussion and clarify the interfaces, too!
>>> >> >> >>
>>> >> >> >> Cheers, Christoph
>>> >> >> >>
>>> >> >> >> [1] https://github.com/apache/spark/pull/2137/
>>> >> >> >>
>>> >> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov
>>> >> >> >> <pahomov.e...@gmail.com>:
>>> >> >> >>
>>> >> >> >>> Here at Yandex, while implementing gradient boosting in Spark
>>> >> >> >>> and creating our ML tool for internal use, we found the
>>> >> >> >>> following serious problems in MLlib:
>>> >> >> >>>
>>> >> >> >>>    - There is no Regression/Classification model abstraction.
>>> >> >> >>>    We were building abstract data processing pipelines that
>>> >> >> >>>    should work with just some regression, with the exact
>>> >> >> >>>    algorithm specified outside this code. There is no
>>> >> >> >>>    abstraction that would allow us to do that. *(This is the
>>> >> >> >>>    main reason for all the further problems.)*
>>> >> >> >>>    - There is no common practice in MLlib for testing
>>> >> >> >>>    algorithms: every model generates its own random test data.
>>> >> >> >>>    There are no easily extractable test cases applicable to
>>> >> >> >>>    other algorithms, and there are no benchmarks for comparing
>>> >> >> >>>    algorithms. After implementing a new algorithm, it is very
>>> >> >> >>>    hard to understand how it should be tested.
>>> >> >> >>>    - Lack of serialization testing: MLlib algorithms don't
>>> >> >> >>>    contain tests which check that a model still works after
>>> >> >> >>>    serialization.
>>> >> >> >>>    - During implementation of a new algorithm, it is hard to
>>> >> >> >>>    understand what API you should create and which interfaces
>>> >> >> >>>    to implement.
>>> >> >> >>>
>>> >> >> >>> The starting point for solving all these problems is to create
>>> >> >> >>> common interfaces for the typical algorithms/models: regression,
>>> >> >> >>> classification, clustering, and collaborative filtering.
>>> >> >> >>>
>>> >> >> >>> All the main tests should be written against these interfaces,
>>> >> >> >>> so that when a new algorithm is implemented, all it needs to do
>>> >> >> >>> is pass the already-written tests. That would give us manageable
>>> >> >> >>> quality across the whole library.
>>> >> >> >>>
>>> >> >> >>> There should be a couple of benchmarks which give a new Spark
>>> >> >> >>> user a feeling for which algorithm to use.
>>> >> >> >>>
>>> >> >> >>> The test set against these abstractions should contain a
>>> >> >> >>> serialization test; in production, most of the time there is no
>>> >> >> >>> use for a model that can't be stored. A minimal sketch of such
>>> >> >> >>> a check follows.
>>> >> >> >>>
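>>> >> >> >>> As a rough illustration (assuming models implement
>>> >> >> >>> java.io.Serializable; this helper is an assumption for
>>> >> >> >>> discussion, not existing MLlib code):
>>> >> >> >>>
>>> >> >> >>>   import java.io._
>>> >> >> >>>
>>> >> >> >>>   object SerializationCheck {
>>> >> >> >>>     // Serialize a model and read it back, for shared tests.
>>> >> >> >>>     def roundTrip[T <: Serializable](model: T): T = {
>>> >> >> >>>       val buffer = new ByteArrayOutputStream()
>>> >> >> >>>       val out = new ObjectOutputStream(buffer)
>>> >> >> >>>       out.writeObject(model)
>>> >> >> >>>       out.close()
>>> >> >> >>>       val in = new ObjectInputStream(
>>> >> >> >>>         new ByteArrayInputStream(buffer.toByteArray))
>>> >> >> >>>       in.readObject().asInstanceOf[T]
>>> >> >> >>>     }
>>> >> >> >>>   }
>>> >> >> >>>
>>> >> >> >>> A shared test would then assert that the restored model gives
>>> >> >> >>> the same predictions as the original on a few sample inputs.
>>> >> >> >>>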
>>> >> >> >>> As the first step of this roadmap, I'd like to create a trait
>>> >> >> >>> RegressionModel, *ADD* methods to the current algorithms to
>>> >> >> >>> implement this trait, and create some tests against it (a rough
>>> >> >> >>> sketch is below). I'm planning to do it next week.
>>> >> >> >>>
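>>> >> >> >>> For discussion, a sketch of what I have in mind; every name
>>> >> >> >>> here is an assumption, nothing final:
>>> >> >> >>>
>>> >> >> >>>   import org.apache.spark.mllib.linalg.Vector
>>> >> >> >>>
>>> >> >> >>>   // Sketch only: the proposed abstraction, not an agreed API.
>>> >> >> >>>   trait RegressionModel extends Serializable {
>>> >> >> >>>     def predict(features: Vector): Double
>>> >> >> >>>   }
>>> >> >> >>>
>>> >> >> >>>   object RegressionTests {
>>> >> >> >>>     // A test written once against the interface and reused
>>> >> >> >>>     // by every implementation.
>>> >> >> >>>     def predictionsWithin(model: RegressionModel,
>>> >> >> >>>                           data: Seq[(Vector, Double)],
>>> >> >> >>>                           tolerance: Double): Boolean =
>>> >> >> >>>       data.forall { case (x, y) =>
>>> >> >> >>>         math.abs(model.predict(x) - y) <= tolerance
>>> >> >> >>>       }
>>> >> >> >>>   }
>>> >> >> >>>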
>>> >> >> >>> The purpose of this letter is to collect any objections to this
>>> >> >> >>> approach at an early stage: please give any feedback. The second
>>> >> >> >>> reason is to put a lock on this activity so we don't do the same
>>> >> >> >>> thing twice: I'll create a pull request by the end of next week,
>>> >> >> >>> and we can coordinate any parallel development from there.
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>> --
>>> >> >> >>> Sincerely yours
>>> >> >> >>> Egor Pakhomov
>>> >> >> >>> Scala Developer, Yandex
>>> >> >> >>
>>> >> >> >>
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Sincerely yours
>>> >> >> > Egor Pakhomov
>>> >> >> > Scala Developer, Yandex
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Sincerely yours
>>> >> >> Egor Pakhomov
>>> >> >> Scala Developer, Yandex
>>> >>
>>> >>
>>> >
>>
>>
>>
>>
>> --
>> Sincerely yours
>> Egor Pakhomov
>> Scala Developer, Yandex
>
>

