Hi Egor,

Thanks for the feedback! We are aware of some of the issues you
mentioned, and JIRAs have been created for them. Specifically, I'm
pushing out the design for pipeline features and algorithm/model
parameters this week. We can move our discussion to
https://issues.apache.org/jira/browse/SPARK-1856 .

It would be nice to have tests written against interfaces, but that
definitely needs more discussion before making PRs. For example, we
discussed the learning interfaces in Christoph's PR
(https://github.com/apache/spark/pull/2137/), but it takes time to
reach consensus, especially on interfaces. Hopefully all of us can
benefit from the discussion. The best practice is to break the
proposal down into small, independent pieces and discuss them on the
JIRA before submitting PRs.

For performance tests, there is the spark-perf package
(https://github.com/databricks/spark-perf), and we added performance
tests for MLlib in v1.1. But more work definitely needs to be done.

The dev list may not be a good place for design discussions. Could
you create a JIRA for each of the issues you pointed out, so we can
track the discussion there? Thanks!

Best,
Xiangrui

On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin <r...@databricks.com> wrote:
> Xiangrui can comment more, but I believe he and Joseph are actually
> working on standardizing the interfaces and the pipeline feature for the
> 1.2 release.
>
> On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <pahomov.e...@gmail.com>
> wrote:
>
>> Some architectural suggestions on this matter -
>> https://github.com/apache/spark/pull/2371
>>
>> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>
>> > Sorry, I miswrote - I meant the learner part of the framework; the
>> > models already exist.
>> >
>> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>> > christoph.saw...@googlemail.com>:
>> >
>> >> I totally agree, and we also discovered some drawbacks in the
>> >> classification model implementations that are based on GLMs:
>> >>
>> >> - There is no distinction between predicting scores, classes, and
>> >> calibrated scores (probabilities). For these models it is common to have
>> >> access to all of them, and the prediction function ``predict`` should be
>> >> consistent and stateless; see the sketch after this list. Currently, the
>> >> score is only available after removing the threshold from the model.
>> >> - There is no distinction between multinomial and binomial
>> >> classification. For multinomial problems, it is necessary to handle
>> >> multiple weight vectors and multiple confidences.
>> >> - Models are not serialisable, which makes it hard to use them in
>> >> practice.
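>> >>
>> >> To illustrate the first point, here is a minimal sketch (the names are
>> >> hypothetical, not part of the current MLlib API): one stateless model
>> >> that exposes the raw score, the calibrated score, and the hard label
>> >> through separate, consistent methods.
>> >>
>> >>   import org.apache.spark.mllib.linalg.Vector
>> >>
>> >>   trait ScoredClassificationModel extends Serializable {
>> >>     def predictRaw(features: Vector): Double          // raw score / margin
>> >>     def predictProbability(features: Vector): Double  // calibrated, in [0, 1]
>> >>     // Hard label derived from the calibrated score; the threshold stays
>> >>     // an explicit argument instead of being baked into the model's state.
>> >>     def predict(features: Vector, threshold: Double = 0.5): Double =
>> >>       if (predictProbability(features) >= threshold) 1.0 else 0.0
>> >>   }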
>> >>
>> >> I started a pull request [1] some time ago. I would be happy to continue
>> >> the discussion and clarify the interfaces, too!
>> >>
>> >> Cheers, Christoph
>> >>
>> >> [1] https://github.com/apache/spark/pull/2137/
>> >>
>> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>> >>
>> >>> Here at Yandex, while implementing gradient boosting in Spark and
>> >>> creating our ML tool for internal use, we found the following serious
>> >>> problems in MLlib:
>> >>>
>> >>>
>> >>>    - There is no Regression/Classification model abstraction. We were
>> >>>    building abstract data-processing pipelines, which should work with
>> >>>    just some regression, with the exact algorithm specified outside this
>> >>>    code. There is no abstraction which would allow me to do that. *(This
>> >>>    is the main reason for all the further problems.)*
>> >>>    - There is no common practice in MLlib for testing algorithms: every
>> >>>    model generates its own random test data. There are no easily
>> >>>    extractable test cases applicable to other algorithms, and no
>> >>>    benchmarks for comparing algorithms. After implementing a new
>> >>>    algorithm, it is very hard to understand how it should be tested.
>> >>>    - Lack of serialization testing: MLlib algorithms don't contain tests
>> >>>    which verify that a model still works after serialization.
>> >>>    - When implementing a new algorithm, it is hard to understand what
>> >>>    API you should create and which interface to implement.
>> >>>
>> >>> The starting point for solving all these problems is to create common
>> >>> interfaces for the typical algorithms/models: regression,
>> >>> classification, clustering, and collaborative filtering.
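>> >>>
>> >>> To make this concrete, here is a rough sketch of the kind of shared
>> >>> abstraction I mean for the learner side (the names are hypothetical
>> >>> and only meant to seed the discussion):
>> >>>
>> >>>   import org.apache.spark.mllib.regression.{LabeledPoint, RegressionModel}
>> >>>   import org.apache.spark.rdd.RDD
>> >>>
>> >>>   // A learner produces a model; pipelines can then be written against
>> >>>   // this trait without knowing which concrete algorithm is plugged in.
>> >>>   trait Regressor[M <: RegressionModel] extends Serializable {
>> >>>     def train(data: RDD[LabeledPoint]): M
>> >>>   }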
>> >>>
>> >>> All the main tests should be written against these interfaces, so that
>> >>> when a new algorithm is implemented, all it has to do is pass the
>> >>> already-written tests. That would give us manageable quality across the
>> >>> whole library.
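>> >>>
>> >>> For example, a check like the following (a hypothetical helper, not
>> >>> existing MLlib code; it reuses the Regressor trait and imports from the
>> >>> sketch above) could be written once and run against every implementation:
>> >>>
>> >>>   import org.apache.spark.SparkContext
>> >>>   import org.apache.spark.mllib.linalg.Vectors
>> >>>
>> >>>   // Any reasonable regressor should roughly recover y = 2x from
>> >>>   // noise-free training data.
>> >>>   def checkFitsLine(learner: Regressor[_ <: RegressionModel],
>> >>>                     sc: SparkContext): Unit = {
>> >>>     val data = sc.parallelize((1 to 100).map(i =>
>> >>>       LabeledPoint(2.0 * i, Vectors.dense(i.toDouble))))
>> >>>     val model = learner.train(data)
>> >>>     val prediction = model.predict(Vectors.dense(50.0))
>> >>>     assert(math.abs(prediction - 100.0) < 1.0, s"got $prediction")
>> >>>   }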
>> >>>
>> >>> There should be a couple of benchmarks which allow a new Spark user to
>> >>> get a feeling for which algorithm to use.
>> >>>
>> >>> The test set against these abstractions should contain a serialization
>> >>> test. In production, most of the time there is no use for a model that
>> >>> cannot be stored.
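>> >>>
>> >>> A reusable serialization check could be as simple as this sketch
>> >>> (again, a hypothetical helper written against the shared interfaces):
>> >>>
>> >>>   import java.io._
>> >>>   import org.apache.spark.mllib.linalg.Vector
>> >>>
>> >>>   // Round-trip a model through Java serialization and verify that the
>> >>>   // restored model makes identical predictions.
>> >>>   def checkSerialization(model: RegressionModel, points: Seq[Vector]): Unit = {
>> >>>     val buffer = new ByteArrayOutputStream()
>> >>>     val out = new ObjectOutputStream(buffer)
>> >>>     out.writeObject(model)
>> >>>     out.close()
>> >>>     val restored = new ObjectInputStream(
>> >>>       new ByteArrayInputStream(buffer.toByteArray))
>> >>>       .readObject().asInstanceOf[RegressionModel]
>> >>>     points.foreach(p => assert(model.predict(p) == restored.predict(p)))
>> >>>   }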
>> >>>
>> >>> As the first step on this roadmap, I'd like to create a trait
>> >>> RegressionModel, *ADD* methods to the current algorithms to implement
>> >>> this trait, and create some tests against it. I'm planning to do it
>> >>> next week.
>> >>>
>> >>> The purpose of this letter is to collect any objections to this
>> >>> approach at an early stage: please give any feedback. The second reason
>> >>> is to put a lock on this activity so that we don't do the same thing
>> >>> twice: I'll create a pull request by the end of next week, and we can
>> >>> coordinate any parallel development from there.
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> *Sincerely yours,
>> >>> Egor Pakhomov
>> >>> Scala Developer, Yandex*
>> >>>
>> >>
>> >>
>> >
>> >
>> > --
>> > *Sincerely yours,
>> > Egor Pakhomov
>> > Scala Developer, Yandex*
>> >
>>
>>
>>
>> --
>> *Sincerely yours,
>> Egor Pakhomov
>> Scala Developer, Yandex*
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
