Re: Adding abstraction in MLlib

2014-09-17 Thread Xiangrui Meng
Hi Egor,

I posted the design doc for pipelines and parameters on the JIRA. Now
I'm trying to work out some details of ML datasets, which I will post
later this week. Your feedback is welcome!

Best,
Xiangrui

On Mon, Sep 15, 2014 at 12:44 AM, Reynold Xin r...@databricks.com wrote:
 Hi Egor,

 Thanks for the suggestion. It is definitely our intention and practice to
 post design docs as soon as they are ready, and to keep iteration cycles
 short. As a matter of fact, we encourage posting design docs for major
 features before implementation starts, and WIP pull requests for large
 features before they are fully baked.

 That said, no, not 100% of a committer's time is spent on a specific ticket.
 There are lots of tickets that are open for a long time before somebody
 starts actively working on them. So no, it is not true that all this time was
 active development. Xiangrui should post the design doc as soon as it is
 ready for feedback.



 On Sun, Sep 14, 2014 at 11:26 PM, Egor Pahomov pahomov.e...@gmail.com
 wrote:

 It's good that Databricks is working on this issue! However, the current
 process of working on it is not very clear to an outsider.

 The last update on this ticket is August 5. If all this time was active
 development, I am concerned that without feedback from the community for
 such a long time the development can go in the wrong direction.
 Even if it ends up being one great big patch, introducing the new
 interfaces to the community early would allow us to start working on our
 pipeline code. It would allow us to write algorithms in the new paradigm
 instead of in the absence of any paradigm, as before. It would allow us to
 help you migrate the old code to the new paradigm.

 My main point: shorter iterations with more transparency.

 I think it would be a good idea to create a pull request with the code you
 have so far, even if it doesn't pass tests, just so we can comment on it
 before it is formulated in a design doc.


 2014-09-13 0:00 GMT+04:00 Patrick Wendell pwend...@gmail.com:

 We typically post design docs on JIRAs before major work starts. For
 instance, I'm pretty sure SPARK-1856 will have a design doc posted
 shortly.

 On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson e...@redhat.com wrote:
 
  Are interface designs being captured anywhere as documents that the
  community can follow along with as the proposals evolve?
 
  I've worked on other open source projects where design docs were
  published as living documents (e.g. on google docs, or etherpad, but the
  particular mechanism isn't crucial).   FWIW, I found that to be a good way
  to work in a community environment.
 
 
  - Original Message -
  Hi Egor,
 
  Thanks for the feedback! We are aware of some of the issues you
  mentioned and there are JIRAs created for them. Specifically, I'm
  pushing out the design on pipeline features and algorithm/model
  parameters this week. We can move our discussion to
  https://issues.apache.org/jira/browse/SPARK-1856 .
 
  It would be nice to have tests written against interfaces, but that
  definitely needs more discussion before making PRs. For example, we
  discussed the learning interfaces in Christoph's PR
  (https://github.com/apache/spark/pull/2137/) but it takes time to
  reach a consensus, especially on interfaces. Hopefully all of us can
  benefit from the discussion. The best practice is to break the
  proposal down into small independent pieces and discuss them on the
  JIRA before submitting PRs.
 
  For performance tests, there is a spark-perf package
  (https://github.com/databricks/spark-perf) and we added performance
  tests for MLlib in v1.1. But definitely more work needs to be done.
 
  The dev list may not be a good place for discussing the design. Could
  you create JIRAs for each of the issues you pointed out, so we can
  track the discussion there? Thanks!
 
  Best,
  Xiangrui
 
  On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin r...@databricks.com
  wrote:
   Xiangrui can comment more, but I believe Joseph and he are actually
   working on standardizing the interfaces and the pipeline feature for
   the 1.2 release.
  
   On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov
   pahomov.e...@gmail.com
   wrote:
  
   Some architecture suggestions on this matter:
   https://github.com/apache/spark/pull/2371
  
   2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com:
  
Sorry, I miswrote - I meant the learners part of the framework; models
already exist.
   
2014-09-12 15:53 GMT+04:00 Christoph Sawade christoph.saw...@googlemail.com:

I totally agree, and we also discovered some drawbacks with the
classification model implementations that are based on GLMs:

- There is no distinction between predicting scores, classes, and
calibrated scores (probabilities). For these models it is common to have
access to all of them, and the prediction function ``predict`` should be
consistent and stateless. Currently, the score is only available after
removing the threshold from the model.

Re: Adding abstraction in MLlib

2014-09-12 Thread Egor Pahomov
Some architecture suggestions on this matter:
https://github.com/apache/spark/pull/2371

2014-09-12 16:38 GMT+04:00 Egor Pahomov pahomov.e...@gmail.com:

 Sorry, I miswrote - I meant the learners part of the framework; models
 already exist.

 2014-09-12 15:53 GMT+04:00 Christoph Sawade 
 christoph.saw...@googlemail.com:

 I totally agree, and we also discovered some drawbacks with the
 classification model implementations that are based on GLMs:

 - There is no distinction between predicting scores, classes, and
 calibrated scores (probabilities). For these models it is common to have
 access to all of them, and the prediction function ``predict`` should be
 consistent and stateless. Currently, the score is only available after
 removing the threshold from the model (a sketch of such a split follows
 below).
 - There is no distinction between multinomial and binomial
 classification. For multinomial problems, it is necessary to handle
 multiple weight vectors and multiple confidences.
 - Models are not serializable, which makes it hard to use them in
 practice.
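
 To make the distinction concrete, here is a rough sketch in plain Scala of
 how raw scores, calibrated probabilities, and class predictions could be
 separated. All names are hypothetical and feature vectors are simplified to
 Array[Double]; this is not MLlib's actual API:

   // Hypothetical sketch, not existing MLlib code: raw scores, calibrated
   // probabilities, and hard class predictions as separate, stateless methods.
   trait ClassificationModel extends Serializable {
     /** Number of classes (2 for binomial, more for multinomial). */
     def numClasses: Int
     /** Raw, uncalibrated score per class (e.g. margins for a GLM). */
     def predictScores(features: Array[Double]): Array[Double]
     /** Calibrated per-class probabilities. */
     def predictProbabilities(features: Array[Double]): Array[Double]
     /** Hard class label; no mutable threshold stored on the model. */
     def predict(features: Array[Double]): Int =
       predictProbabilities(features).zipWithIndex.maxBy(_._1)._2
   }

   /** Minimal binomial example: logistic regression with one weight vector. */
   class LogisticModel(weights: Array[Double], intercept: Double)
       extends ClassificationModel {
     def numClasses: Int = 2
     def predictScores(features: Array[Double]): Array[Double] = {
       val margin = intercept + weights.zip(features).map { case (w, x) => w * x }.sum
       Array(-margin, margin)
     }
     def predictProbabilities(features: Array[Double]): Array[Double] = {
       val p = 1.0 / (1.0 + math.exp(-predictScores(features)(1)))
       Array(1.0 - p, p)
     }
   }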

 I started a pull request [1] some time ago. I would be happy to continue
 the discussion and clarify the interfaces, too!

 Cheers, Christoph

 [1] https://github.com/apache/spark/pull/2137/

 2014-09-12 11:11 GMT+02:00 Egor Pahomov pahomov.e...@gmail.com:

 Here at Yandex, while implementing gradient boosting in Spark and
 building our ML tool for internal use, we found the following serious
 problems in MLlib:


- There is no Regression/Classification model abstraction. We were
building abstract data-processing pipelines that should work with just
some regression, the exact algorithm being specified outside that code.
There is no abstraction that allows me to do that. *(This is the main
reason for all the further problems.)*
- There is no common practice in MLlib for testing algorithms: every
model generates its own random test data. There are no easily
extractable test cases applicable to another algorithm, and there are no
benchmarks for comparing algorithms. After implementing a new algorithm
it is very hard to understand how it should be tested.
- Lack of serialization testing: MLlib algorithms don't contain tests
that check the model still works after serialization.
- During implementation of a new algorithm it's hard to understand what
API you should create and which interface to implement.

 The starting point for solving all these problems is to create common
 interfaces for the typical algorithms/models: regression, classification,
 clustering, and collaborative filtering.
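
 For illustration, a minimal sketch of such a regression abstraction in plain
 Scala. Names and signatures are only a proposal, not existing MLlib code, and
 feature vectors are simplified to Array[Double]:

   // Hypothetical sketch: pipeline code would depend only on these traits,
   // with the concrete algorithm chosen outside of it.
   trait RegressionModel extends Serializable {
     /** Predict a real-valued target for a single feature vector. */
     def predict(features: Array[Double]): Double
     /** Batch prediction; in MLlib this would presumably take an RDD of vectors. */
     def predictAll(data: Seq[Array[Double]]): Seq[Double] = data.map(predict)
   }

   /** A learner produces a model from labeled data, whatever the algorithm. */
   trait RegressionLearner {
     def train(data: Seq[(Double, Array[Double])]): RegressionModel
   }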

 All the main tests should be written against these interfaces, so that when
 a new algorithm is implemented, all it has to do is pass the already written
 tests. That would give us manageable quality across the whole library.
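
 For example, building on the RegressionModel trait sketched above, a shared
 check could be written once and run against every implementation. The helper
 name and the tolerance-based criterion below are just an illustration:

   // Sketch of a test written against the abstraction, not a concrete algorithm.
   object RegressionContract {
     /** Fails if the model's mean absolute error on the labeled data exceeds tol. */
     def checkFit(model: RegressionModel,
                  data: Seq[(Double, Array[Double])],
                  tol: Double): Unit = {
       val mae = data.map { case (label, features) =>
         math.abs(model.predict(features) - label)
       }.sum / data.size
       require(mae <= tol, s"mean absolute error $mae exceeds tolerance $tol")
     }
   }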

 There should be a couple of benchmarks that allow a new Spark user to get
 a feeling for which algorithm to use.

 The test set against these abstractions should contain a serialization test.
 In production there is usually no use for a model that can't be stored.
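
 A round-trip check could also live in the shared test set. The sketch below
 uses plain Java serialization only because the hypothetical traits above
 extend Serializable; the actual mechanism is open for discussion:

   import java.io._

   // Sketch: serialize, deserialize, and require identical predictions.
   object SerializationContract {
     def checkRoundTrip(model: RegressionModel, features: Array[Double]): Unit = {
       val buffer = new ByteArrayOutputStream()
       val out = new ObjectOutputStream(buffer)
       out.writeObject(model)
       out.close()
       val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
       val restored = in.readObject().asInstanceOf[RegressionModel]
       in.close()
       assert(restored.predict(features) == model.predict(features),
         "restored model must give the same prediction as the original")
     }
   }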

 As the first step of this roadmap I'd like to create a trait
 RegressionModel, *ADD* methods to the current algorithms to implement this
 trait, and create some tests against it. I'm planning to do this next week.
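
 To show what "adding" the trait to an existing algorithm could look like,
 here is a simplified stand-in for a linear model (not the real MLlib class);
 the only change is mixing in the RegressionModel trait sketched above, so the
 existing public API is untouched:

   // Simplified stand-in; existing public API unchanged, trait mixed in on top.
   class LinearModelLike(val weights: Array[Double], val intercept: Double)
       extends RegressionModel {
     def predict(features: Array[Double]): Double =
       intercept + weights.zip(features).map { case (w, x) => w * x }.sum
   }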

 The purpose of this letter is to collect any objections to this approach at
 an early stage: please give any feedback. The second reason is to claim this
 activity so that we don't do the same thing twice: I'll create a pull request
 by the end of next week, and any parallelism in development can start from
 there.



 --



 *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*





 --



 *Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*




-- 



*Sincerely yours, Egor Pakhomov, Scala Developer, Yandex*

