Hi Egor, I posted the design doc for pipelines and parameters on the JIRA. Now I'm working out some details of ML datasets, which I will post later this week. Your feedback is welcome!
Best,
Xiangrui

On Mon, Sep 15, 2014 at 12:44 AM, Reynold Xin <r...@databricks.com> wrote:
> Hi Egor,
>
> Thanks for the suggestion. It is definitely our intention and practice to
> post design docs as soon as they are ready, and to keep iteration cycles
> short. As a matter of fact, we encourage design docs to be posted before
> implementation starts for major features, and WIP pull requests to be
> opened before large features are fully baked.
>
> That said, no, not 100% of a committer's time goes to a specific ticket.
> There are lots of tickets that are open for a long time before somebody
> starts actively working on them. So no, it is not true that "all this time
> was active development". Xiangrui should post the design doc as soon as it
> is ready for feedback.
>
> On Sun, Sep 14, 2014 at 11:26 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>> It's good that Databricks is working on this issue! However, the current
>> process is not very clear to an outsider.
>>
>> The last update on this ticket was August 5. If all this time was active
>> development, I have concerns that without feedback from the community for
>> such a long time, development can go down the wrong path.
>> Even if it ends up being one great big patch, introducing the new
>> interfaces to the community early would allow us to start working on our
>> pipeline code, to write algorithms in the new paradigm instead of in the
>> absence of any paradigm, as before, and to help you move old code over to
>> the new paradigm.
>>
>> My main point: shorter iterations with more transparency.
>>
>> I think it would be a good idea to create a pull request with the code
>> you have so far, even if it doesn't pass tests, just so we can comment on
>> it before it is written up in a design doc.
>>
>> 2014-09-13 0:00 GMT+04:00 Patrick Wendell <pwend...@gmail.com>:
>>> We typically post design docs on JIRAs before major work starts. For
>>> instance, I'm pretty sure SPARK-1856 will have a design doc posted
>>> shortly.
>>>
>>> On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson <e...@redhat.com> wrote:
>>>> Are interface designs being captured anywhere as documents that the
>>>> community can follow along with as the proposals evolve?
>>>>
>>>> I've worked on other open source projects where design docs were
>>>> published as "living documents" (e.g. on Google Docs or Etherpad; the
>>>> particular mechanism isn't crucial). FWIW, I found that to be a good
>>>> way to work in a community environment.
>>>>
>>>> ----- Original Message -----
>>>>> Hi Egor,
>>>>>
>>>>> Thanks for the feedback! We are aware of some of the issues you
>>>>> mentioned, and there are JIRAs created for them. Specifically, I'm
>>>>> pushing out the design on pipeline features and algorithm/model
>>>>> parameters this week. We can move our discussion to
>>>>> https://issues.apache.org/jira/browse/SPARK-1856 .
>>>>>
>>>>> It would be nice to have tests against interfaces, but that
>>>>> definitely needs more discussion before making PRs. For example, we
>>>>> discussed the learning interfaces in Christoph's PR
>>>>> (https://github.com/apache/spark/pull/2137/), but it takes time to
>>>>> reach a consensus, especially on interfaces. Hopefully all of us can
>>>>> benefit from the discussion. The best practice is to break the
>>>>> proposal down into small independent pieces and discuss them on the
>>>>> JIRA before submitting PRs.
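To make "tests against interfaces" concrete: a rough ScalaTest sketch of what such a shared suite could look like. The trait and member names here (PredictorModel, trainedModel, testPoints) are hypothetical placeholders, not anything that exists in MLlib; the point is that sanity and serialization checks are written once against the abstraction and inherited by every concrete algorithm's suite.

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
      ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.linalg.Vector
    import org.scalatest.FunSuite

    // Hypothetical stand-in for whatever common interface the design settles on.
    trait PredictorModel extends Serializable {
      def predict(features: Vector): Double
    }

    // Written once against the interface; a concrete algorithm's suite only
    // supplies a trained model and a few labelled points.
    abstract class PredictorModelSuite extends FunSuite {
      def trainedModel: PredictorModel
      def testPoints: Seq[(Vector, Double)]

      test("predictions are defined on all test points") {
        for ((x, _) <- testPoints) assert(!trainedModel.predict(x).isNaN)
      }

      test("model predicts identically after a serialization round trip") {
        val buf = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buf)
        out.writeObject(trainedModel)
        out.close()
        val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
        val copy = in.readObject().asInstanceOf[PredictorModel]
        for ((x, _) <- testPoints) assert(copy.predict(x) === trainedModel.predict(x))
      }
    }

A concrete suite would then only override trainedModel with, say, a fitted linear model and provide a handful of labelled points.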
>>>>> For performance tests, there is the spark-perf package
>>>>> (https://github.com/databricks/spark-perf), and we added performance
>>>>> tests for MLlib in v1.1. But definitely more work needs to be done.
>>>>>
>>>>> The dev list may not be a good place for a design discussion. Could
>>>>> you create a JIRA for each of the issues you pointed out, so we can
>>>>> track the discussion there? Thanks!
>>>>>
>>>>> Best,
>>>>> Xiangrui
>>>>>
>>>>> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin <r...@databricks.com> wrote:
>>>>>> Xiangrui can comment more, but I believe he and Joseph are actually
>>>>>> working on standardized interfaces and the pipeline feature for the
>>>>>> 1.2 release.
>>>>>>
>>>>>> On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>>>>>>> Some architecture suggestions on this matter:
>>>>>>> https://github.com/apache/spark/pull/2371
>>>>>>>
>>>>>>> 2014-09-12 16:38 GMT+04:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>>>>>>> Sorry, I miswrote: I meant the learners part of the framework; the
>>>>>>>> models already exist.
>>>>>>>>
>>>>>>>> 2014-09-12 15:53 GMT+04:00 Christoph Sawade <christoph.saw...@googlemail.com>:
>>>>>>>>> I totally agree, and we also discovered some drawbacks in the
>>>>>>>>> classification model implementations that are based on GLMs:
>>>>>>>>>
>>>>>>>>> - There is no distinction between predicting scores, classes, and
>>>>>>>>> calibrated scores (probabilities). For these models it is common
>>>>>>>>> to have access to all of them, and the prediction function
>>>>>>>>> ``predict`` should be consistent and stateless. Currently, the
>>>>>>>>> score is only available after removing the threshold from the
>>>>>>>>> model.
>>>>>>>>> - There is no distinction between multinomial and binomial
>>>>>>>>> classification. For multinomial problems, it is necessary to
>>>>>>>>> handle multiple weight vectors and multiple confidences.
>>>>>>>>> - Models are not serialisable, which makes it hard to use them in
>>>>>>>>> practice.
>>>>>>>>>
>>>>>>>>> I started a pull request [1] some time ago. I would be happy to
>>>>>>>>> continue the discussion and clarify the interfaces, too!
>>>>>>>>>
>>>>>>>>> Cheers, Christoph
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/spark/pull/2137/
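For illustration, a minimal Scala sketch of an interface shaped by those three points. The trait and method names are hypothetical, not MLlib's current API; a multinomial implementation would keep one weight vector per class behind score(...).

    import org.apache.spark.mllib.linalg.Vector

    trait ProbabilisticClassifierModel extends Serializable { // point 3: serializable
      /** 2 for binomial models, k for multinomial ones (point 2). */
      def numClasses: Int

      /** Raw, uncalibrated scores (e.g. GLM margins), one per class. */
      def score(features: Vector): Array[Double]

      /** Calibrated scores: class probabilities summing to one. */
      def probability(features: Vector): Array[Double]

      /** Hard class assignment, derived statelessly from the probabilities
        * (point 1): no threshold is stored in, or removed from, the model. */
      def predict(features: Vector): Double =
        probability(features).zipWithIndex.maxBy(_._1)._2.toDouble
    }

A caller that wants binary thresholding can apply its own threshold to probability(...), leaving predict consistent and stateless.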
>>>>>>>>> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>>>>>>>>>> Here at Yandex, while implementing gradient boosting in Spark
>>>>>>>>>> and creating our ML tool for internal use, we found the
>>>>>>>>>> following serious problems in MLlib:
>>>>>>>>>>
>>>>>>>>>> - There is no Regression/Classification model abstraction. We
>>>>>>>>>> were building abstract data processing pipelines that should
>>>>>>>>>> work with just some regression, with the exact algorithm
>>>>>>>>>> specified outside this code. There is no abstraction that allows
>>>>>>>>>> me to do that. (This is the main reason for all the further
>>>>>>>>>> problems.)
>>>>>>>>>> - There is no common practice in MLlib for testing algorithms:
>>>>>>>>>> every model generates its own random test data. There are no
>>>>>>>>>> easily extractable test cases applicable to another algorithm,
>>>>>>>>>> and no benchmarks for comparing algorithms. After implementing a
>>>>>>>>>> new algorithm, it's very hard to understand how it should be
>>>>>>>>>> tested.
>>>>>>>>>> - Lack of serialization testing: MLlib algorithms don't contain
>>>>>>>>>> tests which check that a model works after serialization.
>>>>>>>>>> - While implementing a new algorithm, it's hard to understand
>>>>>>>>>> what API you should create and which interface to implement.
>>>>>>>>>>
>>>>>>>>>> The starting point for solving all these problems must be the
>>>>>>>>>> creation of common interfaces for the typical algorithms/models:
>>>>>>>>>> regression, classification, clustering, collaborative filtering.
>>>>>>>>>>
>>>>>>>>>> All main tests should be written against these interfaces, so
>>>>>>>>>> that when a new algorithm is implemented, all it has to do is
>>>>>>>>>> pass the already-written tests. That would give us manageable
>>>>>>>>>> quality across the whole library.
>>>>>>>>>>
>>>>>>>>>> There should be a couple of benchmarks which give a new Spark
>>>>>>>>>> user a feeling for which algorithm to use.
>>>>>>>>>>
>>>>>>>>>> The test set against these abstractions should contain
>>>>>>>>>> serialization tests. In production, most of the time there is no
>>>>>>>>>> need for a model that can't be stored.
>>>>>>>>>>
>>>>>>>>>> As the first step of this roadmap I'd like to create a trait
>>>>>>>>>> RegressionModel, *ADD* methods to the current algorithms to
>>>>>>>>>> implement this trait, and create some tests against it. I'm
>>>>>>>>>> planning to do this next week.
>>>>>>>>>>
>>>>>>>>>> The purpose of this letter is to collect any objections to this
>>>>>>>>>> approach at an early stage: please give any feedback. The second
>>>>>>>>>> reason is to put a lock on this activity so we don't do the same
>>>>>>>>>> thing twice: I'll create a pull request by the end of next week,
>>>>>>>>>> and any parallelism in development can start from there.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sincerely yours,
>>>>>>>>>> Egor Pakhomov
>>>>>>>>>> Scala Developer, Yandex
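As one possible shape for that first step (the names and the default bulk method below are assumptions for discussion, not existing MLlib code), the trait could expose single-point and bulk prediction, with pipeline code depending only on the abstraction:

    import org.apache.spark.SparkContext._ // implicits for RDD[Double].mean()
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    trait RegressionModel extends Serializable {
      /** Predict the target for a single feature vector. */
      def predict(features: Vector): Double

      /** Bulk prediction, derived from the single-point method by default. */
      def predict(features: RDD[Vector]): RDD[Double] = features.map(v => predict(v))
    }

    // Pipeline code can now take "some regression", with the exact
    // algorithm specified outside this code:
    object RegressionEvaluation {
      def meanSquaredError(model: RegressionModel,
                           data: RDD[(Vector, Double)]): Double =
        data.map { case (x, y) => val e = model.predict(x) - y; e * e }.mean()
    }

Existing models would then only need the added methods of this trait to be usable from tests and pipelines written against it.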
>>
>> --
>> Sincerely yours,
>> Egor Pakhomov
>> Scala Developer, Yandex