Sorry, I miswrote - I meant the learners part of the framework - the models already exist.
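To make the idea concrete, here is a rough, hypothetical sketch of such a trait plus one generic test (a serialization round-trip) written against it. The names `RegressionModel`, `LinearModel`, and `roundTrip` are illustrative only, not existing MLlib API:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// Hypothetical common abstraction: any regression learner's model
// exposes predict() and is serializable by contract.
trait RegressionModel extends Serializable {
  def predict(features: Array[Double]): Double
}

// A trivial linear model, only to show that generic tests can be
// written once against the trait rather than per algorithm.
class LinearModel(val weights: Array[Double], val intercept: Double)
    extends RegressionModel {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum + intercept
}

// Generic serialization test, applicable to every RegressionModel.
def roundTrip(model: RegressionModel): RegressionModel = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(model)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  in.readObject().asInstanceOf[RegressionModel]
}

val m = new LinearModel(Array(2.0, -1.0), 0.5)
val restored = roundTrip(m)
println(m.predict(Array(1.0, 3.0)))        // 2*1 - 1*3 + 0.5 = -0.5
println(restored.predict(Array(1.0, 3.0))) // same prediction after round-trip
```

With an interface like this, a pipeline can be written once against `RegressionModel`, and every new algorithm only needs to implement the trait to inherit the shared test suite.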
2014-09-12 15:53 GMT+04:00 Christoph Sawade <christoph.saw...@googlemail.com>:

> I totally agree, and we also discovered some drawbacks with the
> classification model implementations that are based on GLMs:
>
> - There is no distinction between predicting scores, classes, and
> calibrated scores (probabilities). For these models it is common to have
> access to all of them, and the prediction function ``predict`` should be
> consistent and stateless. Currently, the score is only available after
> removing the threshold from the model.
> - There is no distinction between multinomial and binomial classification.
> For multinomial problems, it is necessary to handle multiple weight vectors
> and multiple confidences.
> - Models are not serialisable, which makes it hard to use them in practice.
>
> I started a pull request [1] some time ago. I would be happy to continue
> the discussion and clarify the interfaces, too!
>
> Cheers, Christoph
>
> [1] https://github.com/apache/spark/pull/2137/
>
> 2014-09-12 11:11 GMT+02:00 Egor Pahomov <pahomov.e...@gmail.com>:
>
>> Here at Yandex, while implementing gradient boosting in Spark and
>> building our ML tool for internal use, we found the following serious
>> problems in MLlib:
>>
>> - There is no Regression/Classification model abstraction. We were
>> building abstract data processing pipelines that should work with any
>> regression - the exact algorithm is specified outside this code. There
>> is no abstraction that allows me to do that. *(This is the main reason
>> for all the further problems.)*
>> - There is no common practice in MLlib for testing algorithms: every
>> model generates its own random test data. There are no easily
>> extractable test cases applicable to other algorithms, and there are no
>> benchmarks for comparing algorithms. After implementing a new algorithm,
>> it is very hard to understand how it should be tested.
>> - Lack of serialization testing: MLlib algorithms don't contain tests
>> verifying that a model still works after serialization.
>> - When implementing a new algorithm, it's hard to understand what API
>> you should create and which interface to implement.
>>
>> The starting point for solving all these problems is to create common
>> interfaces for the typical algorithms/models - regression,
>> classification, clustering, collaborative filtering.
>>
>> All main tests should be written against these interfaces, so that when
>> a new algorithm is implemented, all it needs to do is pass the already
>> written tests. That would let us keep quality manageable across the
>> whole library.
>>
>> There should be a couple of benchmarks that give a new Spark user a
>> feeling for which algorithm to use.
>>
>> The test set against these abstractions should contain a serialization
>> test. In production, most of the time there is no use for a model that
>> can't be stored.
>>
>> As the first step of this roadmap, I'd like to create a trait
>> RegressionModel, *add* methods to the current algorithms to implement
>> this trait, and create some tests against it. I'm planning to do this
>> next week.
>>
>> The purpose of this letter is to collect any objections to this
>> approach at an early stage: please give any feedback. The second reason
>> is to set a lock on this activity so we don't do the same thing twice:
>> I'll create a pull request by the end of next week, and any parallel
>> development can start from there.
>>
>> --
>>
>> *Sincerely yours,
>> Egor Pakhomov
>> Scala Developer, Yandex*
>

--
*Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex*