On Wed, Dec 03, 2014 at 09:56:55AM -0500, Satrajit Ghosh wrote:
> - let the community (to put zero additional burden on the current maintainers)
> maintain a fork of scikit-learn that provides no guarantees other than it is
> kept upto date with scikit-learn/master.
The problem with this is that we are still going to have our tracker filled with problems that are related to the fork, and not to master. To put things in perspective, our tracker has 336 issues open and 1318 closed. Just keeping track of those issues is very hard. Thus the need for a different repo (e.g. scikit-learn-contrib, as suggested by Mathieu).

> - people are welcome to add any algorithms to this (trivial, non-trivial,
> recent)

What you are suggesting is very similar to things that have been tried as a 'sandbox', for instance in scipy. Experience has shown that the code rots, because nobody feels responsible for it. It's been tried, it fails, but if you feel like doing it, you should go ahead. Do you need anything from us?

I would believe more in separate repos in a 'scikit-learn-contrib' github organization, because it would give a feeling of responsibility to the different owners of the repos.

> - folks don't have to recreate packaging

I don't understand: if there are releases and packaging, someone has to do them. It doesn't just happen. It's actually a lot of work. If it's just a fork, without any releases, what's the gain?

In addition, if nobody is doing the work of making sure that it builds and runs on various platforms, it will quite quickly stop working on different versions of Python and different platforms.

> - it brings all the folks who are forking anyway together instead of splitting
> off into forks (multiple forks are harder to use)

But someone has to be making the merges :). So the work is there.

> - it makes for increased availability of algorithms that may be useful in
> practice but never makes it out because the world is biased towards
> loudspeakers

Probably, provided that the project actually flies. But I really fear coderot. The amount of work needed to keep the scikit-learn project going is just huge. If nobody is doing this work, coderot will set in very quickly.
> - it doesn't add anything to the current maintainers plates, nor take away
> anything from the main project. perhaps those wishing to add things will take
> it upon themselves to maintain this fork.

As long as it has a different name, and _has a different import name_. If not, I can easily forecast a situation where users complain about scikit-learn and, after a long debugging session, we find that they are running some weird fork.

I think that there is something flawed in the way you see the life of a project like scikit-learn. You seem to think that it is just an accumulation of code, that putting code together is enough to make a project successful. But if that's the case, why don't you just create something else, just anything else, and accumulate code? More importantly, why do you want algorithms in scikit-learn? Why aren't you happy with just code on the Internet that you can download? If you ask yourself these questions, you will probably find where the value of scikit-learn lies, and this will also tell you why there is a huge effort in maintaining scikit-learn.

Things like this, e.g. sandboxes where there is no feeling of belonging to a global project and no harmonizing effort, have been tried in the past. They fail because of coderot.

Actually, to give some historical perspective: a long time ago, there was a scipy 'sandbox' in the scipy SVN. Not much in it worked; it was mostly dead code. We hypothesized that this was because of a lack of visibility, so the 'sandbox' was cleaned, given some structure, and renamed 'scikits'. Scikits weren't getting much traction inside the scipy codebase, because people were having a hard time working there (back then it was in SVN, but there was also the problem of compiling scipy, which is a bit hard). So we started pulling things out of the SVN. And that's how the current scikits were born. Some of these scikits took off, because they had clear project management: releases, documentation, quality.
It's interesting that almost ten years later, we are running into the same problems. I think that this is not by chance. These evolutions happen for the following reasons:

1. Projects are non-linearly hard to evolve. Bigger projects are significantly harder to drive than small ones. This is a very, very true law of project management, and it is underestimated by too many [1].

2. People want different things, and that's perfectly legitimate. The statsmodels guys wanted control over p-values. The scikit-learn guys wanted good prediction. Both use cases are valid (I am an avid user of statsmodels), but doing both in the same project was much, much harder than doing two projects.

Thus I think that it is natural that an ecosystem of different projects, from general to specific, takes shape. Yes, it's very important to keep the big picture in mind, and that people with close enough goals unite, but only in balance with point 1.

By the way, I care very much about the ecosystem. When we split off HMMs, I spent half a day making them a separate package, with setup.py, travis, a README, examples, documentation:
https://github.com/hmmlearn
It did take a good 4 hours. Nothing happens for free. I did this even though I do not use HMMs at all.

In terms of action points, to summarize my position:

- You are free to create a fork. I strongly ask that you change the import name, otherwise you will be putting a burden on the main scikit-learn maintainers.

- What I think could work would be a scikit-learn-contrib organization with different repositories in it. I see that Mathieu and Andy have the same feeling. I think we all agree that it should be done. I am ready to create the organization, and give you (and many others) the keys of the kingdom.

Gaël

[1] This has actually been studied.
Here is one paper (out of probably many):
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1702600

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general