Re: [Scikit-learn-general] Exclusivity of scikit-learn

Joel Nothman Wed, 03 Dec 2014 15:56:51 -0800

While anything is better than publishing an extended fork of the main
repository, I would like to see someone cite an instance where an
open-slather contrib repository has been particularly successful
(especially one where diverse contributions are assured). In line with
Gaël's experience of sandbox coderot, I think it provides very little
benefit over distributed open-source repositories.


For example, let's say someone has implemented an algorithm (Affinity
Propagation is what triggered this discussion so you might consider that).
Someone else wants to come and add features to it, or even just clean the
code, but by this time the original contributor has moved on to greener
pastures and is not interested in responding to a pull request. Who has the
right, and who the responsibility, to say that this change should be
allowed? Does the contrib repository, too, require an army of maintainers
to familiarise themselves with a vast collection of moderate-quality code?
Without strict gatekeepers, a centralised repository provides almost
nothing, and with strict gatekeepers it entails exactly the issue that we
are trying to solve.

The model of a distributed plugin library (think Django) seems much more
successful when diversity and changing/variant needs are inevitable. Each
contribution is published individually on PyPI and/or open-source hosting,
and someone curates or facilitates a centralised library (like
djangopackages.com). When a contributor doesn't want to maintain anymore,
the project is forked; and the fittest survive.

At the same time, scikit-learn is already trying to facilitate external
contributions:

   - it is working towards an estimator verification API
   <https://github.com/scikit-learn/scikit-learn/issues/3810> so that it is
   easy to test that externally-contributed estimators conform to many
   scikit-learn API standards. Contributions to developing this are welcome!
   - Gaël has commissioned a sphinx plugin
   <https://github.com/sphinx-gallery/sphinx-gallery> to make it easy for
   projects to build documentation by example as in scikit-learn's example
   gallery <http://scikit-learn.org/stable/auto_examples/>. Perhaps this
   could also facilitate displaying external examples in the contrib library
   (but only if someone is willing to code up such a feature!).
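
To make the first point concrete, here is a rough sketch of the kind of
conformance check such a verification API could run against an external
estimator. It is only an illustration under my own assumptions: the helper
check_conventions and the toy MeanPredictor class are hypothetical names,
not part of scikit-learn, and the real checks would cover far more.

```python
import inspect


class MeanPredictor:
    """Toy external estimator that always predicts the training mean."""

    def __init__(self, shrink=1.0):
        # Convention: __init__ only stores its arguments, unchanged.
        self.shrink = shrink

    def get_params(self, deep=True):
        return {"shrink": self.shrink}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Learned attributes get a trailing underscore.
        self.mean_ = self.shrink * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def check_conventions(estimator_cls):
    """Return a list of convention violations (empty if none were found).

    Hypothetical helper for illustration only; not scikit-learn's API.
    """
    problems = []
    for method in ("fit", "get_params", "set_params"):
        if not callable(getattr(estimator_cls, method, None)):
            problems.append("missing method: " + method)
    if problems:
        return problems

    # Every constructor argument needs a default, so that the class
    # can be instantiated without arguments.
    signature = inspect.signature(estimator_cls.__init__)
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        if param.default is inspect.Parameter.empty:
            problems.append("no default for parameter: " + name)
    if problems:
        return problems

    estimator = estimator_cls()
    params = estimator.get_params()
    # get_params must report exactly the constructor arguments.
    expected = set(signature.parameters) - {"self"}
    if set(params) != expected:
        problems.append("get_params does not match __init__ arguments")
    # set_params and fit must return the estimator itself.
    if estimator.set_params(**params) is not estimator:
        problems.append("set_params does not return self")
    if estimator.fit([[0.0], [1.0]], [0.0, 1.0]) is not estimator:
        problems.append("fit does not return self")
    return problems
```

An external package could call check_conventions(MeanPredictor) in its own
test suite and fail the build if the returned list is not empty.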

Making a template repository that people can clone to get started writing
an external package might be a nice extension of these ideas. Another idea
would be to have a conventional prefix for packages that extend
scikit-learn (just as django packages tend to be prefixed in PyPI by
django-).

Still, I think facilitating the construction of, and access to, external
projects will be much more manageable than a centralised contribs repo, and
may even streamline contribution back to the main repository.

On 4 December 2014 at 03:18, Gael Varoquaux <[email protected]>
wrote:

> On Wed, Dec 03, 2014 at 09:56:55AM -0500, Satrajit Ghosh wrote:
> > - let the community (to put zero additional burden on the current
> > maintainers) maintain a fork of scikit-learn that provides no guarantees
> > other than it is kept up to date with scikit-learn/master.
>
> The problem with this is that we are still going to have our tracker
> filled with problems that are related to the fork, and not master. To put
> things in perspective, our tracker has 336 issues open, and 1318 closed.
> Just keeping track of those issues is very hard.
>
> Thus the need for a different repo (eg scikit-learn-contrib, as suggested
> by Mathieu).
>
> > - people are welcome to add any algorithms to this (trivial, non-trivial,
> > recent)
>
> What you are suggesting is very similar to things that have been tried as
> a 'sandbox', for instance in scipy. Experience has shown that the code
> rots, because nobody feels responsible for it. It's been tried, it
> fails, but if you feel like doing it, you should go ahead. Do you need
> anything from us?
>
> I would believe more in separate repos in a 'scikit-learn-contrib' github
> organization, because it would give a feeling of responsibility to the
> different owners of the repos.
>
> > - folks don't have to recreate packaging
>
> I don't understand: if there are releases, and packaging, someone has to
> do it. It doesn't happen just like this. It's actually a lot of work.
>
> If it's just a fork, without any releases, what's the gain? In addition,
> if somebody is not doing the work of making sure that it builds and runs
> on various platforms, quite quickly it will stop working on different
> versions of Python and different platforms.
>
> > - it brings all the folks who are forking anyway together instead of
> > splitting off into forks (multiple forks are harder to use)
>
> But someone has to be making the merges :). So the work is there.
>
> > - it makes for increased availability of algorithms that may be useful in
> > practice but never make it out because the world is biased towards
> > loudspeakers
>
> Probably, provided that the project actually flies. But I really fear
> coderot. The amount of work to keep the scikit-learn project going is
> just huge. If nobody is doing this work, coderot would come in very
> quickly.
>
> > - it doesn't add anything to the current maintainers' plates, nor take
> > away anything from the main project. Perhaps those wishing to add things
> > will take it upon themselves to maintain this fork.
>
> As long as it is called differently, and _has a different import name_.
> If not, I can easily foresee the situation where users are complaining
> about scikit-learn and after a long debugging session we find that they
> are running some weird fork.
>
>
> I think that there is something flawed in the way you see the life of a
> project like scikit-learn. You seem to think that it is just an
> accumulation of code. That putting code together is enough to make a
> project successful. But if that's the case, why don't you just create
> something else, just anything else, and accumulate code? More
> importantly, why do you want algorithms in scikit-learn? Why aren't you
> happy with just code on Internet that you can download? If you ask
> yourself these questions, you will probably find where the value of
> scikit-learn lies, and this will also tell you why there is a huge effort
> in maintaining scikit-learn.
>
>
> Things like this, eg sandboxes where there is no feeling of belonging to
> a global project and no harmonizing effort, have been tried in the past.
> They fail because of coderot. Actually, to put a historical perspective,
> a long time ago, there was a scipy 'sandbox', in the scipy SVN. It didn't
> have much working, mostly dead code. We hypothesized that it was because
> of lack of visibility, so the 'sandbox' was cleaned, separated in some
> structure, and renamed 'scikits'. Scikits weren't getting much traction
> inside the scipy codebase, because people were having a hard time working
> there (back then it was an SVN, but there was also the problem of
> compiling scipy, which is a bit hard). So we started pulling things out
> of the SVN. And that's how the current scikits were born. Some of these
> scikits took off, because they had a clear project management: releases,
> documentation, quality.
>
> It's interesting that almost ten years later, we are running into the same
> problems. I think that this is not by chance. The reasons that these
> evolutions happen are the following:
>
> 1. Projects are non-linearly hard to evolve. Bigger projects are harder to
>    drive than small projects, and significantly. This is a very very true
>    law of project management and is really underestimated by too many [1].
>
> 2. People want different things, and that's perfectly legitimate. The
>    statsmodels guys wanted control on p-values. The scikit-learn guys
>    wanted good prediction. Both usecases are valid (I am an avid user of
>    statsmodels), but doing both in the same project was much, much harder
>    than doing two projects.
>
> Thus I think that it is natural that some ecosystem of different
> projects, from general to specific, shapes up. Yes, it's very important to
> keep in mind the big picture, and that people with close enough goals unite,
> but only in balance with point 1.
>
> By the way, I care very much about the ecosystem. When we split off HMMs,
> I spent half a day making them a separate package, with setup.py, travis,
> a README, examples, documentation:
> https://github.com/hmmlearn
> It did take a good 4 hours. Nothing happens for free. I did this even
> though I do not use HMMs at all.
>
>
> In terms of action points, to summarize my position:
>
> - You are free to create a fork. I strongly ask that you change the
>   import name, otherwise you will be putting a burden on the main
>   scikit-learn maintainers.
>
> - What I think could work would be a scikit-learn-contrib organization with
>   separate repositories in it. I see that Mathieu and Andy have the same
>   feeling. I think we all agree that it should be done. I am ready to
>   create the organization, and give you (and many others) the keys of the
>   kingdom.
>
> Gaël
>
>
> [1] This has actually been studied. Here is one paper (out of probably
>     many): http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1702600
>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
