hi gael and joel,
i'll insert a short response here. i actually agree with all the things
both of you said. i will however comment on two things:
1. algorithmic scenarios:
a. adding algorithms that can be built directly on top of the scikit-learn api
b. adding algorithms that require refactoring some, but not all, of the
underlying pieces.
in case a), i could simply have a python script, i don't need a fork, but
in case b), i need a fork.
2. i love decentralization, but the current architecture doesn't allow me
to do a very simple use case: i want to compare models in scikit-learn to
models outside scikit-learn. what's nice about the api is that it makes
comparing models easy; i can search over various models. however, if i have
to install or merge 5 different scikit-learn forks to be able to compare
those algorithms that are not in scikit-learn, that becomes expensive. if i
could do this in an easier manner, i wouldn't really ask for a common
bleeding-edge repo.
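
to make the comparison point concrete, here is a minimal sketch of why a shared fit/predict convention makes comparing models cheap. it is dependency-free on purpose: the two classes and the compare() helper are hypothetical and only imitate the scikit-learn calling convention, they are not part of scikit-learn or of any fork. the assumption is that an "outside" model exposing the same two methods can be dropped into the same loop as a library model.

```python
# Dependency-free sketch: these classes only imitate the fit/predict
# convention. They are hypothetical and not part of scikit-learn.
from collections import Counter


class MajorityClassifier:
    """Hypothetical 'external' model: always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighborClassifier:
    """Hypothetical stand-in for a library model: 1-nearest-neighbor."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def closest(row):
            # Squared Euclidean distance to every training row.
            dists = [sum((a - b) ** 2 for a, b in zip(row, seen))
                     for seen in self.X_]
            return self.y_[dists.index(min(dists))]
        return [closest(row) for row in X]


def accuracy(model, X, y):
    """Fraction of predictions that match the true labels."""
    predictions = model.predict(X)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


def compare(models, X_train, y_train, X_test, y_test):
    """Fit every model on the same split and return {name: accuracy}."""
    return {
        name: accuracy(model.fit(X_train, y_train), X_test, y_test)
        for name, model in models.items()
    }


X_train = [[0.0], [0.1], [0.2], [1.0], [1.1]]
y_train = [0, 0, 0, 1, 1]
X_test = [[0.05], [1.05]]
y_test = [0, 1]

scores = compare(
    {"majority": MajorityClassifier(), "nearest": NearestNeighborClassifier()},
    X_train, y_train, X_test, y_test,
)
print(scores)
```

the loop in compare() never needs to know where a model came from, which is the property i care about. the cost i am describing is not in the loop, it is in getting five separately forked code bases importable side by side.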
cheers,
satra
On Wed, Dec 3, 2014 at 6:55 PM, Joel Nothman <joel.noth...@gmail.com> wrote:
> While anything is better than publishing an extended fork of the main
> repository, I would like to see someone cite an instance where an
> open-slather contrib repository has been particularly successful
> (especially one where diverse contributions are assured). In line with
> Gaël's experience of sandbox coderot, I think it provides very little
> benefit over distributed open-source repositories.
>
> For example, let's say someone has implemented an algorithm (Affinity
> Propagation is what triggered this discussion so you might consider that).
> Someone else wants to come and add features to it, or even just clean the
> code, but by this time the original contributor has moved onto greener
> pastures and is not interested in responding to a pull request. Who has the
> right, and who the responsibility, to say that this change should be
> allowed? Does the contrib repository, too, require an army of maintainers
> to familiarise themselves with a vast collection of moderate-quality code?
> Without strict gatekeepers, a centralised repository provides almost
> nothing, and with strict gatekeepers it entails exactly the issue that we
> are trying to solve.
>
> The model of a distributed plugin library (think Django) seems much more
> successful when diversity and changing/variant needs are inevitable. Each
> contribution is published individually on PyPI and/or open-source hosting,
> and someone curates or facilitates a centralised library (like
> djangopackages.com). When a contributor doesn't want to maintain anymore,
> the project is forked; and the fittest survive.
>
> At the same time, scikit-learn is already trying to facilitate external
> contributions:
>
> - it is working towards an estimator verification API
> <https://github.com/scikit-learn/scikit-learn/issues/3810> so that it
> is easy to test that externally-contributed estimators conform to many
> scikit-learn API standards. Contributions to developing this are welcome!
> - Gaël has commissioned a sphinx plugin
> <https://github.com/sphinx-gallery/sphinx-gallery> to make it easy for
> projects to build documentation by example as in scikit-learn's example
> gallery <http://scikit-learn.org/stable/auto_examples/>. Perhaps this
> could facilitate also displaying external examples in the contrib library
> (but only if someone is willing to code up such a feature!).
>
> Making a template repository that people can clone to get started writing
> an external package might be a nice extension of these ideas. Another idea
> would be to have a conventional prefix for packages that extend
> scikit-learn (just as django packages tend to be prefixed in PyPI by
> django-).
>
> Still, I think facilitating the construction and access to external
> projects will be much more wieldy than a centralised contribs repo, and may
> even streamline contribution back to the main repository.
>
> On 4 December 2014 at 03:18, Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:
>
>> On Wed, Dec 03, 2014 at 09:56:55AM -0500, Satrajit Ghosh wrote:
>> > - let the community (to put zero additional burden on the current
>> > maintainers) maintain a fork of scikit-learn that provides no guarantees
>> > other than it is kept up to date with scikit-learn/master.
>>
>> The problem with this is that we are still going to have our tracker
>> filled with problems that are related to the fork, and not master. To put
>> things in perspective, our tracker has 336 issues open, and 1318 closed.
>> Just keeping track of those issues is very hard.
>>
>> Thus the need for a different repo (eg scikit-learn-contrib, as suggested
>> by Mathieu).
>>
>> > - people are welcome to add any algorithms to this (trivial,
>> > non-trivial, recent)
>>
>> What you are suggesting is very similar to things that have been tried as
>> a 'sandbox', for instance in scipy. Experience has shown that the code
>> rots, because nobody feels responsible for the code. It's been tried, it
>> fails, but if you feel like doing it, you should go ahead. Do you need
>> anything from us?
>>
>> I would believe more in separate repos in a 'scikit-learn-contrib' github
>> organization, because it would give a feeling of responsibility to the
>> different owners of the repos.
>>
>> > - folks don't have to recreate packaging
>>
>> I don't understand: if there are releases, and packaging, someone has to
>> do it. It doesn't happen just like this. It's actually a lot of work.
>>
>> If it's just a fork, without any releases, what's the gain? In addition,
>> if somebody is not doing the work of making sure that it builds and runs
>> on various platforms, quite quickly it will stop working on different
>> versions of Python and different platforms.
>>
>> > - it brings all the folks who are forking anyway together instead of
>> > splitting off into forks (multiple forks are harder to use)
>>
>> But someone has to be making the merges :). So the work is there.
>>
>> > - it makes for increased availability of algorithms that may be useful
>> > in practice but never make it out because the world is biased towards
>> > loudspeakers
>>
>> Probably, provided that the project actually flies. But I really fear
>> coderot. The amount of work to keep the scikit-learn project going is
>> just huge. If nobody is doing this work, coderot would come in very
>> quickly.
>>
>> > - it doesn't add anything to the current maintainers' plates, nor take
>> > away anything from the main project. perhaps those wishing to add things
>> > will take it upon themselves to maintain this fork.
>>
>> As long as it is called differently, and _has a different import name_.
>> If not, I can easily foresee the situation where users are complaining
>> about scikit-learn and after a long debugging session we find that they
>> are running some weird fork.
>>
>>
>> I think that there is something flawed in the way you see the life of a
>> project like scikit-learn. You seem to think that it is just an
>> accumulation of code. That putting code together is enough to make a
>> project successful. But if that's the case, why don't you just create
>> something else, just anything else, and accumulate code? More
>> importantly, why do you want algorithms in scikit-learn? Why aren't you
>> happy with just code on Internet that you can download? If you ask
>> yourself these questions, you will probably find where the value of
>> scikit-learn lies, and this will also tell you why there is a huge effort
>> in maintaining scikit-learn.
>>
>>
>> Things like this, eg sandboxes where there is no feeling of belonging to
>> a global project and no harmonizing effort, have been tried in the past.
>> They fail because of coderot. Actually, to put a historical perspective,
>> a long time ago, there was a scipy 'sandbox', in the scipy SVN. It didn't
>> have much working, mostly dead code. We hypothesized that it was because
>> of lack of visibility, so the 'sandbox' was cleaned, separated in some
>> structure, and renamed 'scikits'. Scikits weren't getting much traction
>> inside the scipy codebase, because people were having a hard time working
>> there (back then it was an SVN, but there was also the problem of
>> compiling scipy, which is a bit hard). So we started pulling things out
>> of the SVN. And that's how the current scikits were born. Some of these
>> scikits took off, because they had a clear project management: releases,
>> documentation, quality.
>>
>> It's interesting that almost ten years later, we are running into the same
>> problems. I think that this is not by chance. The reasons that these
>> evolutions happen are the following:
>>
>> 1. Projects are non-linearly hard to evolve. Bigger projects are
>> significantly harder to drive than small projects. This is a very real
>> law of project management and is really underestimated by too many [1].
>>
>> 2. People want different things, and that's perfectly legitimate. The
>> statsmodels guys wanted control on p-values. The scikit-learn guys
>> wanted good prediction. Both usecases are valid (I am an avid user of
>> statsmodels), but doing both in the same project was much, much harder
>> than doing two projects.
>>
>> Thus I think that it is natural that some ecosystem of different
>> projects, from general to specific, shapes up. Yes, it's very important to
>> keep in mind the big picture, and that people with close enough goals unite,
>> but only in balance with point 1.
>>
>> By the way, I care very much about the ecosystem. When we split off HMMs,
>> I spent half a day making them a separate package, with setup.py, travis,
>> a README, examples, documentation:
>> https://github.com/hmmlearn
>> It did take a good 4 hours. Nothing happens for free. I did this even
>> though I do not use HMMs at all.
>>
>>
>> In terms of action points, to summarize my position:
>>
>> - You are free to create a fork. I strongly ask that you change the
>> import name; otherwise you will be putting a burden on the main
>> scikit-learn maintainers.
>>
>> - What I think could work would be a scikit-learn-contrib organization
>> with different repositories in it. I see that Mathieu and Andy have the
>> same feeling. I think we all agree that it should be done. I am ready to
>> create the organization, and give you (and many others) the keys of the
>> kingdom.
>>
>> Gaël
>>
>>
>> [1] This has actually been studied. Here is one paper (out of probably
>> many): http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1702600
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>> with Interactivity, Sharing, Native Excel Exports, App Integration & more
>> Get technology previously reserved for billion-dollar corporations, FREE
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>