Re: [Scikit-learn-general] Exclusivity of scikit-learn

Joel Nothman Wed, 03 Dec 2014 18:21:24 -0800

I know what you mean by needing new features or refactoring inside the main
project. I've got a case that requires a more polymorphic definition of
sklearn.base.clone. I think such changes should be possible within the main
repo, and need to be argued by their proponent, with tests documented to
say that a feature is required for external projects.
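
For context, here is a minimal sketch of what parameter-based cloning
amounts to, and where a more polymorphic hook could sit. It is not
scikit-learn's implementation: simple_clone, the clone_hook method and the
toy Scaler class are hypothetical names used only for illustration. The one
convention borrowed from scikit-learn is that an estimator exposes
get_params() and takes its parameters as constructor keywords.

```python
import copy


def simple_clone(estimator):
    """Return a new, unfitted copy built from constructor parameters.

    Hypothetical sketch, not sklearn.base.clone itself.
    """
    # Polymorphic escape hatch (an assumption for illustration): an object
    # that knows how to copy itself can provide its own hook.
    hook = getattr(estimator, "clone_hook", None)
    if callable(hook):
        return hook()
    # Default path: rebuild from parameters, deep-copying each value so the
    # copy shares no mutable state with the original.
    params = {k: copy.deepcopy(v) for k, v in estimator.get_params().items()}
    return type(estimator)(**params)


class Scaler:
    """Toy estimator following the get_params/fit convention."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def get_params(self):
        return {"factor": self.factor}

    def fit(self, X):
        # Learned state is stored in an attribute ending with an underscore.
        self.mean_ = sum(X) / len(X)
        return self


fitted = Scaler(factor=2.0).fit([1.0, 2.0, 3.0])
fresh = simple_clone(fitted)
print(fresh.factor, hasattr(fresh, "mean_"))
```

The copy keeps the constructor parameter but none of the fitted state,
which is the behaviour an external project would want to be able to
override.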


I don't see what's hard about comparing models from outside scikit-learn,
on the assumption that all the packages worth comparing are trivial to
install, and listed in scikit-learn's "Extension Library".
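
As a dependency-free sketch of why this should be cheap: any object with
fit and score can sit in the same comparison loop, wherever it was
installed from. The two classes below are toy stand-ins with hypothetical
names, not real scikit-learn or third-party estimators.

```python
class MajorityClassifier:
    """Stand-in for an estimator shipped with the main library."""

    def fit(self, X, y):
        # Remember the most frequent training label.
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

    def score(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


class ThresholdClassifier:
    """Stand-in for an estimator from a separately installed package."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [int(x[0] > self.threshold) for x in X]

    def score(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


X_train, y_train = [[0.1], [0.2], [0.9]], [0, 0, 1]
X_test, y_test = [[0.3], [0.8], [0.7]], [0, 1, 1]

# The loop only relies on the shared method names, not on where each
# class was defined.
results = {}
for model in (MajorityClassifier(), ThresholdClassifier(threshold=0.5)):
    fitted = model.fit(X_train, y_train)
    results[type(model).__name__] = fitted.score(X_test, y_test)
print(results)
```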

On 4 December 2014 at 12:01, Satrajit Ghosh <[email protected]> wrote:

> hi gael and joel,
>
> i'll insert a short response here. i actually agree with all the things
> both of you said. i will however comment on two things:
>
> 1. algorithmic scenarios:
>
> a. adding algorithms that can be built directly on top of the scikit-learn api
> b. adding algorithms that require refactoring some, but not all, underlying
> pieces.
>
> in case a), i could simply have a python script, i don't need a fork, but
> in case b), i need a fork.
>
> 2. i love decentralization, but the current architecture doesn't allow me
> to do the very simple use-case. i want to compare models in scikit-learn to
> models outside scikit-learn. what's nice about the api is that it makes
> comparing models easy, i can search over various models. however, if i have
> to install or merge 5 different scikit-learn forks to be able to compare
> those algorithms that are not in scikit-learn, that becomes expensive. if i
> could do this in an easier manner, i wouldn't really ask for a common
> bleeding repo.
>
> cheers,
>
> satra
>
> On Wed, Dec 3, 2014 at 6:55 PM, Joel Nothman <[email protected]>
> wrote:
>
>> While anything is better than publishing an extended fork of the main
>> repository, I would like to see someone cite an instance where an
>> open-slather contrib repository has been particularly successful
>> (especially one where diverse contributions are assured). In line with
>> Gaël's experience of sandbox coderot, I think it provides very little
>> benefit over distributed open-source repositories.
>>
>> For example, let's say someone has implemented an algorithm (Affinity
>> Propagation is what triggered this discussion so you might consider that).
>> Someone else wants to come and add features to it, or even just clean the
>> code, but by this time the original contributor has moved onto greener
>> pastures and is not interested in responding to a pull request. Who has the
>> right, and who the responsibility, to say that this change should be
>> allowed? Does the contrib repository, too, require an army of maintainers
>> to familiarise themselves with a vast collection of moderate-quality code?
>> Without strict gatekeepers, a centralised repository provides almost
>> nothing, and with strict gatekeepers it entails exactly the issue that we
>> are trying to solve.
>>
>> The model of a distributed plugin library (think Django) seems much more
>> successful when diversity and changing/variant needs are inevitable. Each
>> contribution is published individually on PyPI and/or open-source hosting,
>> and someone curates or facilitates a centralised library (like
>> djangopackages.com). When a contributor doesn't want to maintain
>> anymore, the project is forked; and the fittest survive.
>>
>> At the same time, scikit-learn is already trying to facilitate external
>> contributions:
>>
>>    - it is working towards an estimator verification API
>>    <https://github.com/scikit-learn/scikit-learn/issues/3810> so that it
>>    is easy to test that externally-contributed estimators conform to many
>>    scikit-learn API standards. Contributions to developing this are welcome!
>>    - Gaël has commissioned a sphinx plugin
>>    <https://github.com/sphinx-gallery/sphinx-gallery> to make it easy
>>    for projects to build documentation by example as in scikit-learn's
>>    example gallery <http://scikit-learn.org/stable/auto_examples/>. Perhaps this
>>    could facilitate also displaying external examples in the contrib library
>>    (but only if someone is willing to code up such a feature!).
>>
>> Making a template repository that people can clone to get started writing
>> an external package might be a nice extension of these ideas. Another idea
>> would be to have a conventional prefix for packages that extend
>> scikit-learn (just as django packages tend to be prefixed in PyPI by
>> django-).
>>
>> Still, I think facilitating the construction of and access to external
>> projects will be much more wieldy than a centralised contribs repo, and may
>> even streamline contribution back to the main repository.
>>
>> On 4 December 2014 at 03:18, Gael Varoquaux <
>> [email protected]> wrote:
>>
>>> On Wed, Dec 03, 2014 at 09:56:55AM -0500, Satrajit Ghosh wrote:
>>> > - let the community (to put zero additional burden on the current
>>> > maintainers) maintain a fork of scikit-learn that provides no guarantees
>>> > other than it is kept up to date with scikit-learn/master.
>>>
>>> The problem with this is that we are still going to have our tracker
>>> filled with problems that are related to the fork, and not master. To put
>>> things in perspective, our tracker has 336 issues open, and 1318 closed.
>>> Just keeping track of those issues is very hard.
>>>
>>> Thus the need for a different repo (eg scikit-learn-contrib, as suggested
>>> by Mathieu).
>>>
>>> > - people are welcome to add any algorithms to this (trivial, non-trivial,
>>> > recent)
>>>
>>> What you are suggesting is very similar to things that have been tried as
>>> a 'sandbox' for instance in scipy. Experience has shown that the code
>>> rots, because nobody feels responsible for the code. It's been tried, it
>>> fails, but if you feel like doing it, you should go ahead. Do you need
>>> anything from us?
>>>
>>> I would believe more in separate repos in a 'scikit-learn-contrib' github
>>> organization, because it would give a feeling of responsibility to the
>>> different owners of the repos.
>>>
>>> > - folks don't have to recreate packaging
>>>
>>> I don't understand: if there are releases, and packaging, someone has to
>>> do it. It doesn't happen just like this. It's actually a lot of work.
>>>
>>> If it's just a fork, without any releases, what's the gain? In addition,
>>> if somebody is not doing the work of making sure that it builds and runs
>>> on various platforms, quite quickly it will stop working on different
>>> versions of Python and different platforms.
>>>
>>> > - it brings all the folks who are forking anyway together instead of
>>> > splitting off into forks (multiple forks are harder to use)
>>>
>>> But someone has to be making the merges :). So the work is there.
>>>
>>> > - it makes for increased availability of algorithms that may be useful in
>>> > practice but never make it out because the world is biased towards
>>> > loudspeakers
>>>
>>> Probably, provided that the project actually flies. But I really fear
>>> coderot. The amount of work to keep the scikit-learn project going is
>>> just huge. If nobody is doing this work, coderot would come in very
>>> quickly.
>>>
>>> > - it doesn't add anything to the current maintainers' plates, nor take away
>>> > anything from the main project. perhaps those wishing to add things will
>>> > take it upon themselves to maintain this fork.
>>>
>>> As long as it is called differently, and _has a different import name_.
>>> If not, I can easily foresee the situation where users are complaining
>>> about scikit-learn and after a long debugging session we find that they
>>> are running some weird fork.
>>>
>>>
>>> I think that there is something flawed in the way you see the life of a
>>> project like scikit-learn. You seem to think that it is just an
>>> accumulation of code. That putting code together is enough to make a
>>> project successful. But if that's the case, why don't you just create
>>> something else, just anything else, and accumulate code? More
>>> importantly, why do you want algorithms in scikit-learn? Why aren't you
>>> happy with just code on the Internet that you can download? If you ask
>>> yourself these questions, you will probably find where the value of
>>> scikit-learn lies, and this will also tell you why there is a huge effort
>>> in maintaining scikit-learn.
>>>
>>>
>>> Things like this, eg sandboxes where there is no feeling of belonging to
>>> a global project and no harmonizing effort, have been tried in the past.
>>> They fail because of coderot. Actually, to put a historical perspective,
>>> a long time ago, there was a scipy 'sandbox', in the scipy SVN. It didn't
>>> have much working, mostly dead code. We hypothesized that it was because
>>> of lack of visibility, so the 'sandbox' was cleaned, separated in some
>>> structure, and renamed 'scikits'. Scikits weren't getting much traction
>>> inside the scipy codebase, because people were having a hard time working
>>> there (back then it was an SVN, but there was also the problem of
>>> compiling scipy, which is a bit hard). So we started pulling things out
>>> of the SVN. And that's how the current scikits were born. Some of these
>>> scikits took off, because they had a clear project management: releases,
>>> documentation, quality.
>>>
>>> It's interesting that almost ten years later, we are running into the same
>>> problems. I think that this is not by chance. The reasons that these
>>> evolutions happen are the following:
>>>
>>> 1. Projects are non-linearly hard to evolve. Bigger projects are
>>>    significantly harder to drive than small projects. This is a very, very
>>>    true law of project management and is really underestimated by too many
>>>    [1].
>>>
>>> 2. People want different things, and that's perfectly legitimate. The
>>>    statsmodels guys wanted control on p-values. The scikit-learn guys
>>>    wanted good prediction. Both usecases are valid (I am an avid user of
>>>    statsmodels), but doing both in the same project was much, much harder
>>>    than doing two projects.
>>>
>>> Thus I think that it is natural that some ecosystem of different
>>> projects, from general to specific, shapes up. Yes, it's very important to
>>> keep in mind the big picture, and that people with close enough goals unite,
>>> but only in balance with point 1.
>>>
>>> By the way, I care very much about the ecosystem. When we split off HMMs,
>>> I spent half a day making them a separate package, with setup.py, travis,
>>> a README, examples, documentation:
>>> https://github.com/hmmlearn
>>> It did take a good 4 hours. Nothing happens for free. I did this even
>>> though I do not use HMMs at all.
>>>
>>>
>>> In terms of action points, to summarize my position:
>>>
>>> - You are free to create a fork. I strongly ask that you change the
>>>   import name; otherwise you will be putting a burden on the main
>>>   scikit-learn maintainers.
>>>
>>> - What I think could work would be a scikit-learn-contrib organization with
>>>   different repositories in it. I see that Mathieu and Andy have the same
>>>   feeling. I think we all agree that it should be done. I am ready to
>>>   create the organization, and give you (and many others) the keys of the
>>>   kingdom.
>>>
>>> Gaël
>>>
>>>
>>> [1] This has actually been studied. Here is one paper (out of probably
>>>     many): http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1702600
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
>>> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
>>> with Interactivity, Sharing, Native Excel Exports, App Integration & more
>>> Get technology previously reserved for billion-dollar corporations, FREE
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=164703151&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>
>
