Re: [Scikit-learn-general] scikit-learn Truck Factor

Andreas Mueller Wed, 12 Aug 2015 10:38:50 -0700

I disagree.
Pandas has a truck-factor of one, Jeff Reback.

My impression is that Wes did not catch up with the current codebase and would therefore not be the ideal maintainer any more.

If there is no "hand-over" of the arcane knowledge between generations, a project will die.

Numpy has a truck-factor between two and four I think.



On 08/12/2015 09:57 AM, [email protected] wrote:

On Wed, Aug 12, 2015 at 9:45 AM, Joel Nothman <[email protected]> wrote:


    I find that list somewhat obscure, and reading your section on
    Code Authorship gives me some sense of why. All of those people
    have been very important contributors to the project, and I'd
    think the absence of Gaël, Andreas and Olivier alone would be very
    damaging, if only because of their dedication to the collaborative
    maintenance involved. Yet despite his top score Fabian has not
    actively contributed for years and would be quite unfamiliar with
    many of the files he created, while I think Mathieu Blondel
    and Alexandre Gramfort, for example, would provide substantial
    code coverage without those seven (although they may not be
    interested in the maintenance).

    I feel the approach is problematic because of the weight it puts
    on "number of commits" (if that's how I should interpret "the
    number of changes made in f by D"). Apart from the susceptibility
    of this measure to individual author preferences, the project in
    its infancy favoured small commits (because the team was small), but
    more recently has preferred large contributions, and has
    frequently squashed contributions with large commit histories into
    single commits.

    Have you considered measures of "number of deliveries" apart from
    number of commits? While counting lines of code presents other
    problems, the number of months in which a user committed changes
    to a file might be a more realistic representation.
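The "months in which a user committed changes to a file" measure could be sketched as follows. This is only an illustration, assuming the history is already available as (author, path, date) records; the function name `active_months` and the sample data are hypothetical, not part of the study's tooling.

```python
from collections import defaultdict
from datetime import date


def active_months(commits):
    """Count, per (path, author), the distinct calendar months with a commit."""
    months = defaultdict(set)
    for author, path, day in commits:
        # Many commits in one month collapse into a single (year, month) entry.
        months[(path, author)].add((day.year, day.month))
    return {key: len(value) for key, value in months.items()}


# Hypothetical sample history, for illustration only.
sample = [
    ("ann", "a.py", date(2015, 1, 3)),
    ("ann", "a.py", date(2015, 1, 20)),
    ("ann", "a.py", date(2015, 3, 1)),
    ("bob", "a.py", date(2015, 1, 5)),
]
result = active_months(sample)
```

Because several commits in the same month count once, this measure is less sensitive to squashing and to individual commit-size habits than a raw commit count.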

    A number of factors attenuate developer loss: documentation and
    overall code quality; fairly open and wide contribution, with
    regular in-person interaction for a large number of contributors;
    GSoC and other project-based involvement, through which new
    contributors become very familiar with parts of the code; and the
    standardness of the algorithms implemented in scikit-learn,
    meaning they can be maintained on the basis of reference works (a
    broader documentation).

As an extreme example, pydata/pandas has a truck factor of one. But the one has already been "hit".

I think the truck factor can be very misleading for projects in the second generation. For old projects like scipy or numpy (which I didn't see), the truck factor might be quite large and take turnover into account, even if the short run truck factor is relatively low.


Josef


    On 12 August 2015 at 22:57, Guilherme Avelino <[email protected]>
    wrote:

        As part of my PhD research on code authorship, we calculated
        the Truck Factor (TF) of some popular GitHub repositories.

        As you probably know, the Truck (or Bus) Factor designates the
        minimal number of developers that have to be hit by a truck
        (or quit) before a project is incapacitated. In our work, we
        consider that a system is in trouble if more than 50% of its
        files become orphan (i.e., without a main author).
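A rough sketch of the 50%-orphan criterion described above, assuming each file maps to a set of main authors and that authors are removed greedily by number of files covered. The greedy order, the function name `truck_factor`, and the toy data are assumptions made for illustration; the preprint's actual authorship measure and heuristic may differ.

```python
from collections import Counter


def truck_factor(file_authors, threshold=0.5):
    """Greedy estimate: remove the author covering the most files until
    more than `threshold` of the files have no remaining main author."""
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    total = len(remaining)
    removed = []
    while total:
        orphans = sum(1 for authors in remaining.values() if not authors)
        if orphans / total > threshold:
            break
        counts = Counter(a for authors in remaining.values() for a in authors)
        if not counts:
            break  # nobody left to remove
        # Ties are broken by name so the result is deterministic.
        top, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        removed.append(top)
        for authors in remaining.values():
            authors.discard(top)
    return len(removed), removed


# Hypothetical toy repository, for illustration only.
toy = {
    "a.py": {"ann"},
    "b.py": {"ann"},
    "c.py": {"bob"},
    "d.py": {"ann", "bob"},
}
tf, hit = truck_factor(toy)
```

In the toy data, removing "ann" orphans exactly half of the files, which does not exceed the threshold, so "bob" must also be removed before the project counts as in trouble.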

        More details on our work in this preprint:
        https://peerj.com/preprints/1233

        We calculated the TF for scikit-learn and obtained a value of 7.

        The developers responsible for this TF are:

        Fabian Pedregosa - author of 22% of the files
        Gael Varoquaux - author of 13% of the files
        Andreas Mueller - author of 12% of the files
        Olivier Grisel - author of 10% of the files
        Lars Buitinck - author of 10% of the files
        Jake Vanderplas - author of 6% of the files
        Vlad Niculae - author of 5% of the files

        To validate our results, we would like to ask scikit-learn
        developers the following three brief questions:

        (a) Do you agree that the listed developers are the main
        developers of scikit-learn?

        (b) Do you agree that scikit-learn will be in trouble if the
        listed developers leave the project (e.g., if they win the
        lottery, to be less morbid)?

        (c) Does scikit-learn have some characteristics that would
        attenuate the loss of the listed developers (e.g., detailed
        documentation)?

        Thanks in advance for your collaboration,

        Guilherme Avelino
        PhD Student
        Applied Software Engineering Group (ASERG)
        UFMG, Brazil
        http://aserg.labsoft.dcc.ufmg.br/

--Prof. Guilherme Amaral Avelino

        Universidade Federal do Piauí
        Departamento de Computação

        
------------------------------------------------------------------------------

        _______________________________________________
        Scikit-learn-general mailing list
        [email protected]
        <mailto:[email protected]>
        https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



    
