If something like a "truck factor" really needs to be computed, I think it would be more meaningful to quantify how essential each contribution is: for instance, how complex a particular function/class/method is, how often it gets used/called by other modules, and how important it is to the user base. But here, too, one needs to pay attention to detail. A trivial example: the iris dataset is probably loaded more often than the SVM implementation(s) are used, yet the latter is arguably a more important part of scikit-learn. Also, scikit-learn is the effort of many great people in the core team, and I would find it unfair to weigh their importance against each other.
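To make that concrete, here is a toy sketch of what such a weighting could look like. All names and numbers below are invented for illustration; in practice "usage" might come from import or call counts and "centrality" from call-graph analysis:

```python
# Toy sketch (invented data): rank components by a combination of how
# often they are used and how central they are, rather than by commits.

components = {
    # name: (usage_count, centrality_score) -- both made up here
    "datasets.load_iris": (5000, 0.1),  # loaded very often, but trivial
    "svm.SVC":            (2000, 0.9),  # used less often, algorithmically core
}

def importance(usage, centrality):
    """Combine usage and centrality; this particular weighting is arbitrary."""
    return usage * centrality

ranked = sorted(components.items(),
                key=lambda kv: importance(*kv[1]),
                reverse=True)
for name, (usage, centrality) in ranked:
    print(name, importance(usage, centrality))
```

Under this (arbitrary) weighting, the SVM implementation outranks the far more frequently loaded iris dataset, which matches the intuition above.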
Another example: when I was working on a contribution to scikit-learn, it was probably me who committed 95% of the code. However, the implementation owed at least as much to the core developers' careful reviews, great ideas, and insightful suggestions. In short: although I was the one committing the code, it was really the core team's effort, so I wouldn't find it fair to quantify importance by the number of commits or lines of code.

Best,
Sebastian

> On Aug 12, 2015, at 9:45 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> I find that list somewhat obscure, and reading your section on Code
> Authorship gives me some sense of why. All of those people have been very
> important contributors to the project, and I'd think the absence of Gaël,
> Andreas and Olivier alone would be very damaging, if only because of their
> dedication to the collaborative maintenance involved. Yet despite his top
> score, Fabian has not actively contributed for years and would be quite
> unfamiliar with many of the files he created, while I think Mathieu Blondel
> and Alexandre Gramfort, for example, would provide substantial code coverage
> without those seven (although they may not be interested in the maintenance).
>
> I feel the approach is problematic because of the weight it puts on "number
> of commits" (if that's how I should interpret "the number of changes made in
> f by D"). Apart from the susceptibility of this measure to individual author
> preferences, the project in its infancy favoured small commits (because the
> team was small), but more recently has preferred large contributions, and has
> frequently squashed contributions with large commit histories into single
> commits.
>
> Have you considered measures of "number of deliveries" other than the number
> of commits? While counting lines of code presents other problems, the number
> of months in which a user committed changes to a file might be a more
> realistic representation.
>
> A number of factors attenuate developer loss: documentation and overall code
> quality; fairly open and wide contribution, with regular in-person
> interaction for a large number of contributors; GSoC and other project-based
> involvement, through which new contributors become very familiar with parts
> of the code; and the standardness of the algorithms implemented in
> scikit-learn, meaning they can be maintained on the basis of reference works
> (a broader documentation).
>
> On 12 August 2015 at 22:57, Guilherme Avelino <gavel...@gmail.com> wrote:
> As part of my PhD research on code authorship, we calculated the Truck Factor
> (TF) of some popular GitHub repositories.
>
> As you probably know, the Truck (or Bus) Factor designates the minimal number
> of developers that have to be hit by a truck (or quit) before a project is
> incapacitated. In our work, we consider that a system is in trouble if more
> than 50% of its files become orphan (i.e., without a main author).
>
> More details on our work are in this preprint: https://peerj.com/preprints/1233
>
> We calculated the TF for scikit-learn and obtained a value of 7.
>
> The developers responsible for this TF are:
>
> Fabian Pedregosa - author of 22% of the files
> Gaël Varoquaux - author of 13% of the files
> Andreas Mueller - author of 12% of the files
> Olivier Grisel - author of 10% of the files
> Lars Buitinck - author of 10% of the files
> Jake Vanderplas - author of 6% of the files
> Vlad Niculae - author of 5% of the files
>
> To validate our results, we would like to ask scikit-learn developers the
> following three brief questions:
>
> (a) Do you agree that the listed developers are the main developers of
> scikit-learn?
>
> (b) Do you agree that scikit-learn will be in trouble if the listed
> developers leave the project (e.g., if they win the lottery, to be less
> morbid)?
>
> (c) Does scikit-learn have some characteristics that would attenuate the loss
> of the listed developers (e.g., detailed documentation)?
>
> Thanks in advance for your collaboration,
>
> Guilherme Avelino
> PhD Student
> Applied Software Engineering Group (ASERG)
> UFMG, Brazil
> http://aserg.labsoft.dcc.ufmg.br/
>
> --
> Prof. Guilherme Amaral Avelino
> Universidade Federal do Piauí
> Departamento de Computação
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general