If something like a "truck factor" really needs to be computed, I think it would be more meaningful to quantify how essential each contribution is: for instance, how complex a particular function/class/method is, how often it gets used/called by other modules, and how important it is to the user base. But here, too, one needs to pay attention to detail. A trivial example: the iris dataset is probably loaded more often than the SVM implementation(s) are used, yet the latter is arguably a more important part of scikit-learn. Also, scikit-learn is the effort of many great people in the core team, and I would find it unfair to weigh their importance against each other.
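To make that concrete, here is a toy sketch of what such a weighting could look like. All names and numbers below are invented for illustration; in practice "usage" might come from import or call counts and "centrality" from call-graph analysis:

```python
# Toy sketch (invented data): rank components by a combination of how
# often they are used and how central they are, rather than by commits.

components = {
    # name: (usage_count, centrality_score) -- both made up here
    "datasets.load_iris": (5000, 0.1),  # loaded very often, but trivial
    "svm.SVC":            (2000, 0.9),  # used less often, algorithmically core
}

def importance(usage, centrality):
    """Combine usage and centrality; this particular weighting is arbitrary."""
    return usage * centrality

ranked = sorted(components.items(),
                key=lambda kv: importance(*kv[1]),
                reverse=True)
for name, (usage, centrality) in ranked:
    print(name, importance(usage, centrality))
```

Under this (arbitrary) weighting, the SVM implementation outranks the far more frequently loaded iris dataset, which matches the intuition above.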
Another example: when I was working on a contribution to scikit-learn, it was probably me who committed 95% of the code. However, the implementation owed at least as much to the core developers' careful reviews, great ideas, and insightful suggestions. In short: although I was the one committing the code, it was really the core team's effort, so I wouldn't find it fair to quantify importance by the number of commits or lines of code.

Best,
Sebastian

> On Aug 12, 2015, at 9:45 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> I find that list somewhat obscure, and reading your section on Code
> Authorship gives me some sense of why. All of those people have been very
> important contributors to the project, and I'd think the absence of Gaël,
> Andreas and Olivier alone would be very damaging, if only because of their
> dedication to the collaborative maintenance involved. Yet despite his top
> score, Fabian has not actively contributed for years and would be quite
> unfamiliar with many of the files he created, while I think Mathieu Blondel
> and Alexandre Gramfort, for example, would provide substantial code coverage
> without those seven (although they may not be interested in the maintenance).
>
> I feel the approach is problematic because of the weight it puts on "number
> of commits" (if that's how I should interpret "the number of changes made in
> f by D"). Apart from the susceptibility of this measure to individual author
> preferences, the project in its infancy favoured small commits (because the
> team was small), but more recently has preferred large contributions, and has
> frequently squashed contributions with large commit histories into single
> commits.
>
> Have you considered measures of "number of deliveries" other than the number
> of commits? While counting lines of code presents other problems, the number
> of months in which a user committed changes to a file might be a more
> realistic representation.
>
> A number of factors attenuate developer loss: documentation and overall code
> quality; fairly open and wide contribution, with regular in-person
> interaction for a large number of contributors; GSoC and other project-based
> involvement, through which new contributors become very familiar with parts
> of the code; and the standardness of the algorithms implemented in
> scikit-learn, meaning they can be maintained on the basis of reference works
> (a broader documentation).
>
> On 12 August 2015 at 22:57, Guilherme Avelino <gavel...@gmail.com> wrote:
> As part of my PhD research on code authorship, we calculated the Truck Factor
> (TF) of some popular GitHub repositories.
>
> As you probably know, the Truck (or Bus) Factor designates the minimal number
> of developers that have to be hit by a truck (or quit) before a project is
> incapacitated. In our work, we consider that a system is in trouble if more
> than 50% of its files become orphan (i.e., without a main author).
>
> More details on our work are in this preprint: https://peerj.com/preprints/1233
>
> We calculated the TF for scikit-learn and obtained a value of 7.
>
> The developers responsible for this TF are:
>
> Fabian Pedregosa - author of 22% of the files
> Gaël Varoquaux - author of 13% of the files
> Andreas Mueller - author of 12% of the files
> Olivier Grisel - author of 10% of the files
> Lars Buitinck - author of 10% of the files
> Jake Vanderplas - author of 6% of the files
> Vlad Niculae - author of 5% of the files
>
> To validate our results, we would like to ask scikit-learn developers the
> following three brief questions:
>
> (a) Do you agree that the listed developers are the main developers of
> scikit-learn?
>
> (b) Do you agree that scikit-learn will be in trouble if the listed
> developers leave the project (e.g., if they win the lottery, to be less
> morbid)?
>
> (c) Does scikit-learn have some characteristics that would attenuate the loss
> of the listed developers (e.g., detailed documentation)?
>
> Thanks in advance for your collaboration,
>
> Guilherme Avelino
> PhD Student
> Applied Software Engineering Group (ASERG)
> UFMG, Brazil
> http://aserg.labsoft.dcc.ufmg.br/
>
> --
> Prof. Guilherme Amaral Avelino
> Universidade Federal do Piauí
> Departamento de Computação
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general