[arch-dev-public] O'Reilly book: Making Software

2010-11-22 Thread Aaron Griffin
My boss just brought this book over to my desk. It's one of those
theory books that's all about measuring lines of code and whatnot.

But the reason he brought it over is Chapter 8, "Beyond Lines of
Code: Do We Need More Complex Metrics?" Two pages in, it begins with:

**Measuring the Source Code**
We have selected for our case study the ArchLinux software
distribution (http://archlinux.org), which contains thousands of
packages, all open source. ArchLinux is a lightweight GNU/Linux
distribution whose maintainers refuse to modify the source code
packaged for the distribution, in order to meet the goal of
drastically reducing the time that elapses between the official
release of a package and its integration into the distribution.
...
Because of the size of ArchLinux, using it as a case study gives us
access to the original source code of thousands of open source
projects, through the build scripts used by ABS (see Example 8-1)

The chapter goes on with statistics and all that junk. They're not
studying Arch, but using Arch as a launch pad for getting large
amounts of open source source code for analysis. The numbers are
interesting:

The ArchLinux repositories contained 4096 packages (as of April 2010),
with some of the packages being different versions of the same
upstream project. After removing different versions, we obtained a
sample of 4015 packages, containing 1272748 source code files. Among
all those files, 576511 were written in C. However, there were
repeated files. In the overall sample, only 776573 were unique files;
in the C subsample, only 338831 were unique files. From these unique C
files, 212167 were nonheader files and 126664 were header files.


Re: [arch-dev-public] O'Reilly book: Making Software

2010-11-22 Thread Guillaume ALAUX
On 22 November 2010 19:39, Aaron Griffin aaronmgrif...@gmail.com wrote:
> My boss just brought this book over to my desk. It's one of those
> theory books that's all about measuring lines of code and whatnot.
> [...]


Great! Proud to see Arch here :) The author might also be an Archer,
who knows. BTW I also like the idea of your boss coming to give you
the book like "Oh, and there's this O'Reilly book about some of the
stuff you do".

--
Guillaume


Re: [arch-dev-public] O'Reilly book: Making Software

2010-11-22 Thread Dieter Plaetinck
On Mon, 22 Nov 2010 12:39:57 -0600
Aaron Griffin aaronmgrif...@gmail.com wrote:

> [...]
> The ArchLinux repositories contained 4096 packages (as of April 2010),
> with some of the packages being different versions of the same
> upstream project. After removing different versions, we obtained a
> sample of 4015 packages, containing 1272748 source code files. Among
> all those files, 576511 were written in C. However, there were
> repeated files. In the overall sample, only 776573 were unique files;
> in the C subsample, only 338831 were unique files. From these unique C
> files, 212167 were nonheader files and 126664 were header files.


hmmm.. 39% of all files in our packages are duplicates. That's
interesting.  Wonder where they come from.

Dieter
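
A quick sanity check of the 39% figure, using only the counts quoted from the book excerpt. The variable and function names are illustrative and do not come from the book; this is a minimal sketch of the arithmetic, nothing more:

```python
# Counts as quoted from the book excerpt in this thread.
total_files = 1272748      # all source files in the 4015-package sample
unique_files = 776573      # unique files in the overall sample
c_files = 576511           # files written in C
unique_c_files = 338831    # unique files in the C subsample
c_nonheader = 212167       # unique C nonheader files
c_header = 126664          # unique C header files


def duplicate_share(total, unique):
    """Fraction of files that repeat another file in the same sample."""
    return (total - unique) / total


overall = duplicate_share(total_files, unique_files)
c_only = duplicate_share(c_files, unique_c_files)

print(f"repeated files overall: {total_files - unique_files} ({overall:.1%})")
print(f"repeated files among C: {c_files - unique_c_files} ({c_only:.1%})")
# The header/nonheader split should add up to the unique C count.
print("header + nonheader == unique C:", c_nonheader + c_header == unique_c_files)
```

The overall share rounds to 39%, and the C subsample comes out slightly higher.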


Re: [arch-dev-public] O'Reilly book: Making Software

2010-11-22 Thread Allan McRae

On 23/11/10 07:09, Dieter Plaetinck wrote:

> hmmm.. 39% of all files in our packages are duplicates. That's
> interesting.  Wonder where they come from.



A lot of projects include sources from their dependencies inside their 
tarball.


Allan
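
One way to see bundled dependency sources for yourself is to hash every file under a directory of extracted tarballs and group identical contents. The excerpt does not say how the book's authors decided that two files were the same, so treating "byte-identical, by SHA-256" as the criterion is an assumption here, and the function names are made up for illustration. A rough sketch:

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 16):
    """SHA-256 of a file's bytes, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map each digest seen more than once under root to the paths sharing it."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Calling `find_duplicates()` on a directory holding several unpacked source trees (the location is up to you) returns groups of paths with identical contents; a copy of a library shipped inside two different tarballs would show up as one such group.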