Greetings, all,

I would like to make a rather significant enhancement to the "Active Time" metric in Hackystat this summer.  The following (long and verbose) RFC provides background on this issue and a proposal.  I am circulating it to various stakeholders to solicit feedback and suggestions.

Summary:

The current measure of Active Time provides a repeatable and consistent measure for one aspect of development time: the time spent editing files related to a project.  I propose to redefine Active Time as an umbrella term for a set of more specific types of Active Time.  The current Active Time would become "File Editing Active Time".  Other candidates include "Tool Active Time", "Command Line Active Time", and (potentially) "Browser Active Time".  When doing analysis, it will be possible to generate "unions" of these Active Time measures that provide a broader (though still partial) representation of developer effort.

Motivation:

Although it is clear that some measure of effort is essential to many kinds of insights into software development, obtaining such a measure has confounded software engineering researchers for decades.  The main approaches to measuring effort involve: (1) self-reporting, using logs or online tools; (2) manual measurement via an external researcher who watches developers; or (3) automated measurement via instrumentation of tools.  All of these approaches have significant problems.

An essential problem with self-reported data is that it tends to be inaccurate.  Cook and Campbell (1979) have pointed out that subjects (a) tend to report what they believe the researcher expects to see, or (b) report what reflects positively on their own abilities, knowledge, beliefs, or opinions [1].  Schacter warned that human memory is fallible and thus the reliability of self-reported data is tenuous [2].  Austin (1996) describes the wide range of organizational factors that can distort self-reported data [3].

An essential problem with manual measurement via an external researcher is that it tends to be expensive: one must have the resources to allow a researcher to spend substantial amounts of time simply observing the developers at work.  While this approach tends to produce the most comprehensive and accurate data, it cannot scale to large numbers of developers at large numbers of sites.

An essential problem with automated measurement is that it tends to be partial: it measures only a subset of what the developers actually do.  Hackystat in its current form exemplifies this issue: the Active Time measure captures only the time spent by developers editing files associated with a project.  Measurement infrastructure such as the PERCS program at the University of Pittsburgh is partial in a different way: while it can obtain a reasonably complete picture of developer activity in the instrumented lab setting, it requires developers to be in that lab and to be working on only one task during that time.

Another way to think about the issue of developer effort measurement is in terms of internal and external validity.  "Internal validity" basically means that the value computed by the measure is accurate--it reflects the measure's definition in various contexts.  "External validity" means that the measure is useful--in our context, that it can be applied to compute higher level abstractions like "productivity".

For example, assume we ask developers to keep logs of "all the time they spend working on a project".  This would result in a measure with high external validity; in other words, this definition of "effort" is obviously useful for determining things like productivity.  Unfortunately, it tends to have very low internal validity.  For one thing, there is a lot of ambiguity in this definition, such that different developers may make different choices as to how to record "effort".  This kind of effort measure is also quite susceptible to measurement dysfunction---there are a lot of situations in which a person may feel the need to distort such a value due to the organizational context in which they work.  It's also quite hard to verify without an independent observer--how do you check whether or not the values are accurate?  Finally, from significant personal and professorial experience, it's just darn difficult to keep doing for any length of time with a reasonable level of accuracy.

On the other hand, Active Time has very high internal validity---assuming we implement the sensor right, it will collect Active Time the same way for any developer using any editor with an Active Time sensor (currently Eclipse, Emacs, Vim, Visual Studio, and JBuilder).  It also allows multi-tasking: developers can work on several projects in any given period of time, and as long as the project files can be described in terms of directories, the Active Time measure can "credit" effort to the correct project(s).  Since the developer doesn't have any involvement in metric collection (other than editing files), there is little ambiguity, the probability of measurement dysfunction is reduced, and, as someone who has three years of Active Time data and counting, it's quite painless for the subject.  Of course, the big question is external validity---to what extent and in what contexts is this explicitly partial measure of overall effort still useful for determining things like 'productivity'?

Of course, you don't necessarily have to choose one or the other.  Consider a hybrid approach: you collect (possibly with the aid of an external observer) a reasonably accurate "effort" log for _short_ but representative periods of time in the development contexts of interest.  You also implement Active Time sensors.  Then you perform a calibration to essentially determine what portion of 'overall' effort the Active Time sensor is capturing.  For the unobserved periods, Active Time can now serve as a proxy.
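
As a toy illustration of the calibration arithmetic (the numbers and method names below are made up for the example):

    // Toy calibration sketch.  If an observed week logged 40 hours of overall
    // effort while the sensors recorded 25 hours of Active Time, the
    // calibration factor is 40 / 25 = 1.6, so an unobserved week with 20
    // hours of Active Time yields a proxy estimate of 1.6 * 20 = 32 hours.
    public class EffortProxy {
      public static double calibrationFactor(double observedEffortHours,
                                             double activeTimeHours) {
        return observedEffortHours / activeTimeHours;
      }
      public static double estimatedEffort(double factor, double activeTimeHours) {
        return factor * activeTimeHours;
      }
      public static void main(String[] args) {
        double factor = calibrationFactor(40.0, 25.0);      // 1.6
        System.out.println(estimatedEffort(factor, 20.0));  // 32.0
      }
    }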

Proposal: Active Time, 2.0

The essential goals for Active Time 2.0 are:

(1) Preserve internal validity.  In other words, Active Time 2.0 should not require developers to manually record effort, and it should collect data in the same way for any developer.

(2) Preserve multi-tasking.  In other words, Active Time 2.0 should allow the developer to work on several different projects during a given time period and have Active Time "credited" to the correct project(s).

(3) Increase external validity.  In other words, Active Time 2.0 should provide a more comprehensive measure for developer effort that is more useful for determining higher level abstractions like "productivity".

To accomplish this, "Active Time" in the next version will become an umbrella term for a collection of more specific measures.  The original Active Time measure will be renamed "File Editing Active Time" to more precisely indicate the kind of developer effort being measured.  Hackystat currently implements a measure called "Review Active Time", which is the time spent reviewing files in the Jupiter Code Review plug-in to Eclipse.  Another Active Time flavor is Command Line Active Time, which represents the time spent invoking commands at the command line in a subdirectory associated with a Project.  Tool Active Time represents the time spent inside a tool but not necessarily editing a file; there must be some way to associate that usage with a Project, however.  A final candidate is Browser Active Time, which would capture the time spent in a browser, again as long as the time spent can be reliably associated with the Project.  (I don't yet understand how, or whether, we can satisfy that constraint.)

We can create "composite" Active Time measures by combining individual Active Times, which can paint a more comprehensive picture of the time developers spend using tools on project-related activities.  An additional advantage of this new approach is extensibility: as we discover new ways to define and represent "Active" time, we can more easily integrate them into our analysis framework.
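
One subtlety in composing flavors is double-counting: a minute spent editing a file inside a tool could show up in both File Editing Active Time and Tool Active Time.  Here is a minimal sketch of the union computation, under my assumption (not necessarily Hackystat's actual representation) that each flavor reports half-open [start, end) intervals in milliseconds:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of a composite Active Time computed as the union of intervals
    // from several Active Time flavors, so overlapping intervals count once.
    public class CompositeActiveTime {

      /** A half-open [start, end) interval in milliseconds. */
      public static class Interval {
        final long start, end;
        Interval(long start, long end) { this.start = start; this.end = end; }
      }

      /** Total duration of the union of the intervals, overlaps counted once. */
      public static long unionDuration(List<Interval> intervals) {
        List<Interval> sorted = new ArrayList<Interval>(intervals);
        Collections.sort(sorted, new Comparator<Interval>() {
          public int compare(Interval a, Interval b) {
            return Long.compare(a.start, b.start);
          }
        });
        long total = 0;
        long runStart = Long.MIN_VALUE;
        long runEnd = Long.MIN_VALUE;
        for (Interval i : sorted) {
          if (i.start > runEnd) {             // Disjoint: close out the run.
            total += runEnd - runStart;
            runStart = i.start;
            runEnd = i.end;
          } else {                            // Overlapping: extend the run.
            runEnd = Math.max(runEnd, i.end);
          }
        }
        return total + (runEnd - runStart);   // Add the final run.
      }
    }

A simple sum of the individual Active Times would overstate the composite whenever flavors overlap; taking the union keeps the composite measure internally valid.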

Your thoughts are welcome, as are pointers to additional literature related to effort measurement.

Literature Cited

[1] Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin.

[2] Schacter, D. L. (1999). The seven sins of memory: Insights from psychology and cognitive neuroscience. American Psychologist, 54, 182-203.

[3] Austin, R. D. (1996). Measuring and managing performance in organizations. New York: Dorset House.
