Greetings, all,
I've been studiously avoiding any Big Thoughts until after 7.0 was out the door, but now
the shackles have been thrown off!
So. I've been thinking about Projects. First off, let's recall that Projects are a way
of defining related sets of raw sensor data in a Hackystat repository. We currently
define a related set of raw sensor data with an implicit "AND" of three conditions: (1) a
set of developers (who must confirm membership); (2) a time interval (within which the
sensor data must have been received); and (3) a set of Workspaces (which provide a
"location" for the sensor data).
Let's also recall that Workspaces serve a very honorable purpose in Hackystat: they allow
groups of developers to work together on different platforms with different installations
of source directories and have the system be able to tell when developers are working on
the same file. There's nothing wrong with Workspaces per se.
There is, however, a problem with the way we define Projects as the AND of (1), (2), and
(3). The problem is that while this approach worked fine in the beginning when we had
relatively simple forms of raw sensor data, we are increasingly running into more
complicated kinds of sensor data. Two examples:
(1) The famous Unit Test sensor data problem. When running unit tests from a jar or
binary distribution, we no longer know the source directory that the code came from, so
we no longer have a Workspace. The solution was Workspace maps, which have been found to
be (a) brittle, and (b) complex. Currently, for example, someone sending Unit Test data
can't get that data associated with a Project unless they run a size counter! That
totally sucks.
(2) The less famous BrowserURL sensor data problem. Some folks have wanted a sensor for
their browser that could record when they were looking at documentation. While one could
imagine a sensor data type with "URL" as a required field, it is not at all clear how to
transmogrify that into a Workspace so that the data could be associated with a Project.
In the past, we've toyed with solutions involving specifying the project name on the
client side and sending it along with the raw data. That has proven to be a very bad
solution. For example, it does not allow that sensor data to be associated with any other
projects that might be defined in the future.
At an abstract level, what our current Project definition mechanism does is create a
"Project Data Membership Expression" of something like the following:
  (and
    (or (sensor-data-owner = "[EMAIL PROTECTED]")
        (sensor-data-owner = "[EMAIL PROTECTED]"))
    (sensor-data-start-date = "10-Nov-2005")
    (sensor-data-end-date = "undefined")
    (or (sensor-data-workspace = "hackyCore_Build")
        (sensor-data-workspace = "hackyCore_Kernel")))
Abstractly, each sensor data record in the repository is tested against that expression,
and if the expression evaluates to true, then that sensor data is part of that project.
Of course, we are smart about the way we "evaluate" this expression so that we don't
actually traverse the entire repository!
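To make the current rule concrete, here is a minimal Python sketch of the implicit AND. The class and field names (SensorData, Project, is_member) are invented for illustration and are not the actual Hackystat classes; it also evaluates one record at a time, which the real system avoids doing over the whole repository.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SensorData:
    owner: str
    timestamp: date
    workspace: Optional[str]  # None when no Workspace can be determined


@dataclass
class Project:
    members: set              # confirmed member email addresses
    start: date
    end: Optional[date]       # None stands for "undefined" (open-ended)
    workspaces: set


def is_member(data: SensorData, project: Project) -> bool:
    """Current rule: owner AND time interval AND workspace must all match."""
    owner_ok = data.owner in project.members
    time_ok = project.start <= data.timestamp and (
        project.end is None or data.timestamp <= project.end
    )
    workspace_ok = data.workspace in project.workspaces
    return owner_ok and time_ok and workspace_ok
```

Note that an entry with no Workspace (such as Unit Test data run from a jar) can never satisfy this test, which is exactly the problem described above.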
What I'm proposing is to enhance the Project definition mechanism with the ability to
define "Membership Expressions" that would enable us to indicate that a given piece of
sensor data should be considered part of a Project using properties of the sensor data
entry other than its owner, timestamp, and workspace. Given the right set of operators,
we should be able to provide a simple, yet expressive way of associating sensor data with
a Project that overcomes our current problems. My idea would be to retain the current
member definition approach (since we need to do the whole confirmation email routine),
retain the start/end specification (since that's the nicest way to do it), make workspace
selection _optional_, and then add a textarea in which someone could type in a "Project
Data Membership Expression" (very similar to the "Expert" telemetry analysis mode).
There is an implicit "OR" between the Workspace and PDME fields--if the sensor data
satisfies the Workspace test, it's in regardless of whether it satisfies the PDME test.
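A rough sketch of how the combined test might look, in Python. It assumes that the member and interval checks stay ANDed as they are today and that only the Workspace and PDME checks are ORed. The dictionary keys and the in_project name are hypothetical, and the PDME is modelled simply as an optional function over one sensor data entry.

```python
from datetime import date


def in_project(data, project):
    """Proposed rule: member AND interval AND (workspace OR PDME)."""
    owner_ok = data["owner"] in project["members"]
    time_ok = project["start"] <= data["timestamp"] and (
        project["end"] is None or data["timestamp"] <= project["end"]
    )
    # Workspace selection is optional, so default to an empty set.
    workspace_ok = data.get("workspace") in project.get("workspaces", set())
    # The PDME is optional too; when absent it simply never matches.
    pdme = project.get("pdme")
    pdme_ok = pdme is not None and pdme(data)
    return owner_ok and time_ok and (workspace_ok or pdme_ok)
```

With this shape, a Project that defines only Workspaces behaves exactly as it does now, and a Project that defines only a PDME needs no Workspaces at all.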
So, for example, how would this approach solve the famous Unit Test sensor data problem?
Well, for the case of the Hackystat project, we could supply the following expression:
  (and (isSensorDataType("UnitTest"))
       (fieldStartsWith("classname", "org.hackystat")))
The syntax probably needs some work, but the basic idea is that we have an operator
called "isSensorDataType" which evaluates to true if the data item is of that type, and
another called "fieldStartsWith" that takes two arguments: the name of the field, and the
prefix that the field's value must start with.
I claim this solves the problem of Unit Tests by stating that a unit test sensor data
entry is part of the Hackystat project if it contains a field (either required or
optional) called "classname" and if its String value has the prefix "org.hackystat".
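One way the two operators could behave, sketched in Python as functions that build predicates. This is only an illustration: the entry layout (a dict with "type" and "fields" keys) and the helper names is_sensor_data_type, field_starts_with, and all_of are assumptions, not an existing Hackystat API.

```python
def is_sensor_data_type(type_name):
    """True when the entry's sensor data type equals type_name."""
    return lambda entry: entry.get("type") == type_name


def field_starts_with(field, prefix):
    """True when the entry has the named field and its string value has the prefix."""
    def test(entry):
        value = entry.get("fields", {}).get(field)
        # A missing or non-string field never matches.
        return isinstance(value, str) and value.startswith(prefix)
    return test


def all_of(*predicates):
    """Plays the role of 'and' in the expression syntax."""
    return lambda entry: all(p(entry) for p in predicates)


# The Unit Test expression for the Hackystat project.
hackystat_unit_tests = all_of(
    is_sensor_data_type("UnitTest"),
    field_starts_with("classname", "org.hackystat"),
)
```

An entry without a "classname" field simply fails the test, so the expression can be applied safely to every sensor data type.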
In the case of the Browser URL, we could supply something like the following:
  (and (isSensorDataType("BrowserUrl"))
       (fieldStartsWith("url", "http://java.sun.com/")))
Or whatever.
A final idea: this kind of approach probably requires some way to get feedback
on the sensor data that is 'matched' by an expression. I am imagining an analysis in
which you can specify a sensor data type and an interval, and the analysis will list all
of the sensor data of that type in that interval and, for each entry, the
Projects that matched it. This would allow you to create an expression,
then run this analysis to see if the appropriate sensor data was matched against it, then
edit the definition, and so forth.
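As a sketch of what such an analysis might compute, here is a small Python function. It assumes each Project can be reduced to a name plus a predicate over one entry, and that entries carry "type" and "timestamp" keys; the name matching_report and this data layout are hypothetical.

```python
from datetime import date


def matching_report(entries, projects, data_type, start, end):
    """List entries of data_type within [start, end], each with the Projects it matches.

    projects maps a Project name to a predicate over a single entry.
    """
    report = []
    for entry in entries:
        if entry["type"] != data_type:
            continue
        if not (start <= entry["timestamp"] <= end):
            continue
        matched = [name for name, predicate in projects.items() if predicate(entry)]
        report.append((entry, matched))
    return report
```

Entries that come back with an empty list of Projects are the interesting ones: they show data that no expression has claimed yet.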
So, some questions for discussion:
- Does this seem like a good idea to pursue? What issues can you see?
- Can you provide any other scenarios in which the current Project definition mechanism
doesn't work well, so that we can see if this approach would address the difficulties?
Cheers,
Philip