- Does this seem like a good idea to pursue?

Seems interesting... I definitely think that the project definitions can be improved.

The problems that you identified are problems that we have to solve. However, having thought about this solution, I'm not totally sure that addressing these problems at the project definition level is the best possible approach.

What issues can you see?

UnitTest (Coverage, Dependency) Problem:
Wouldn't we lose the ability to find the number of successful unit tests for a top-level module? Without a WorkspaceMap we wouldn't be able to determine whether the metrics from org.hackystat.core.kernel.admin.ServerProperties and hackyCore_Kernel/src/org/hackystat/core/kernel/admin/ServerProperties.java actually refer to the same class.

In addition, imagine that we run the JUnit sensor on all of our Hackystat configurations (Hackystat-Standard, Hackystat-All, Hackystat-<something else>). We could then have many duplicate UnitTest sensor data entries associated with our Hackystat project if we follow the proposed isSensorDataType / fieldStartsWith("org.hackystat") example. This problem gets a lot trickier if we are dealing with Coverage information, since it is a snapshot of the system. Currently, if we actually did run the JUnit sensor on all of our configurations, we would simply not set the workspace root for the configurations that we don't want to analyze. I like this solution because we can send all the data we want from the other configurations and not affect the validity of the project analyses. Who knows, maybe later on we would want a project for a specific configuration.

Are we certain that this higher level project solution is the right solution for the lower level sensor data problem? In other words, could there possibly be another general solution at the sensor data level or sensor level? Stepping back a little... The best possible solution would be that the UnitTest, Coverage, and Dependency SDTs somehow supply a workspace. Are we absolutely sure that there is no way to improve the sensors to get that information?

How about this: JUnit, Emma, and DependencyFinder sensors all have to be run on a computer, thus they all have some sort of execution location. So, even though they do not have a workspace that maps to a file, they do have a workspace that uniquely identifies the location in which the class was "sensed". How does some sort of combination of C:\java\svn\hackyCore_Build\build\junit\hackyCore_Kernel\ and org.hackystat.core.kernel.admin.SensorProperties sound? Well... maybe that won't work. It only solves one of my two issues.



Browser URL Problem: In your example, how would we be able to distinguish between a visit to http://java.sun.com/ for Project Foo versus Project Bar? It seems to me that the Browser URL problem is a bigger problem than the Project Data Membership Expressions can solve. So, I'm not sure this is a good example of the benefit of the project expressions.


- Can you provide any other scenarios in which the current Project definition mechanism doesn't work well, so that we can see if this approach would address the difficulties?

- One of the problems that I had in CLEW Hackystat projects is that some developers stopped clew-hacking, but I didn't want to remove them from the Hackystat project because I would "lose" their data. At the same time, I would prefer to hide their accounts (or make them inactive) somehow.
- Another problem with the CLEW Hackystat projects is that as our architecture changed I created new Hackystat projects. Thus, the new project lost aggregate information like active time. So, maybe workspaces can also become inactive.
- I think I remember hearing something about introducing some sort of roles for project members. But I can't recall what purpose the role would have.
- One problem that will probably come up is Projects that consist of different programming languages. Would that affect the definitions at all?


thanks, aaron


At 07:50 PM 12/7/2005, you wrote:
Greetings, all,

I've been studiously avoiding any Big Thoughts until after 7.0 was out the door, but now the shackles have been thrown off!

So. I've been thinking about Projects. First off, let's recall that Projects are a way of defining related sets of raw sensor data in a Hackystat repository. We currently define a related set of raw sensor data with an implicit "AND" of three conditions: (1) a set of developers (who must confirm membership); (2) a time interval (within which the sensor data must have been received); and (3) a set of Workspaces (which provide a "location" for the sensor data).

Let's also recall that Workspaces serve a very honorable purpose in Hackystat: they allow groups of developers to work together on different platforms with different installations of source directories and have the system be able to tell when developers are working on the same file. There's nothing wrong with Workspaces per se.

There is, however, a problem with the way we define Projects as the AND of (1), (2), and (3). The problem is that while this approach worked fine in the beginning when we had relatively simple forms of raw sensor data, we are increasingly running into more complicated kinds of sensor data. Two examples:

(1) The famous Unit Test sensor data problem. When running unit tests from a jar or binary distribution, we no longer know the source directory that the code came from, so we no longer have a Workspace. The solution was Workspace maps, which have been found to be (a) brittle, and (b) complex. Currently, for example, someone sending Unit Test data can't get that data associated with a Project unless they run a size counter! That totally sucks.

(2) The less famous BrowserURL sensor data problem. Some folks have wanted a sensor for their browser that could record when they were looking at documentation. While one could imagine a sensor data type with "URL" as a required field, it is not at all clear how to transmogrify that into a Workspace so that the data could be associated with a Project.

In the past, we've toyed with solutions involving specifying the project name on the client side and sending it along with the raw data. That has proven to be a very bad solution. For example, it does not allow that sensor data to be associated with any other projects that might be defined in the future.

At an abstract level, what our current Project definition mechanism does is create a "Project Data Membership Expression" of something like the following:

(and
 (or (sensor-data-owner = "[EMAIL PROTECTED]")
     (sensor-data-owner = "[EMAIL PROTECTED]"))
 (sensor-data-start-date = "10-Nov-2005")
 (sensor-data-end-date = "undefined")
 (or (sensor-data-workspace = "hackyCore_Build")
     (sensor-data-workspace = "hackyCore_Kernel")))

Abstractly, each sensor data record in the repository is tested against that expression, and if the expression evaluates to true, then that sensor data is part of that project. Of course, we are smart about the way we "evaluate" this expression so that we don't actually traverse the entire repository!
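
As a rough illustration of the expression above, here is a minimal Python sketch of the implicit AND of owner, time interval, and workspace. The record layout (a dict with "owner", "date", and "workspace" keys), the function name, and the placeholder email addresses are all assumptions made for the example, not the actual Hackystat data model. It also does the naive per-record test rather than the smarter evaluation mentioned above.

```python
from datetime import date

# Hypothetical record layout and names; the real Hackystat classes differ.
def in_project(entry, owners, start, end, workspaces):
    """Implicit AND of the owner, time interval, and workspace tests."""
    owner_ok = entry["owner"] in owners
    after_start = entry["date"] >= start
    before_end = end is None or entry["date"] <= end  # None models "undefined"
    workspace_ok = entry.get("workspace") in workspaces
    return owner_ok and after_start and before_end and workspace_ok

# Placeholder owners; the start date and workspaces come from the expression above.
project = dict(
    owners={"dev1@example.org", "dev2@example.org"},
    start=date(2005, 11, 10),
    end=None,
    workspaces={"hackyCore_Build", "hackyCore_Kernel"},
)

file_entry = {"owner": "dev1@example.org", "date": date(2005, 12, 1),
              "workspace": "hackyCore_Kernel"}
# A unit test entry run from a jar carries no workspace at all.
jar_test_entry = {"owner": "dev1@example.org", "date": date(2005, 12, 1)}

print(in_project(file_entry, **project))
print(in_project(jar_test_entry, **project))
```

The second entry fails only the workspace test, which is exactly the Unit Test situation described in (1).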

What I'm proposing is to enhance the Project definition mechanism with the ability to define "Membership Expressions" that would enable us to indicate that a given piece of sensor data should be considered part of a Project using properties of the sensor data entry other than its owner, timestamp, and workspace. Given the right set of operators, we should be able to provide a simple, yet expressive way of associating sensor data with a Project that overcomes our current problems. My idea would be to retain the current member definition approach (since we need to do the whole confirmation email routine), retain the start/end specification (since that's the nicest way to do it), make workspace selection _optional_, and then add a textarea in which someone could type in a "Project Data Membership Expression" (very similar to the "Expert" telemetry analysis mode). There is an implicit "OR" between the Workspace and PDME fields--if the sensor data satisfies the Workspace test, it's in regardless of whether it satisfies the PDME test.
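
To make the implicit "OR" concrete, here is a small Python sketch covering only the Workspace/PDME part of the test; the owner and time interval checks would still be ANDed in front of it. The function name, the dict-based entries, and the idea of passing the PDME as a callable are assumptions for illustration only.

```python
# Hypothetical helper: names and entry layout are illustrative only.
def satisfies_location_test(entry, workspaces, pdme=None):
    """Workspace test OR PDME test; either one is enough."""
    in_workspace = entry.get("workspace") in workspaces
    matches_pdme = pdme is not None and pdme(entry)
    return in_workspace or matches_pdme

# A stand-in PDME: UnitTest data whose classname starts with "org.hackystat".
def hackystat_unit_tests(entry):
    return (entry.get("sdt") == "UnitTest"
            and str(entry.get("classname", "")).startswith("org.hackystat"))

workspaces = {"hackyCore_Kernel"}
edit = {"sdt": "Activity", "workspace": "hackyCore_Kernel"}
test_run = {"sdt": "UnitTest", "classname": "org.hackystat.core.kernel.admin.ServerProperties"}
unrelated = {"sdt": "UnitTest", "classname": "com.example.Foo"}

print(satisfies_location_test(edit, workspaces, hackystat_unit_tests))
print(satisfies_location_test(test_run, workspaces, hackystat_unit_tests))
print(satisfies_location_test(unrelated, workspaces, hackystat_unit_tests))
```

With workspace selection optional, a project could pass an empty set of workspaces and rely on the PDME alone.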

So, for example, how would this approach solve the famous Unit Test sensor data problem? Well, for the case of the Hackystat project, we could supply the following expression:

(and (isSensorDataType("UnitTest"))
    (fieldStartsWith("classname", "org.hackystat")))

The syntax probably needs some work, but the basic idea is that we have an operator called "isSensorDataType" which evaluates to true if the data item is of that type, and another called "fieldStartsWith" that takes two arguments: the name of the field, and the prefix to match against that field's string value.

I claim this solves the problem of Unit Tests by stating that a unit test sensor data entry is part of the Hackystat project if it contains a field (either required or optional) called "classname" and if its String value has the prefix "org.hackystat".
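
One way the two operators might behave, sketched in Python as small predicate builders. This is only an assumption about their semantics based on the description above: the entry layout (an "sdt" name plus a "fields" dict holding required and optional fields) and the all_of combinator standing in for "and" are invented for the example.

```python
# Hypothetical operator implementations; not the actual Hackystat API.
def is_sensor_data_type(sdt_name):
    """True if the entry is of the named sensor data type."""
    return lambda entry: entry.get("sdt") == sdt_name

def field_starts_with(field, prefix):
    """True if the entry has the field and its string value has the prefix."""
    def test(entry):
        value = entry.get("fields", {}).get(field)
        return isinstance(value, str) and value.startswith(prefix)
    return test

def all_of(*tests):
    """Stands in for the 'and' in the expression syntax."""
    return lambda entry: all(t(entry) for t in tests)

hackystat_unit_tests = all_of(
    is_sensor_data_type("UnitTest"),
    field_starts_with("classname", "org.hackystat"),
)

entry = {"sdt": "UnitTest",
         "fields": {"classname": "org.hackystat.core.kernel.admin.ServerProperties"}}
print(hackystat_unit_tests(entry))
```

An entry that lacks the "classname" field simply fails the test instead of raising an error, which matches the "if it contains a field" wording.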

In the case of the Browser URL, we could supply something like the following:

(and (isSensorDataType("BrowserUrl"))
    (fieldStartsWith("url", "http://java.sun.com/")))

Or whatever.

A final idea: with this kind of approach, it probably requires some way to get feedback on the sensor data that is 'matched' by an expression. I am imagining an analysis in which you can specify a sensor data type, and an interval, and the analysis will list all of the sensor data for that time interval with that type and for each entry, which Projects were matched against that data. This would allow you to create an expression, then run this analysis to see if the appropriate sensor data was matched against it, then edit the definition, and so forth.
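
A rough Python sketch of what such a feedback analysis could compute. Everything here is hypothetical: the function name, the dict-based entries, and the representation of each Project as a name mapped to a predicate are assumptions made for the example.

```python
from datetime import date

# Hypothetical analysis: for each entry of the given type within the interval,
# list the names of the Projects whose expression matches it.
def matched_projects(entries, projects, sdt_name, start, end):
    report = []
    for entry in entries:
        if entry["sdt"] != sdt_name or not (start <= entry["date"] <= end):
            continue
        names = sorted(name for name, expr in projects.items() if expr(entry))
        report.append((entry["url"], names))
    return report

def url_prefix(prefix):
    return lambda entry: entry.get("url", "").startswith(prefix)

# Two illustrative projects that both use the same URL prefix.
projects = {
    "Foo": url_prefix("http://java.sun.com/"),
    "Bar": url_prefix("http://java.sun.com/"),
}
entries = [
    {"sdt": "BrowserUrl", "date": date(2005, 12, 7), "url": "http://java.sun.com/"},
    {"sdt": "BrowserUrl", "date": date(2005, 12, 7), "url": "http://example.org/"},
    {"sdt": "BrowserUrl", "date": date(2005, 11, 1), "url": "http://java.sun.com/"},
]

report = matched_projects(entries, projects, "BrowserUrl",
                          date(2005, 12, 1), date(2005, 12, 31))
for url, names in report:
    print(url, names)
```

A listing like this also makes it visible when one entry is matched by several Projects, or by none.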

So, some questions for discussion:

- Does this seem like a good idea to pursue? What issues can you see?

- Can you provide any other scenarios in which the current Project definition mechanism doesn't work well, so that we can see if this approach would address the difficulties?

Cheers,
Philip
