Re: [HACKYSTAT-DEV-L] RFC: Project data membership expressions

Aaron Kagawa Thu, 08 Dec 2005 02:19:34 -0800

- Does this seem like a good idea to pursue?

Seems interesting... I definitely think that the project definitions can be improved.

The problems that you identified are problems that we have to solve. Although, as I was thinking about this solution, I'm not totally sure addressing these problems at the project definition level is the best possible solution.

What issues can you see?


UnitTest (Coverage, Dependency) Problem:

Wouldn't we lose the ability to find the number of successful unit tests for a top level module? Without a WorkspaceMap we wouldn't be able to determine if the metrics from org.hackystat.core.kernel.admin.ServerProperties and hackyCore_Kernel/src/org/hackystat/core/kernel/admin/ServerProperties.java are actually the same.

In addition, imagine that we run the JUnit sensor on all of our Hackystat configurations (Hackystat-Standard, Hackystat-All, Hackystat-<something else>). Then we could have many duplicate UnitTest sensor data entries associated with our Hackystat project if we follow the proposed isSensorDataType / fieldStartsWith("org.hackystat") example. This problem gets a lot trickier if we are dealing with Coverage information, since it is a Snapshot of the system. Currently, if we actually did run the JUnit sensor on all of our configurations, then we would simply not set the workspace root for the configurations that we don't want to analyze. I like this solution because we can send all the data we want from the other configurations and not affect the validity of the project analyses. Who knows, maybe later on we would want a project for a specific configuration.

Are we certain that this higher level project solution is the right solution for the lower level sensor data problem? In other words, could there possibly be another general solution at the sensor data level or sensor level? Stepping back a little... The best possible solution would be that the UnitTest, Coverage, and Dependency SDTs somehow supply a workspace. Are we absolutely sure that there is no way to improve the sensors to get that information?

How about this: the JUnit, Emma, and DependencyFinder sensors all have to be run on a computer, thus they all have some sort of execution location. So, even though they do not have a workspace that maps to a file, they do have a workspace that uniquely identifies the location in which the class was "sensed". How does some sort of combination of C:\java\svn\hackyCore_Build\build\junit\hackyCore_Kernel\ and org.hackystat.core.kernel.admin.SensorProperties sound? Well... maybe that won't work. It only solves one of my two issues.

Browser URL Problem: In your example, how would we be able to distinguish between a visit to http://java.sun.com/ for Project Foo versus Project Bar? It seems to me that the Browser URL problem is a bigger problem than the Project Data Membership Expressions can solve. So, I'm not sure this is a good example of the benefit of the project expressions.
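To make the ambiguity concrete, here is a minimal Python sketch. The dictionary layout of the sensor data entry, the helper field_starts_with, and the visited URL are all assumptions made for illustration, not Hackystat code; only the operator name and the URL prefix come from the proposal.

```python
def field_starts_with(field, prefix):
    """Build a test that is true when entry[field] is a string with the given prefix."""
    def test(entry):
        value = entry.get(field)
        return isinstance(value, str) and value.startswith(prefix)
    return test


# Two hypothetical projects whose members both read the Java documentation.
project_expressions = {
    "Foo": field_starts_with("url", "http://java.sun.com/"),
    "Bar": field_starts_with("url", "http://java.sun.com/"),
}

# One browser visit (made-up URL); nothing in it says which project it was for.
visit = {"sdt": "BrowserUrl", "url": "http://java.sun.com/docs/index.html"}

matched = sorted(name for name, expr in project_expressions.items() if expr(visit))
print(matched)  # ['Bar', 'Foo']
```

Both projects claim the same visit, so the expression alone cannot attribute it to one of them.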

- Can you provide any other scenarios in which the current Project definition mechanism doesn't work well, so that we can see if this approach would address the difficulties?

- One of the problems that I had in CLEW Hackystat projects is that some developers stopped clew-hacking but I didn't want to remove them from the Hackystat project because I would "lose" their data. At the same time, I would prefer to hide their accounts (or make them inactive) somehow.
- Another problem with the CLEW Hackystat projects is that as our architecture changed I created new Hackystat projects. Thus, the new project lost aggregate information like active time. So, maybe workspaces can also become inactive.
- I think I remember hearing something about introducing some sort of roles to the project member. But I can't recall what purpose the role would have.
- One problem that will probably come up is Projects that consist of different programming languages. Would that affect the definitions at all?



thanks, aaron


At 07:50 PM 12/7/2005, you wrote:

Greetings, all,
I've been studiously avoiding any Big Thoughts until after 7.0 was out the door, but now the shackles have been thrown off!
So. I've been thinking about Projects. First off, let's recall that Projects are a way of defining related sets of raw sensor data in a Hackystat repository. We currently define a related set of raw sensor data with an implicit "AND" of three conditions: (1) a set of developers (who must confirm membership); (2) a time interval (within which the sensor data must have been received); and (3) a set of Workspaces (which provide a "location" for the sensor data).
Let's also recall that Workspaces serve a very honorable purpose in Hackystat: they allow groups of developers to work together on different platforms with different installations of source directories and have the system be able to tell when developers are working on the same file. There's nothing wrong with Workspaces per se.
There is, however, a problem with the way we define Projects as the AND of (1), (2), and (3). The problem is that while this approach worked fine in the beginning when we had relatively simple forms of raw sensor data, we are increasingly running into more complicated kinds of sensor data. Two examples:
(1) The famous Unit Test sensor data problem. When running unit tests from a jar or binary distribution, we no longer know the source directory that the code came from, so we no longer have a Workspace. The solution was Workspace maps, which have been found to be (a) brittle, and (b) complex. Currently, for example, someone sending Unit Test data can't get that data associated with a Project unless they run a size counter! That totally sucks.
(2) The less famous BrowserURL sensor data problem. Some folks have wanted a sensor for their browser that could record when they were looking at documentation. While one could imagine a sensor data type with "URL" as a required field, it is not at all clear how to transmogrify that into a Workspace so that the data could be associated with a Project.
In the past, we've toyed with solutions involving specifying the project name on the client side and sending it along with the raw data. That has proven to be a very bad solution. For example, it does not allow that sensor data to be associated with any other projects that might be defined in the future.
At an abstract level, what our current Project definition mechanism does is create a "Project Data Membership Expression" of something like the following:
(and
 (or (sensor-data-owner = "[EMAIL PROTECTED]")
     (sensor-data-owner = "[EMAIL PROTECTED]"))
 (sensor-data-start-date = "10-Nov-2005")
 (sensor-data-end-date = "undefined")
 (or (sensor-data-workspace = "hackyCore_Build")
     (sensor-data-workspace = "hackyCore_Kernel")))
Abstractly, each sensor data record in the repository is tested against that expression, and if the expression evaluates to true, then that sensor data is part of that project. Of course, we are smart about the way we "evaluate" this expression so that we don't actually traverse the entire repository!
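As a rough illustration of that abstract test, here is a minimal Python sketch of the implicit AND. The SensorData and Project classes, the example.org addresses, and the reading of the start/end dates as an inclusive interval are assumptions for illustration only; this naive version tests one record at a time and ignores whatever optimizations the real system uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class SensorData:
    owner: str
    timestamp: date
    workspace: Optional[str]  # None when the sensor cannot supply one


@dataclass(frozen=True)
class Project:
    members: FrozenSet[str]
    start: date
    end: Optional[date]  # None stands in for "undefined" (open-ended)
    workspaces: FrozenSet[str]


def is_member(project: Project, data: SensorData) -> bool:
    """Implicit AND of owner, time interval, and workspace."""
    in_interval = project.start <= data.timestamp and (
        project.end is None or data.timestamp <= project.end
    )
    return (
        data.owner in project.members
        and in_interval
        and data.workspace in project.workspaces
    )


# Hypothetical project mirroring the expression above.
hackystat = Project(
    members=frozenset({"dev1@example.org", "dev2@example.org"}),
    start=date(2005, 11, 10),
    end=None,
    workspaces=frozenset({"hackyCore_Build", "hackyCore_Kernel"}),
)

edit = SensorData("dev1@example.org", date(2005, 12, 1), "hackyCore_Kernel")
unit_test = SensorData("dev1@example.org", date(2005, 12, 1), None)

print(is_member(hackystat, edit))       # True
print(is_member(hackystat, unit_test))  # False: no workspace, so the data is left out
```

The second record shows the Unit Test problem in miniature: the owner and date are fine, but without a Workspace the record can never satisfy the AND.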
What I'm proposing is to enhance the Project definition mechanism with the ability to define "Membership Expressions" that would enable us to indicate that a given piece of sensor data should be considered part of a Project using properties of the sensor data entry other than its owner, timestamp, and workspace. Given the right set of operators, we should be able to provide a simple, yet expressive way of associating sensor data to a Project that overcomes our current problems. My idea would be to retain the current member definition approach (since we need to do the whole confirmation email routine), retain the start/end specification (since that's the nicest way to do it), make workspace selection _optional_, and then add a textarea in which someone could type in a "Project Data Membership Expression" (very similar to the "Expert" telemetry analysis mode). There is an implicit "OR" between the Workspace and PDME fields--if the sensor data satisfies the Workspace test, it's in regardless of whether it satisfies the PDME test.
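One possible reading of that combination, sketched in Python: the member and interval tests stay ANDed, and the Workspace test is ORed with the PDME. The function in_project, the dictionary layout of an entry, and the example.org address are hypothetical, and whether members and dates remain mandatory alongside a PDME is an assumption, not something settled here.

```python
from datetime import date


def in_project(entry, members, start, end, workspaces, pdme=None):
    """owner AND interval AND (workspace OR pdme); pdme is an optional callable."""
    owner_ok = entry["owner"] in members
    time_ok = start <= entry["timestamp"] and (end is None or entry["timestamp"] <= end)
    workspace_ok = entry.get("workspace") in workspaces
    pdme_ok = pdme is not None and bool(pdme(entry))
    return owner_ok and time_ok and (workspace_ok or pdme_ok)


# A unit test entry that carries a class name but no workspace.
entry = {
    "owner": "dev1@example.org",
    "timestamp": date(2005, 12, 1),
    "sdt": "UnitTest",
    "classname": "org.hackystat.core.kernel.admin.ServerProperties",
}
members = {"dev1@example.org"}
workspaces = {"hackyCore_Kernel"}


def hackystat_unit_tests(e):
    # Stand-in for a parsed PDME (hypothetical).
    return e.get("sdt") == "UnitTest" and str(e.get("classname", "")).startswith("org.hackystat")


without_pdme = in_project(entry, members, date(2005, 11, 10), None, workspaces)
with_pdme = in_project(entry, members, date(2005, 11, 10), None, workspaces, hackystat_unit_tests)
print(without_pdme, with_pdme)  # False True
```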
So, for example, how would this approach solve the famous Unit Test sensor data problem? Well, for the case of the Hackystat project, we could supply the following expression:
(and (isSensorDataType("UnitTest"))
    (fieldStartsWith("classname", "org.hackystat")))
The syntax probably needs some work, but the basic idea is that we have an operator called "isSensorDataType" which evaluates to true if the data item is of that type, and another called "fieldStartsWith" that takes two arguments: the name of the field and the prefix to match against that field's value.
I claim this solves the problem of Unit Tests by stating that a unit test sensor data entry is part of the Hackystat project if it contains a field (either required or optional) called "classname" and if its String value has the prefix "org.hackystat".
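A small Python sketch of how the two operators could behave when modeled as functions that build tests over a sensor data entry. The dictionary layout (an "sdt" key plus a "fields" mapping), the helper all_of, and the sample class names other than ServerProperties are assumptions for illustration; only the operator names and the "org.hackystat" prefix come from the expression above.

```python
def is_sensor_data_type(name):
    """True when the entry's sensor data type equals name."""
    return lambda entry: entry.get("sdt") == name


def field_starts_with(field, prefix):
    """True when the entry has a string field with the given prefix."""
    def test(entry):
        value = entry.get("fields", {}).get(field)
        return isinstance(value, str) and value.startswith(prefix)
    return test


def all_of(*tests):
    """Plays the role of the (and ...) form: every sub-test must hold."""
    return lambda entry: all(test(entry) for test in tests)


unit_test_expr = all_of(
    is_sensor_data_type("UnitTest"),
    field_starts_with("classname", "org.hackystat"),
)

entries = [
    {"sdt": "UnitTest", "fields": {"classname": "org.hackystat.core.kernel.admin.ServerProperties"}},
    {"sdt": "UnitTest", "fields": {"classname": "com.example.Widget"}},         # wrong prefix
    {"sdt": "Coverage", "fields": {"classname": "org.hackystat.core.Thing"}},   # wrong type
    {"sdt": "UnitTest", "fields": {}},                                          # field missing
]

results = [unit_test_expr(e) for e in entries]
print(results)  # [True, False, False, False]
```

An entry that lacks the field simply fails the test rather than raising an error, which seems like the safer default when the field may be optional.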
In the case of the Browser URL, we could supply something like the following:

(and (isSensorDataType("BrowserUrl"))
    (fieldStartsWith("url", "http://java.sun.com/")))

Or whatever.
A final idea: with this kind of approach, it probably requires some way to get feedback on the sensor data that is 'matched' by an expression. I am imagining an analysis in which you can specify a sensor data type, and an interval, and the analysis will list all of the sensor data for that time interval with that type and for each entry, which Projects were matched against that data. This would allow you to create an expression, then run this analysis to see if the appropriate sensor data was matched against it, then edit the definition, and so forth.
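A minimal Python sketch of what such a feedback analysis might compute. The function matched_projects_report, the dictionary layout of an entry, the sample data, and the representation of each Project as a plain test function are hypothetical; the sketch only illustrates the idea of listing, per entry, the Projects whose expression matched.

```python
from datetime import date


def matched_projects_report(entries, projects, sdt, start, end):
    """For each entry of the given type inside [start, end], list the matching projects."""
    report = []
    for entry in entries:
        if entry["sdt"] != sdt or not (start <= entry["timestamp"] <= end):
            continue
        names = sorted(name for name, expr in projects.items() if expr(entry))
        report.append((entry["classname"], names))
    return report


# Hypothetical project expressions, written directly as Python functions.
projects = {
    "Hackystat": lambda e: e["classname"].startswith("org.hackystat"),
    "HackyKernel": lambda e: e["classname"].startswith("org.hackystat.core.kernel"),
}

entries = [
    {"sdt": "UnitTest", "timestamp": date(2005, 12, 1),
     "classname": "org.hackystat.core.kernel.admin.ServerProperties"},
    {"sdt": "UnitTest", "timestamp": date(2005, 12, 2), "classname": "com.example.Widget"},
    {"sdt": "UnitTest", "timestamp": date(2005, 10, 1), "classname": "org.hackystat.Old"},
]

report = matched_projects_report(
    entries, projects, "UnitTest", date(2005, 11, 10), date(2005, 12, 31)
)
for classname, names in report:
    print(classname, "->", names if names else "(no project)")
```

Entries that match no Project are kept in the report on purpose, since those are exactly the ones that tell you an expression needs editing.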
So, some questions for discussion:

- Does this seem like a good idea to pursue? What issues can you see?
- Can you provide any other scenarios in which the current Project definition mechanism doesn't work well, so that we can see if this approach would address the difficulties?
Cheers,
Philip
