Re: [HACKYSTAT-DEV-L] RFC: Project data membership expressions

Philip Johnson Thu, 08 Dec 2005 12:25:34 -0800

Hi Aaron,

Excellent thoughts, as usual!

If I understand you correctly, you are bringing up a related issue in Project representation, which can be thought of as the "precision of location" with respect to the sensor data. This is not exactly the same as the issue I was addressing, which is the task of "associating" sensor data with (one or more) Projects.

To clarify the distinction, let's say that someone edits line 33 of the file c:\svn\hackyCore_Build\build.xml. There are the following basic levels of "precision of location" that we could speak of with respect to that action:

1. An Edit event occurred in Project Hackystat-7
2. An Edit event occurred in Project Hackystat-7, Workspace hackyCore_Build
3. An Edit event occurred in Project Hackystat-7, Workspace hackyCore_Build, File build.xml
4. An Edit event occurred in Project Hackystat-7, Workspace hackyCore_Build, File build.xml, Line 33


When I look at it this way, I see the following:

- Our sensors generally collect data at Level 3 in terms of precision of location.

- Our analyses generally produce information at Levels 1 and 2 in terms of precision of location.

- My Project Membership proposal would enable the design of sensors that collect data at Level 1, and thus would limit analyses on such data to Level 1.

- My Project Membership proposal would require current analyses, which tend to be designed on the assumption that all sensor data has Level 2 precision of location, to check the precision of location of sensor data, since some sensor data might not have the precision required for a particular analysis.

- There is no a priori reason why we couldn't just as well demand higher precisions in sensors and exploit them in analyses, for example telemetry over individual files or even telemetry over individual lines in individual files. So far, we've simply operated under the assumption that these kinds of analyses were "too fine grained" to be useful.
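
To make the "check the precision of location" point concrete, here is a minimal sketch in Python. Everything in it (the Precision enum, the Location record, the usable_for function) is hypothetical and invented for illustration; it is not Hackystat code, only one way an analysis could ask whether a piece of sensor data is located precisely enough.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Precision(IntEnum):
    """The four levels of precision of location listed above."""
    PROJECT = 1
    WORKSPACE = 2
    FILE = 3
    LINE = 4


@dataclass
class Location:
    """Hypothetical location record; fields left as None are unknown."""
    project: str
    workspace: Optional[str] = None
    file: Optional[str] = None
    line: Optional[int] = None

    def precision(self) -> Precision:
        # Each level requires every coarser level to be known as well.
        if self.workspace is None:
            return Precision.PROJECT
        if self.file is None:
            return Precision.WORKSPACE
        if self.line is None:
            return Precision.FILE
        return Precision.LINE


def usable_for(location: Location, required: Precision) -> bool:
    """True if the data is located precisely enough for the analysis."""
    return location.precision() >= required


edit = Location("Hackystat-7", "hackyCore_Build", "build.xml", 33)
unit_test = Location("Hackystat-7")  # associated with the Project only

print(usable_for(edit, Precision.WORKSPACE))       # True
print(usable_for(unit_test, Precision.WORKSPACE))  # False
print(usable_for(unit_test, Precision.PROJECT))    # True
```

An analysis designed for Level 2 would call something like usable_for before processing each entry and skip, or report, the entries that only reach Level 1.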

So, you are absolutely right in saying that Project Data Membership Expressions don't completely "solve" the Unit Test problem. What PDMEs do is enable us to implement sensors and sensor data types with less precision, with the belief that this data can still be useful even though analyses can only operate at that reduced level of precision. In the case of Unit Tests, it would allow us to use telemetry streams that track unit tests at Level 1 of precision. While we lose the ability to track unit tests at Level 2, we gain the ability to not require FileMetric data and the associated mappings and so forth.

I would recast the BrowserUrl problem in the same way in the light of your comments. Right now, we can't do anything with browser data. PDMEs would allow Level 1 style of analyses relatively easily. To get more fine-grained analyses, such as those you bring up, one would of course require more complexity, such as some kind of mapping mechanism.

Cam has previously commented on the desire to analyze different branches of his source code independently. When you start bringing configuration management in, it adds yet another "dimension" of "precision":


1. An Edit event occurred in Project Hackystat-7 [branch foo]
2. An Edit event occurred in Project Hackystat-7, Workspace hackyCore_Build [branch foo]
3. An Edit event occurred in Project Hackystat-7, Workspace hackyCore_Build, File build.xml [branch foo]
4. An Edit event occurred in Project Hackystat-7, Workspace hackyCore_Build, File build.xml, Line 33 [branch foo]


Comments?

Cheers,
Philip


--On Thursday, December 08, 2005 12:21 AM -1000 Aaron Kagawa <[EMAIL PROTECTED]> wrote:

- Does this seem like a good idea to pursue?


Seems interesting... I definitely think that the project definitions can be improved.

The problems that you identified are problems that we have to solve. Although, as I was thinking about this solution, I'm not totally sure addressing these problems at the project definition level is the best possible solution.

What issues can you see?


UnitTest (Coverage, Dependency) Problem:
Wouldn't we lose the ability to find the number of successful unit tests for a top level module? Without a WorkspaceMap we wouldn't be able to determine if the metrics from org.hackystat.core.kernel.admin.ServerProperties and hackyCore_Kernel/src/org/hackystat/core/kernel/admin/ServerProperties.java are actually the same.
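
As a rough illustration of the correspondence in question, the Python sketch below derives a class name and a top level module from a source path. It is not how WorkspaceMaps are implemented; the function name and the assumption that sources live under a "src" directory are invented for this example. It also shows why the mapping cannot be rebuilt from the class name alone: the module name only appears in the file path.

```python
from pathlib import PurePosixPath


def module_and_class(path, source_root="src"):
    """Split a source path into (top level module, dotted class name).

    Assumes a hypothetical layout of <module>/<source_root>/<package dirs>/<Class>.java.
    Raises ValueError if source_root is not a directory in the path.
    """
    parts = PurePosixPath(path).with_suffix("").parts
    root_index = parts.index(source_root)
    module = parts[0]
    class_name = ".".join(parts[root_index + 1:])
    return module, class_name


path = "hackyCore_Kernel/src/org/hackystat/core/kernel/admin/ServerProperties.java"
module, class_name = module_and_class(path)

print(module)      # hackyCore_Kernel
print(class_name)  # org.hackystat.core.kernel.admin.ServerProperties
```

Going the other way, from org.hackystat.core.kernel.admin.ServerProperties back to hackyCore_Kernel, needs the file data that such a map records.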

In addition, imagine that we run the JUnit sensor on all of our Hackystat configurations (Hackystat-Standard, Hackystat-All, Hackystat-<something else>). Then we could have many duplicate UnitTest sensor data entries associated with our Hackystat project if we follow the proposed isSensorDataType / fieldStartsWith("org.hackystat") example. This problem gets a lot trickier if we are dealing with Coverage information, since it is a Snapshot of the system. Currently, if we actually did run the JUnit sensor on all of our configurations, then we would simply not set the workspace root for the configurations that we don't want to analyze. I like this solution because we can send all the data we want from the other configurations and not affect the validity of the project analyses. Who knows, maybe later on we would want a project for a specific configuration.

Are we certain that this higher level project solution is the right solution for the lower level sensor data problem? In other words, could there possibly be another general solution at the sensor data level or sensor level? Stepping back a little... The best possible solution would be that the UnitTest, Coverage, and Dependency SDTs somehow supply a workspace. Are we absolutely sure that there is no way to improve the sensors to get that information?

How about this: JUnit, Emma, DependencyFinder sensors all have to be run on a computer, thus they all have some sort of execution location. So, even though they do not have a workspace that maps to a file, they do have a workspace that uniquely identifies the location in which the class was "sensed". How does some sort of combination of C:\java\svn\hackyCore_Build\build\junit\hackyCore_Kernel\ and org.hackystat.core.kernel.admin.SensorProperties sound? Well.. maybe that won't work. It only solves one of my two issues.



Browser URL Problem: In your example, how would we be able to distinguish between a visit to http://java.sun.com/ for Project Foo versus Project Bar? It seems to me that the Browser URL problem is a bigger problem than the Project Data Membership Expressions can solve. So, I'm not sure this is a good example of the benefit of the project expressions.

- Can you provide any other scenarios in which the current Project
definition mechanism doesn't work well, so that we can see if this
approach would address the difficulties?


- One of the problems that I had in CLEW Hackystat projects is that some developers stopped clew-hacking but I didn't want to remove them from the Hackystat project because I would "lose" their data. At the same time, I would prefer to hide their accounts (or make them inactive) somehow.
- Another problem with the CLEW Hackystat projects is that as our architecture changed I created new Hackystat projects. Thus, the new project lost aggregate information like active time. So, maybe workspaces can also become inactive.
- I think I remember hearing something about introducing some sort of roles to the project member. But, I can't recall what purpose the role would have.
- One problem that will probably come up is Projects that consist of different programming languages. Would that affect the definitions at all?


thanks, aaron


At 07:50 PM 12/7/2005, you wrote:

Greetings, all,

I've been studiously avoiding any Big Thoughts until after 7.0 was out the
door, but now the shackles have been thrown off!

So. I've been thinking about Projects.  First off, let's recall that
Projects are a way of defining related sets of raw sensor data in a
Hackystat repository.  We currently define a related set of raw sensor
data with an implicit "AND" of three conditions: (1) a set of developers
(who must confirm membership); (2) a time interval (within which the
sensor data must have been received); and (3) a set of Workspaces (which
provide a "location" for the sensor data).

Let's also recall that Workspaces serve a very honorable purpose in
Hackystat: they allow groups of developers to work together on different
platforms with different installations of source directories and have the
system be able to tell when developers are working on the same
file.  There's nothing wrong with Workspaces per se.

There is, however, a problem with the way we define Projects as the AND of
(1), (2), and (3).  The problem is that while this approach worked fine in
the beginning when we had relatively simple forms of raw sensor data, we
are increasingly running into more complicated kinds of sensor data.  Two
examples:

(1) The famous Unit Test sensor data problem. When running unit tests from
a jar or binary distribution, we no longer know the source directory that
the code came from, so we no longer have a Workspace.  The solution was
Workspace maps, which have been found to be (a) brittle, and (b)
complex.  Currently, for example, someone sending Unit Test data can't get
that data associated with a Project unless they run a size counter!  That
totally sucks.

(2) The less famous BrowserURL sensor data problem.  Some folks have
wanted a sensor for their browser that could record when they were looking
at documentation. While one could imagine a sensor data type with "URL" as
a required field, it is not at all clear how to transmogrify that into a
Workspace so that the data could be associated with a Project.

In the past, we've toyed with solutions involving specifying the project
name on the client side and sending it along with the raw data.  That has
proven to be a very bad solution.  For example, it does not allow that sensor
data to be associated with any other projects that might be defined in the
future.

At an abstract level, what our current Project definition mechanism does
is create a "Project Data Membership Expression" of something like the
following:

(and
 (or (sensor-data-owner = "[EMAIL PROTECTED]")
     (sensor-data-owner = "[EMAIL PROTECTED]"))
 (sensor-data-start-date = "10-Nov-2005")
 (sensor-data-end-date = "undefined")
 (or (sensor-data-workspace = "hackyCore_Build")
     (sensor-data-workspace = "hackyCore_Kernel")))

Abstractly, each sensor data record in the repository is tested against
that expression, and if the expression evaluates to true, then that sensor
data is part of that project. Of course, we are smart about the way we
"evaluate" this expression so that we don't actually traverse the entire
repository!
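
For illustration only, here is the same implicit AND written as a naive filter in Python. The dictionary layout and the in_project function are hypothetical, the owner addresses are placeholders (the originals are redacted in the archive), and reading the start and end dates as bounds on the record's timestamp is my assumption. Unlike the real system, this sketch does test every record.

```python
from datetime import date

# Hypothetical project definition mirroring the expression above.
project = {
    "owners": {"dev1@example.org", "dev2@example.org"},  # placeholder addresses
    "start": date(2005, 11, 10),
    "end": None,  # "undefined": no end date has been set
    "workspaces": {"hackyCore_Build", "hackyCore_Kernel"},
}


def in_project(record, project):
    """Implicit AND of owner, time interval, and workspace."""
    owner_ok = record["owner"] in project["owners"]
    after_start = record["timestamp"] >= project["start"]
    before_end = project["end"] is None or record["timestamp"] <= project["end"]
    workspace_ok = record.get("workspace") in project["workspaces"]
    return owner_ok and after_start and before_end and workspace_ok


records = [
    {"owner": "dev1@example.org", "timestamp": date(2005, 12, 8),
     "workspace": "hackyCore_Build"},
    # No workspace, as with Unit Test data run from a jar: excluded.
    {"owner": "dev1@example.org", "timestamp": date(2005, 12, 8),
     "workspace": None},
    # Not a project member: excluded.
    {"owner": "dev3@example.org", "timestamp": date(2005, 12, 8),
     "workspace": "hackyCore_Build"},
]

members = [r for r in records if in_project(r, project)]
print(len(members))  # 1
```

The second record is the interesting one: it has the right owner and date, but without a workspace it can never satisfy the AND.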

What I'm proposing is to enhance the Project definition mechanism with the
ability to define "Membership Expressions" that would enable us to
indicate that a given piece of sensor data should be considered part of a
Project using properties of the sensor data entry other than its owner,
timestamp, and workspace.  Given the right set of operators, we should be
able to provide a simple, yet expressive way of associating sensor data to
Project that overcomes our current problems.  My idea would be to retain
the current member definition approach (since we need to do the whole
confirmation email routine), retain the start/end specification (since
that's the nicest way to do it), make workspace selection _optional_, and
then add a textarea in which someone could type in a "Project Data
Membership Expression" (very similar to the "Expert" telemetry analysis
mode). There is an implicit "OR" between the Workspace and PDME fields--if
the sensor data satisfies the Workspace test, it's in regardless of
whether it satisfies the PDME test.
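
A minimal Python sketch of how I read that combination: members and the time interval stay as required conditions, and the Workspace test and the PDME test are ORed. The names here (is_member, the "pdme" key, the placeholder address) are hypothetical, and the PDME is modeled as a plain Python function rather than a parsed expression.

```python
from datetime import date


def is_member(record, project):
    """Owner AND interval AND (Workspace OR PDME)."""
    owner_ok = record["owner"] in project["owners"]
    in_interval = project["start"] <= record["timestamp"] and (
        project["end"] is None or record["timestamp"] <= project["end"])
    workspace_ok = record.get("workspace") in project["workspaces"]
    pdme = project.get("pdme")  # optional, like the proposed textarea
    pdme_ok = pdme is not None and bool(pdme(record))
    return owner_ok and in_interval and (workspace_ok or pdme_ok)


project = {
    "owners": {"dev1@example.org"},  # placeholder address
    "start": date(2005, 11, 10),
    "end": None,
    "workspaces": {"hackyCore_Build"},
    # Stand-in for a typed-in expression: any UnitTest entry.
    "pdme": lambda record: record.get("sdt") == "UnitTest",
}

edit = {"owner": "dev1@example.org", "timestamp": date(2005, 12, 8),
        "sdt": "Activity", "workspace": "hackyCore_Build"}
test_run = {"owner": "dev1@example.org", "timestamp": date(2005, 12, 8),
            "sdt": "UnitTest", "workspace": None}
stray = {"owner": "dev1@example.org", "timestamp": date(2005, 12, 8),
         "sdt": "Activity", "workspace": None}

print(is_member(edit, project))      # True, through the Workspace test
print(is_member(test_run, project))  # True, through the PDME test
print(is_member(stray, project))     # False, neither test passes
```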

So, for example, how would this approach solve the famous Unit Test sensor
data problem? Well, for the case of the Hackystat project, we could supply
the following expression:

(and (isSensorDataType("UnitTest"))
    (fieldStartsWith("classname", "org.hackystat")))

The syntax probably needs some work, but the basic idea is that we have an
operator called "isSensorDataType" which evaluates to true if the data
item is of that type, and another called "fieldStartsWith" that takes two
arguments: the name of the field, and the prefix to match against that field's value.

I claim this solves the problem of Unit Tests by stating that a unit test
sensor data entry is part of the Hackystat project if it contains a field
(either required or optional) called "classname" and if its String value
has the prefix "org.hackystat".
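
One way the two operators could behave, sketched in Python. The function names follow the operators above, but the record layout (an "sdt" key plus a "fields" dictionary) and the all_of helper standing in for "and" are assumptions made for this example.

```python
def is_sensor_data_type(name):
    """Build a test that is true when the entry is of the given type."""
    return lambda record: record.get("sdt") == name


def field_starts_with(field, prefix):
    """Build a test that is true when the entry has the named field
    (required or optional) and its String value starts with prefix."""
    def test(record):
        value = record.get("fields", {}).get(field)
        return isinstance(value, str) and value.startswith(prefix)
    return test


def all_of(*tests):
    """Stand-in for the "and" operator of the expression syntax."""
    return lambda record: all(test(record) for test in tests)


hackystat_unit_tests = all_of(
    is_sensor_data_type("UnitTest"),
    field_starts_with("classname", "org.hackystat"),
)

ours = {"sdt": "UnitTest",
        "fields": {"classname": "org.hackystat.core.kernel.admin.ServerProperties"}}
theirs = {"sdt": "UnitTest", "fields": {"classname": "com.example.Foo"}}
no_field = {"sdt": "UnitTest", "fields": {}}

print(hackystat_unit_tests(ours))      # True
print(hackystat_unit_tests(theirs))    # False
print(hackystat_unit_tests(no_field))  # False
```

An entry that lacks the field simply fails the test rather than raising an error, which seems the safer default when optional fields are involved.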

In the case of the Browser URL, we could supply something like the following:

(and (isSensorDataType("BrowserUrl"))
    (fieldStartsWith("url", "http://java.sun.com/")))

Or whatever.

A final idea: with this kind of approach, it probably requires some way to
get feedback on the sensor data that is 'matched' by an expression.  I am
imagining an analysis in which you can specify a sensor data type, and an
interval, and the analysis will list all of the sensor data for that time
interval with that type and for each entry, which Projects were matched
against that data.  This would allow you to create an expression, then run
this analysis to see if the appropriate sensor data was matched against
it, then edit the definition, and so forth.
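
A rough Python sketch of such a feedback analysis. The match_report function, the record layout, and the representation of each Project as a name mapped to a test function are all hypothetical.

```python
from datetime import date


def match_report(records, projects, sdt, start, end):
    """List each entry of the given type within [start, end] together
    with the names of the Projects whose expression matches it."""
    report = []
    for record in records:
        if record["sdt"] != sdt:
            continue
        if not (start <= record["timestamp"] <= end):
            continue
        matched = [name for name, matches in projects.items() if matches(record)]
        report.append((record, matched))
    return report


# Hypothetical Projects, each reduced to a single membership test.
projects = {
    "Hackystat": lambda r: r.get("classname", "").startswith("org.hackystat"),
    "Everything": lambda r: True,
}

records = [
    {"sdt": "UnitTest", "timestamp": date(2005, 12, 8),
     "classname": "org.hackystat.core.kernel.admin.ServerProperties"},
    {"sdt": "UnitTest", "timestamp": date(2005, 12, 8),
     "classname": "com.example.Foo"},
    {"sdt": "Activity", "timestamp": date(2005, 12, 8)},
]

for record, matched in match_report(records, projects, "UnitTest",
                                    date(2005, 12, 1), date(2005, 12, 31)):
    print(record.get("classname"), "->", matched)
```

An entry that comes back with an empty list of Projects, or with more Projects than expected, is the cue to go back and edit the expression.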

So, some questions for discussion:

- Does this seem like a good idea to pursue? What issues can you see?

- Can you provide any other scenarios in which the current Project
definition mechanism doesn't work well, so that we can see if this
approach would address the difficulties?

Cheers,
Philip
