Hey Guys,

I had another one of those nights of thinking... These comments span a couple of domains, from telemetry to the JIRA sensor. Anyway, here goes.
------------------------------------------------------------------------------------


Telemetry - need to put telemetry streams into context
Today, we were looking at spikes in telemetry streams. Just noticing spikes is a great accomplishment, but I've learned at JPL that we need to understand the causes in order to introduce changes. This was illustrated pretty well today, when Philip was trying to remember whether a spike in Unit Test Active Time should be associated with his work on the Hackystat Test Case Framework.


I believe context is important for developers and managers who don't know everything that is going on, or simply can't remember it all. If we are lucky, there is one "expert" on a project who knows about everything that goes on in project development. He/she can analyze problems and knows the specific actions that caused them. However, the rest of us don't know exactly what everyone else is doing, and when we see spikes in telemetry we have no idea what they mean. I would claim that this is a major problem with our software telemetry at this point; we see spikes, declines, or even steadiness but have no idea why it is occurring.

There are several ways to introduce context... (1) using other streams to explain spikes. For example, Cedric showed us one hypothesis that the number of developers working in a given module will cause the module's coverage to go down. The number of developers is the context for coverage; it explains the situation. However, we ran into problems today because the number of developers had no context itself; we were wondering what those developers were doing. So the second way could be (2) adding "contextual" streams: tooltips for commit messages, textual labels, or issue reports of defects and enhancements.
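Just to make (2) a little more concrete, here is a rough sketch of what a telemetry data point carrying its own context might look like. Everything here, class and method names included, is made up for illustration; it is not the real Telemetry code.

  // Hypothetical sketch: a telemetry data point that carries its own context.
  import java.util.ArrayList;
  import java.util.Date;
  import java.util.List;

  public class AnnotatedDataPoint {
    private final Date timestamp;
    private final double value;
    // Free-form context: commit messages, issue keys, textual labels, etc.
    private final List<String> annotations = new ArrayList<String>();

    public AnnotatedDataPoint(Date timestamp, double value) {
      this.timestamp = timestamp;
      this.value = value;
    }

    public void addAnnotation(String note) {
      this.annotations.add(note);
    }

    // A chart could render this as a tooltip when the user mouses over the point.
    public String getTooltipText() {
      return timestamp + ": " + value + " " + annotations;
    }
  }

A chart component could then show the annotations as tooltips, so the "why" lives right next to the spike instead of in somebody's head.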

For example, when I was at JPL I would show telemetry streams to JPL personnel working on MDS. They would see spikes, declines, and steady areas all over the place. None of them could remember why those spikes happened, and therefore we had little knowledge of what they meant. Luckily, I was able to bug a couple of people and gather enough bits and pieces of information to understand why they occurred and hypothesize about what they meant.

Bottom Line: we need to remove the guesswork required to understand the spikes, declines, and steadiness of a single stream before we can understand how streams relate to one another.
------------------------------------------------------------------------------------


Telemetry - we need to know when there are problems - red flags
Currently, we are guessing whether things are going well or badly based on telemetry streams. But we already have good/bad indicators in Build Results and Issue Defects. I would claim that we need to use Build Results and Defects to indicate "Red Flags" in the telemetry streams. Each Red Flag in a stream or scene would indicate a place to look for spikes, declines, or steady spots. This would be an improvement over our current process of just looking for spikes, declines, or steadiness, because who knows if those actually relate to something significant. Red Flags flip the process from 'looking for changes leads to something interesting' to 'something interesting (aka a red flag) helps us understand interesting changes'.


For example, if we see a Build Failure, then we can focus our attention on a time frame and the telemetry streams that could indicate why that Build Failure happened. This is another example of putting the telemetry streams into a context that we can understand. Reported Defects can have a similar effect.

I don't quite remember how the Build Sensor works, but it seems to me that we would need to know which workspace caused the build failure and what type of build failure it was. For example, a build failure could be caused by a checkstyle error in hackyStdExt or a JUnit failure in hackyKernel. If we had information like this, then we could narrow down our guessing about what caused the problem.
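I don't know what the Build Sensor actually records today, so take this with a grain of salt, but the extra fields I'm asking for might boil down to something as simple as this (names invented):

  // Hypothetical sketch of the extra build-failure fields I'm asking for.
  import java.util.Date;

  public class BuildFailureInfo {
    private final String workspace;    // e.g. "hackyStdExt" or "hackyKernel"
    private final String failureType;  // e.g. "checkstyle", "junit", "compile"
    private final Date timestamp;

    public BuildFailureInfo(String workspace, String failureType, Date timestamp) {
      this.workspace = workspace;
      this.failureType = failureType;
      this.timestamp = timestamp;
    }

    public String toString() {
      return failureType + " failure in " + workspace + " at " + timestamp;
    }
  }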

Another type of Red Flag is threshold horizontal lines. If a spike or decline crosses some threshold, that indicates something interesting... We can then focus our attention on the areas before and after that red flag. What would be interesting to know is which changes caused that red flag to occur, and what reverse changes occurred to make it go away. Other horizontal lines, like the yearly average or monthly average, could also provide information that puts the values into context.
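For the threshold idea, here is a minimal sketch of flagging the points in a stream that drop below a horizontal line. Again, none of this is real Telemetry API; it is just to show how cheap the check would be:

  // Hypothetical sketch: flag the points in a stream that fall below a threshold.
  import java.util.ArrayList;
  import java.util.List;

  public class ThresholdRedFlag {
    // Returns the indices of data points below the threshold line.
    public static List<Integer> findRedFlags(double[] stream, double threshold) {
      List<Integer> flags = new ArrayList<Integer>();
      for (int i = 0; i < stream.length; i++) {
        if (stream[i] < threshold) {
          flags.add(i);
        }
      }
      return flags;
    }

    public static void main(String[] args) {
      // Pretend this is weekly coverage; the monthly average (80%) is the threshold line.
      double[] coverage = {85.0, 84.0, 60.0, 83.0, 86.0};
      System.out.println(findRedFlags(coverage, 80.0));  // prints [2]
    }
  }

The same loop with the comparison flipped would catch spikes above a line, and the yearly or monthly average could be plugged in as the threshold.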

Bottom line: I believe people will be more interested in Telemetry when things are going badly or there is a prediction of bad things to come (we haven't started on the predictive part yet). So, we need to provide users with information about measurable bad things, which would be defects, rework, build failures, etc.
------------------------------------------------------------------------------------


Telemetry Design - we need help from Information Architects or HCI people
Wow... I thought I'd never say this, but we could use the help of HCI and IA experts in displaying information effectively. Consider this: when I first look at a new telemetry stream, it takes me a little while to figure out what all those lines are (I have to figure that out before I can start to think about what they mean). I would claim that there must be better ways of presenting this information. Anyway, that was just a thought.


Here are a couple of ideas I just thought of:
- High, Low and Steady values should be easily detectable without having to really look hard.
- Comparing streams should be easy. The whole point of the telemetry wall is to be able to compare streams effectively, but I think it is currently hard to do so. One example is trying to compare a stream in screen 1 with a stream in screen 9.
- Place the knowledge required to understand the telemetry streams in the telemetry streams themselves. Currently, we require our users (mainly just us) to carry the knowledge needed to understand the streams in our heads (this comes from Don Norman: knowledge in the world versus knowledge in the head). This goes back to context, but labeling is also an issue.
------------------------------------------------------------------------------------


JIRA Sensor improvement - what to do with old data
I've just realized that the JIRA sensor will send data to Hackystat only from the point it was installed, meaning that old issues will never be sent to Hackystat. This is a problem for projects that have years' worth of JIRA data from before installing Hackystat and the JIRA Hackystat sensor. Someone would be required to "Update" all the old entries to get that data sent to Hackystat for processing. I would say that is a Bad Thing, but I'm not sure how we would fix it.


Similarly, the Jupiter Review Eclipse Plugin Sensor uses that same model. In fact, we have conducted about 4 reviews without the Jupiter sensor working, and that data will stay in "Hackystat limbo" unless someone reprocesses those reviews. In a much earlier email I stated that this could be a potential problem and suggested that an Ant sensor send these review issues off to Hackystat, thus ensuring that all issues are accounted for. Although this would be harder, I feel that the Jupiter Sensor should collect Review Activity and the Ant Sensor should collect metrics about the actual Review Issues. Again, I would claim that IDE-based sensors should just collect activity-type sensor data and Ant sensors should collect product-type data. This model seems to work well for Activity and FileMetrics.

Ok... I know you're thinking that this proposal will make it harder for people who don't use Ant. Could we do both? Or just make it an option?
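To show what I mean by the Ant side, here is a very rough sketch of an Ant task that would walk the review files and send the issues off. I'm making up the attribute name and the directory layout, and I've left the actual SensorShell call as a comment since I don't have the API in front of me:

  // Very rough sketch of an Ant task that would collect Jupiter review issues.
  import java.io.File;
  import org.apache.tools.ant.BuildException;
  import org.apache.tools.ant.Task;

  public class ReviewIssueSensorTask extends Task {
    private File reviewDir;

    // Set from build.xml, e.g. <reviewissuesensor reviewdir="review/"/>  (made-up names)
    public void setReviewDir(File reviewDir) {
      this.reviewDir = reviewDir;
    }

    public void execute() throws BuildException {
      if (reviewDir == null || !reviewDir.isDirectory()) {
        throw new BuildException("reviewDir must point at the Jupiter review files");
      }
      for (File reviewFile : reviewDir.listFiles()) {
        // Real version: parse the review file, extract each Review Issue,
        // and hand the issue data to a SensorShell for shipment to Hackystat.
        log("Would process review issues from " + reviewFile.getName());
      }
    }
  }

Because it runs from Ant, it doesn't care whether the Jupiter plugin was working when the review happened; it just picks up whatever review files are sitting in the workspace.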

------------------------------------------------------------------------------------

JIRA Sensor
Burt, did you find a "shutdown" event handler? It should just execute send on all the sensor shells one last time.
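In case it helps, Java lets you register a JVM shutdown hook, so even without a JIRA-specific shutdown event something along these lines might do the trick. The List of Runnables is just a stand-in for however the sensor actually keeps track of its SensorShells:

  // Minimal sketch of a JVM-level "shutdown" handler (not the real sensor code).
  import java.util.List;

  public class ShutdownSender {
    // sendActions stands in for "call send() on each SensorShell".
    public static void install(final List<Runnable> sendActions) {
      Runtime.getRuntime().addShutdownHook(new Thread() {
        public void run() {
          // One last send for every shell before the JVM exits.
          for (Runnable send : sendActions) {
            send.run();
          }
        }
      });
    }
  }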



comments?

thanks, aaron



