Hi Cedric,

> Why do we have to put them together? Why not have two types of daily
> project objects, one for telemetry-style analysis and one for daily
> project details-style analysis? My 2 cents.

This is an interesting question. We should look into this further, and figure out whether:

(a) DailyProjectDetails and Telemetry have no way in principle to share caching mechanisms, and the code would be clearer as two distinct classes.

(b) The current separation is an artifact of different people working on this code at different times, and a redesign of this class might show opportunities to share code/infrastructure.

(c) Our recent performance issues might indicate that we need to revisit our overall approach to caching in order to avoid looping "silos" that result in the same raw data being revisited over and over. Such a redesign of our analysis approach might change things such that what is now (a) becomes (b), or what is now (b) becomes (a)!

I have actually been contemplating a rather radical thought: what if the dailyprojectdata objects define "listeners" that are passed a sensor data instance, and instead of repeatedly looping through the sensor data for a day, the caching infrastructure instead does exactly one pass through the sensor data, calling each defined "listener" on each sensor data instance in turn? Then an analysis like dailyprojectsummary would loop through all defined dailyprojectdata objects, create a list of the relevant listeners, and then do one pass through the sensor data for the day.
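To make the idea concrete, here is a minimal sketch of the one-pass/many-listeners design. All names here (SensorDataListener, UnitTestCounter, etc.) are illustrative stand-ins, not actual classes in our codebase:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface: each dailyprojectdata object would
// register one of these instead of looping over the data itself.
interface SensorDataListener {
  void process(SensorData data);
}

// Simplified stand-in for a sensor data instance.
class SensorData {
  final String type;
  final String path;
  SensorData(String type, String path) { this.type = type; this.path = path; }
}

// Two example listeners, each accumulating its own analysis state.
class UnitTestCounter implements SensorDataListener {
  int count = 0;
  public void process(SensorData data) {
    if ("UnitTest".equals(data.type)) { count++; }
  }
}

class CommitCounter implements SensorDataListener {
  int count = 0;
  public void process(SensorData data) {
    if ("Commit".equals(data.type)) { count++; }
  }
}

public class OnePassDemo {
  // Exactly one pass over the day's data; every listener sees every instance.
  static void runOnePass(List<SensorData> dayData, List<SensorDataListener> listeners) {
    for (SensorData data : dayData) {
      for (SensorDataListener listener : listeners) {
        listener.process(data);
      }
    }
  }

  public static void main(String[] args) {
    List<SensorData> day = new ArrayList<>();
    day.add(new SensorData("UnitTest", "/proj/Foo.java"));
    day.add(new SensorData("Commit", "/proj/Foo.java"));
    day.add(new SensorData("UnitTest", "/proj/Bar.java"));

    UnitTestCounter tests = new UnitTestCounter();
    CommitCounter commits = new CommitCounter();
    List<SensorDataListener> listeners = new ArrayList<>();
    listeners.add(tests);
    listeners.add(commits);

    runOnePass(day, listeners);
    System.out.println(tests.count + " " + commits.count); // prints "2 1"
  }
}
```

The point is that the cost of the loop becomes O(data * listeners) in cheap dispatch calls, rather than O(data) repeated once per analysis "silo."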

This would work well if we had a "fast" way of doing workspace comparison, and if each sensor data instance cached the workspace instance associated with its path. I think. :-) There are probably other issues I haven't realized yet.
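On the "each sensor data instance caches the workspace instance associated with its path" part, the simplest version is a memoized path-to-workspace lookup, so the (potentially slow) workspace computation runs at most once per distinct path. Again, a hypothetical sketch with made-up names, not our actual workspace code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: memoize the workspace associated with each path.
public class WorkspaceCache {
  private final Map<String, String> pathToWorkspace = new HashMap<>();

  // Illustrative stand-in for the real (and possibly expensive)
  // workspace computation; here it just takes the parent directory.
  private String computeWorkspace(String path) {
    int slash = path.lastIndexOf('/');
    return (slash <= 0) ? "/" : path.substring(0, slash);
  }

  // Computes on first request for a path, returns the cached value after.
  public String getWorkspace(String path) {
    return pathToWorkspace.computeIfAbsent(path, this::computeWorkspace);
  }
}
```

With tens of thousands of sensor data instances per day but far fewer distinct paths, this kind of cache plus a fast workspace equality check is what would make the single-pass design pay off.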

As always, I have been wondering whether the performance problem has to do with not having a fast backend relational database for storing the sensor data. But it doesn't look like such a change would help us here: the slowdown seems related to workspace computation and comparison, which happens after the sensor data is retrieved from the repository, and/or to algorithms that are exponential in the number of top-level workspaces.

I've also been wondering whether there is something intrinsically wrong in our conceptualization of workspaces or projects---are there alternative ways to organize the data that would achieve the same ends without incurring these kinds of problems? So far, I haven't been able to come up with anything obviously better.

To put this in perspective, we've been increasing the functionality and expressiveness of the system for around two years now with relatively little effort put into performance issues. We're now dealing with a single project that generates tens of thousands of sensor data instances per day, and performing many very different kinds of analyses on that data stream. It's not unreasonable that we are now discovering that some of our "simplistic" implementations are not scaling well. I am hopeful that, just as Hongbing identified and removed a problem in DailyProjectUnitTest in three days of work, we can work together to think through the broader issues in a relatively short period of time.

Comments?

Cheers,
Philip
