I don't quite follow the "listener" approach. Perhaps spending some time
talking about it in the meeting would be a good idea. But I am convinced
that merely using a DB without major architectural change won't solve
our problem. -Cedric
Philip Johnson wrote:
Hi Cedric,
You wrote:
Why do we have to put them together? Why not have two types of daily
project objects, one for telemetry-style analysis and one for daily
project details-style analysis? My 2 cents.
This is an interesting question. We should look into this further and
figure out whether:
(a) DailyProjectDetails and Telemetry cannot, in principle, share
caching mechanisms, and the code would therefore be clearer as two
distinct classes.
(b) The current separation is an artifact of different people working
on this code at different times, and a redesign might reveal
opportunities to share code and infrastructure.
(c) Our recent performance issues indicate that we need to revisit
our overall approach to caching in order to eliminate the looping
"silos" that cause the same raw data to be traversed over and over.
Such a redesign of our analysis approach might change things such that
what is now (a) becomes (b), or what is now (b) becomes (a)!
I have actually been contemplating a rather radical thought: what if
the dailyprojectdata objects defined "listeners" that are passed a
sensor data instance, and the caching infrastructure, instead of
repeatedly looping through the sensor data for a day, did exactly one
pass through the sensor data, calling each defined "listener" on each
instance in turn? Then an analysis like dailyprojectsummary would loop
through all defined dailyprojectdata objects, collect a list of the
relevant listeners, and do a single pass through the sensor data for
the day.
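To make this concrete, here is roughly the shape I have in mind. This
is only a sketch, and all of the names (SensorData, SensorDataListener,
SinglePassDispatcher) are made up for illustration, not proposals for
the actual API:

  import java.util.ArrayList;
  import java.util.List;

  /** Placeholder for our existing sensor data type. */
  interface SensorData {
  }

  /** Hypothetical callback implemented by each dailyprojectdata object. */
  interface SensorDataListener {
    /** Called once for each sensor data instance in the day's stream. */
    void process(SensorData data);
  }

  /** Traverses the day's sensor data exactly once, notifying every listener. */
  class SinglePassDispatcher {
    private final List<SensorDataListener> listeners =
        new ArrayList<SensorDataListener>();

    void addListener(SensorDataListener listener) {
      listeners.add(listener);
    }

    /** The single pass: each instance is visited once; each listener sees it. */
    void dispatch(Iterable<SensorData> dayData) {
      for (SensorData data : dayData) {
        for (SensorDataListener listener : listeners) {
          listener.process(data);
        }
      }
    }
  }

The payoff would be that the cost of walking the raw sensor data is
paid once per day regardless of how many dailyprojectdata analyses are
registered, rather than once per analysis.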
This would work well if we had a "fast" way of doing workspace
comparison, and if each sensor data instance cached the workspace
instance associated with its path. I think. :-) There are probably
other issues I haven't thought of yet.
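For the workspace-caching piece, even something as simple as memoizing
the path-to-workspace mapping might do. Again, Workspace and
WorkspaceCache are made-up names, purely for illustration:

  import java.util.HashMap;
  import java.util.Map;

  /** Hypothetical workspace type; the real one would carry more structure. */
  class Workspace {
    private final String root;

    Workspace(String root) {
      this.root = root;
    }
  }

  /** Caches the workspace instance associated with each sensor data path. */
  class WorkspaceCache {
    private final Map<String, Workspace> cache =
        new HashMap<String, Workspace>();

    /** Returns the workspace for a path, computing it at most once per path. */
    Workspace getWorkspace(String path) {
      Workspace workspace = cache.get(path);
      if (workspace == null) {
        workspace = computeWorkspace(path); // the currently expensive step
        cache.put(path, workspace);
      }
      return workspace;
    }

    /** Stand-in for the real (expensive) workspace computation. */
    private Workspace computeWorkspace(String path) {
      return new Workspace(path);
    }
  }

A nice side effect: two sensor data instances with the same path would
then share the same Workspace object, so at least those workspace
comparisons could become cheap reference-equality tests instead of
recomputations.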
As always, I have been wondering whether the performance problem stems
from not having a fast backend relational database for storing the
sensor data. It doesn't look like such a change would help us here,
since the slowdown seems to come from workspace computation and
comparison, processing that occurs after the sensor data is retrieved
from the repository, and/or from algorithms that are exponential in
the number of top-level workspaces.
I've also been wondering whether there is something intrinsically
wrong in our conceptualization of workspaces or projects: are there
alternative ways to organize the data that would achieve the same ends
without incurring these kinds of problems? So far, I haven't been able
to come up with anything obviously better.
To put this in perspective, we've been increasing the functionality
and expressiveness of the system for around two years now with
relatively little effort put into performance issues. We're now
dealing with a single project that generates tens of thousands of
sensor data instances per day, and we're performing many very
different kinds of analyses on that data stream. It's not unreasonable
to discover that some of our "simplistic" implementations are not
scaling well. I am hopeful that, just as
Hongbing identified and removed a problem in DailyProjectUnitTest in
three days of work, we can work together to think through the broader
issues in a relatively short period of time.
Comments?
Cheers,
Philip