Well, time series data usually comes at regular periods (the time between one timestamp and the next is always 300 seconds), so you can just divide the delta by the constant 300:
analysis = foreach snapped generate ((double)delta / 300.0) as per_period;

I guess what you are suggesting would work:

snapped = foreach X generate start_of_period, total_so_far;
snapped2 = foreach X generate start_of_period - 300 as start_of_previous_period, total_so_far;
diff_join = join snapped by start_of_period, snapped2 by start_of_previous_period;
diff = foreach diff_join generate start_of_previous_period as start_of_period, (double)(snapped2::total_so_far - snapped::total_so_far) / 300.0 as rate_of_change;

This seems to fit the M-R paradigm: trade efficiency for scalability. You can use this to process an arbitrarily large dataset just by buying twice the computers you would otherwise need to process it with an iterator... But remember, you won't need to write *any* unit tests, synchronization, file systems, or operating systems to make it happen, just the five lines of code above. Does this sound right to everyone else?

On Thu, May 6, 2010 at 10:13 PM, Dan Di Spaltro <[email protected]> wrote:

> Right now I have a pig script to rollup timeseries data.
>
> The current format of the data is in the following tab separated value
> list:
> ts service-uuid service-name type value
>
> So the first step is to take each timestamp and snap it to a period.
> For 5 min rollups I use something like this:
> snapped = FOREACH X GENERATE SnapTs(300, ts) ....
>
> And then I group and average and count over that group, which is great
> and easy. The next bit is to show the change from 0 -> 5 min, so
> basically I want to take point A's avg, subtract it from point B's avg,
> and divide by the timestamps to get the rate of change between the
> points, but I am not sure how to do that. For instance, one idea I
> had was to create another dataset like this:
>
> previous = FOREACH snapped GENERATE $0 + 300, ....
>
> GROUP previous BY (...), snapped BY (...)
>
> But that seems like a waste; I am just having a hard time modeling
> that. Any help would be appreciated.
>
> Best,
>
> --
> Dan Di Spaltro
>
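To sanity-check the logic, here is a minimal Python sketch of the same pipeline, assuming fixed 300-second periods. The names snap_ts and rate_of_change are hypothetical stand-ins for the SnapTs UDF and the self-join above; they are not part of the original Pig script.

```python
PERIOD = 300  # seconds per rollup bucket

def snap_ts(period, ts):
    # Snap a raw timestamp down to the start of its period,
    # analogous to SnapTs(300, ts) in the original script.
    return ts - (ts % period)

def rate_of_change(averages):
    # averages: dict mapping period start -> averaged value for that period.
    # Emulates the self-join: pair each period with the one 300 seconds
    # later and divide the delta by the period length.
    rates = {}
    for start, value in averages.items():
        nxt = start + PERIOD
        if nxt in averages:
            rates[start] = (averages[nxt] - value) / PERIOD
    return rates
```

For example, with per-period averages {0: 10.0, 300: 40.0, 600: 70.0}, rate_of_change returns {0: 0.1, 300: 0.1}: each period's rate is the forward difference to the next period divided by 300.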
