Thanks for the response, I appreciate the verbose example.

On Wed, May 12, 2010 at 5:43 PM, hc busy <[email protected]> wrote:
> Well, time series data usually comes in regular periods (the time between
> one timestamp and the next is always 300 seconds), so you can just divide
> the delta by the constant 300:
>
> analysis = foreach snapped generate ((double)delta / 300.0) as per_period;
>
> I guess what you are suggesting would work:
>
> snapped = foreach X generate start_of_period, total_so_far;
> snapped2 = foreach X generate start_of_period - 300 as
>     start_of_previous_period, total_so_far;
> diff_join = join snapped by start_of_period, snapped2 by
>     start_of_previous_period;
> diff = foreach diff_join generate start_of_previous_period as
>     start_of_period, (snapped2::total_so_far - snapped::total_so_far)/300
>     as rate_of_change;
>
> This seems to fit the M-R paradigm: trade efficiency for scalability.
> You can use this to compute over arbitrarily large datasets just by buying
> twice the computers you would otherwise need to compute it with an
> iterator... But remember, you won't need to write *any* unit tests,
> synchronization, file system, or operating system to make it happen, just
> the above five lines of code.
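For anyone skimming the thread, here is a minimal sketch of that self-join trick in plain Python (the data is hypothetical; it assumes regular 300-second periods, as the Pig script above does):

```python
# Sketch of the self-join rate-of-change computation above, in plain Python.
# Rows are hypothetical (start_of_period, total_so_far) pairs, already
# snapped to regular 300-second period boundaries.
PERIOD = 300

snapped = [(0, 10.0), (300, 25.0), (600, 55.0)]

# "snapped2": shift each row back one period so it lines up with its
# predecessor, keyed by start_of_previous_period.
snapped2 = {start - PERIOD: total for start, total in snapped}

# The join on start_of_period == start_of_previous_period, plus the
# (later - earlier) / PERIOD rate calculation.
diff = [
    (start, (snapped2[start] - total) / PERIOD)
    for start, total in snapped
    if start in snapped2
]
# diff holds (start_of_period, rate_of_change) pairs; the last period has
# no successor, so it drops out of the join, just as in the Pig version.
```

The dict shift plays the role of the `snapped2` relation and the join key; in Pig the same pairing happens in the shuffle.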
This is probably the best possible way of putting it. Thanks for the input =).

> Does this sound right to everyone else?
>
> On Thu, May 6, 2010 at 10:13 PM, Dan Di Spaltro <[email protected]> wrote:
>> Right now I have a Pig script to roll up time series data.
>>
>> The current format of the data is the following tab-separated value list:
>>
>> ts  service-uuid  service-name  type  value
>>
>> So the first step is to take each timestamp and snap it to a period.
>> For 5-min rollups I use something like this:
>>
>> snapped = FOREACH X GENERATE SnapTs(300, ts) ...
>>
>> And then I group and average and count over that group, which is great
>> and easy. The next bit is to show the change from 0 -> 5 min, so
>> basically I want to take point A's avg, subtract it from point B's avg,
>> and divide by the timestamps to get the rate of change between the
>> points, but I am not sure how to do that. For instance, one idea I had
>> was to create another dataset like this:
>>
>> previous = FOREACH snapped GENERATE $0 + 300, ...
>>
>> GROUP previous BY (...), snapped BY (...)
>>
>> But that seems like a waste; I am just having a hard time modeling
>> that. Any help would be appreciated.
>>
>> Best,
>>
>> --
>> Dan Di Spaltro

--
Dan Di Spaltro
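For reference, the snap-to-period step can be sketched in Python too. `snap_ts` here is only a guess at what the `SnapTs` UDF does (floor the timestamp to the start of its period); the timestamps and values are made up:

```python
# Hypothetical stand-in for the SnapTs(period, ts) UDF used above:
# floor a Unix timestamp to the start of its period (period in seconds).
def snap_ts(period, ts):
    return ts - (ts % period)

# Snapping arbitrary timestamps into 5-minute (300 s) buckets; rows that
# land in the same bucket can then be grouped and averaged.
rows = [(1273186987, 4.2), (1273187100, 5.0), (1273187284, 6.1)]
snapped = [(snap_ts(300, ts), value) for ts, value in rows]
# The last two rows share the bucket starting at 1273187100.
```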
