Well, time series data usually comes at regular periods (the time between one timestamp and the next is always 300 seconds), so you can just divide the delta by the constant 300:
analysis = foreach snapped generate ((double)delta / 300.0) as per_period;

I guess what you are suggesting would work:

snapped = foreach X generate start_of_period, total_so_far;
snapped2 = foreach X generate start_of_period - 300 as start_of_previous_period, total_so_far;
diff_join = join snapped by start_of_period, snapped2 by start_of_previous_period;
diff = foreach diff_join generate start_of_previous_period as start_of_period, (double)(snapped2::total_so_far - snapped::total_so_far) / 300.0 as rate_of_change;

This seems to fit the M-R paradigm: trade efficiency for scalability. You can use this to process an arbitrarily large dataset just by buying twice the computers you would otherwise need to process it with an iterator... But remember, you won't need to write *any* unit tests, synchronization, file systems, or operating systems to make it happen, just the five lines of code above. Does this sound right to everyone else?

On Thu, May 6, 2010 at 10:13 PM, Dan Di Spaltro <[email protected]> wrote:

> Right now I have a pig script to rollup timeseries data.
>
> The current format of the data is in the following tab separated value
> list:
> ts service-uuid service-name type value
>
> So the first step is to take each timestamp and snap it to a period.
> For 5 min rollups I use something like this:
> snapped = FOREACH X GENERATE SnapTs(300, ts) ....
>
> And then I group and average and count over that group, which is great
> and easy. The next bit is to show the change from 0 -> 5 min, so
> basically I want to take point A's avg, subtract it from point B's avg,
> and divide by the timestamps to get the rate of change between the
> points, but I am not sure how to do that. For instance, one idea I
> had was to create another dataset like this:
>
> previous = FOREACH snapped GENERATE $0 + 300, ....
>
> GROUP previous BY (...), snapped BY (...)
>
> But that seems like a waste; I am just having a hard time modeling
> that. Any help would be appreciated.
>
> Best,
>
> --
> Dan Di Spaltro
>
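To sanity-check the logic, here is a minimal Python sketch of the same pipeline, assuming fixed 300-second periods. The names snap_ts and rate_of_change are hypothetical stand-ins for the SnapTs UDF and the self-join above; they are not part of the original Pig script.

```python
PERIOD = 300  # seconds per rollup bucket

def snap_ts(period, ts):
    # Snap a raw timestamp down to the start of its period,
    # analogous to SnapTs(300, ts) in the original script.
    return ts - (ts % period)

def rate_of_change(averages):
    # averages: dict mapping period start -> averaged value for that period.
    # Emulates the self-join: pair each period with the one 300 seconds
    # later and divide the delta by the period length.
    rates = {}
    for start, value in averages.items():
        nxt = start + PERIOD
        if nxt in averages:
            rates[start] = (averages[nxt] - value) / PERIOD
    return rates
```

For example, with per-period averages {0: 10.0, 300: 40.0, 600: 70.0}, rate_of_change returns {0: 0.1, 300: 0.1}: each period's rate is the forward difference to the next period divided by 300.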
