Thanks for the response, I appreciate the verbose example.

On Wed, May 12, 2010 at 5:43 PM, hc busy <[email protected]> wrote:
> Well, time series data usually comes in regular periods (the time between
> one timestamp and the next is always 300 seconds), so you can just divide
> the delta by the constant 300:
>
> analysis = foreach snapped generate ((double)delta / 300.0) as per_period;
>
> I guess what you are suggesting would work:
>
> snapped = foreach X generate start_of_period, total_so_far;
> snapped2 = foreach X generate start_of_period-300 as
> start_of_previous_period, total_so_far;
> diff_join = join snapped by start_of_period, snapped2 by
> start_of_previous_period;
> diff = foreach diff_join generate start_of_previous_period as
> start_of_period, (snapped2::total_so_far - snapped::total_so_far)/300 as
> rate_of_change;
>
>
> This seems to fit the M-R paradigm: trade efficiency for scalability.
> You can use this to compute arbitrarily large datasets just by buying
> twice the computers you would otherwise need to compute it with an
> iterator... But remember, you won't need to write *any* unit tests,
> synchronization, file system, or operating system to make it happen, just
> the above five lines of code.
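For anyone following along, the shift-and-join idea quoted above can be sketched in plain Python. The field names and sample data below are made up for illustration; it assumes regular 300-second periods, as in the Pig script:

```python
# Sketch of the shift-and-join rate-of-change computation: build a
# second copy of the data keyed by the *previous* period's start
# (mirroring `start_of_period - 300 as start_of_previous_period`),
# then join and divide the delta by the period length.
PERIOD = 300

# (start_of_period, total_so_far) pairs -- e.g. a monotone counter.
snapped = [(0, 10.0), (300, 25.0), (600, 55.0)]

# "snapped2": the same rows, re-keyed to the previous period's start.
snapped2 = {start - PERIOD: total for start, total in snapped}

# Join on snapped.start_of_period == snapped2.start_of_previous_period,
# then compute (later_total - earlier_total) / period.
rates = {
    start: (snapped2[start] - total) / PERIOD
    for start, total in snapped
    if start in snapped2
}
# rates == {0: 0.05, 300: 0.1}
```

Rows with no successor period (the last one, 600 here) simply drop out of the join, just as in the Pig version.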

This is probably the best possible way of putting it.  Thanks for
the input =).

>
> Does this sound right to everyone else?
>
>
>
> On Thu, May 6, 2010 at 10:13 PM, Dan Di Spaltro 
> <[email protected]>wrote:
>
>> Right now I have a Pig script to roll up time-series data.
>>
>> The data is currently a tab-separated value list in the following
>> format:
>> ts service-uuid service-name type value
>>
>> So the first step is to take each timestamp and snap it to a period.
>> For 5 min rollups I use something like this:
>> snapped = FOREACH X GENERATE SnapTs(300, ts) ....
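(For readers without the UDF: snapping to a period boundary is just truncation. A hypothetical Python stand-in for SnapTs, whose actual behavior is assumed rather than known, might be:)

```python
# Hypothetical stand-in for the SnapTs(300, ts) UDF above: snap a
# Unix timestamp down to the start of its enclosing 300-second period.
def snap_ts(period, ts):
    return ts - (ts % period)

# Every timestamp in [600, 900) maps to the period starting at 600.
```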
>>
>> And then I group, average, and count over that group, which is great
>> and easy.  The next bit is to show the change from 0 -> 5 min: basically
>> I want to take Point A's avg, subtract it from Point B's avg, and
>> divide by the timestamp difference to get the rate of change between the
>> points, but I am not sure how to do that.  For instance, one idea I
>> had was to create another dataset like this:
>>
>> previous = FOREACH snapped GENERATE $0 + 300, ....
>>
>> GROUP previous BY (...), snapped BY (...)
>>
>> But that seems like a waste; I am just having a hard time modeling
>> that.  Any help would be appreciated.
>>
>> Best,
>>
>> --
>> Dan Di Spaltro
>>
>



-- 
Dan Di Spaltro
