The DateTime UDFs in PiggyBank may be helpful. See http://github.com/apache/pig/tree/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/datetime/truncate/
They feature date truncation, which can help to group logs into whole time
units: day/hour/minute/second, etc. It sounds like a proper rounding function,
by datetime unit (day, hour, minute, etc.) with a number of splits, would be a
good addition. I'll try to get that in there if people think it's a good idea.
I've been meaning to add DateTime support to Pig, but don't have time in the
near future to do so :/

Russ

On Fri, May 14, 2010 at 9:44 AM, Dan Di Spaltro <[email protected]> wrote:

> Thanks for the response, I appreciate the verbose example.
>
> On Wed, May 12, 2010 at 5:43 PM, hc busy <[email protected]> wrote:
> > Well, time series data usually has regular periods (the time between one
> > timestamp and the next is always 300 seconds), so you can just divide
> > the delta by the constant 300:
> >
> > analysis = FOREACH snapped GENERATE ((double)delta / 300.0) as per_period;
> >
> > I guess what you are suggesting would work:
> >
> > snapped = FOREACH X GENERATE start_of_period, total_so_far;
> > snapped2 = FOREACH X GENERATE start_of_period - 300 as
> >     start_of_previous_period, total_so_far;
> > diff_join = JOIN snapped BY start_of_period, snapped2
> >     BY start_of_previous_period;
> > diff = FOREACH diff_join GENERATE start_of_previous_period as
> >     start_of_period,
> >     (snapped2::total_so_far - snapped::total_so_far)/300 as
> >     rate_of_change;
> >
> > This seems to fit the M-R paradigm: trade efficiency for scalability.
> > You can use this to compute over arbitrarily large datasets just by
> > buying twice the computers you would otherwise need to compute it with
> > an iterator... But remember, you won't need to write *any* unit tests,
> > synchronization, file system, or operating system to make it happen,
> > just the above five lines of code.
>
> This is probably the best possible way of putting it. Thanks for
> the input =).
>
> > Does this sound right to everyone else?
> >
> > On Thu, May 6, 2010 at 10:13 PM, Dan Di Spaltro <[email protected]> wrote:
> >
> >> Right now I have a Pig script to roll up time series data.
> >>
> >> The current format of the data is the following tab-separated value
> >> list:
> >>
> >> ts    service-uuid    service-name    type    value
> >>
> >> So the first step is to take each timestamp and snap it to a period.
> >> For 5-minute rollups I use something like this:
> >>
> >> snapped = FOREACH X GENERATE SnapTs(300, ts) ....
> >>
> >> And then I group, average, and count over that group, which is great
> >> and easy. The next bit is to show the change from 0 -> 5 min, so
> >> basically I want to take point A's avg, subtract it from point B's
> >> avg, and divide by the timestamps to get the rate of change between
> >> the points, but I am not sure how to do that. For instance, one idea
> >> I had was to create another dataset like this:
> >>
> >> previous = FOREACH snapped GENERATE $0 + 300, ....
> >>
> >> GROUP previous BY (...), snapped BY (...)
> >>
> >> But that seems like a waste; I am just having a hard time modeling
> >> that. Any help would be appreciated.
> >>
> >> Best,
> >>
> >> --
> >> Dan Di Spaltro
>
> --
> Dan Di Spaltro
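For anyone following along, here's a small Python sketch of the snap-and-average step discussed in the thread. The SnapTs UDF isn't shown in the thread, so its snap-down semantics, and all the data below, are assumptions for illustration only:

```python
from collections import defaultdict

def snap_ts(period, ts):
    """Round a Unix timestamp down to the start of its period
    (assumed SnapTs-like behavior; the real UDF may differ)."""
    return ts - (ts % period)

# Made-up rows in the thread's format: (ts, service-uuid, service-name, type, value)
rows = [
    (1001, "u1", "web", "latency", 12.0),
    (1150, "u1", "web", "latency", 18.0),
    (1400, "u1", "web", "latency", 30.0),
]

# Group by the snapped timestamp, then average each group,
# like the GROUP / AVG step in the Pig script
groups = defaultdict(list)
for ts, uuid, name, typ, value in rows:
    groups[snap_ts(300, ts)].append(value)

averages = {start: sum(vals) / len(vals) for start, vals in groups.items()}
print(averages)
```

Rows at ts 1001 and 1150 both snap to period start 900, so they average together; the row at 1400 snaps to 1200.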
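And hc busy's self-join for rate of change can be sketched the same way: re-key each period's total to the previous period's start, join on timestamp, and divide the difference by the period length. The data is made up; the names mirror the Pig aliases (start_of_period, total_so_far):

```python
PERIOD = 300  # seconds per period, matching the constant in the Pig script

# (start_of_period, total_so_far) pairs for regular 300-second periods
snapped = [(0, 10.0), (300, 25.0), (600, 45.0), (900, 50.0)]

# snapped2: the same rows re-keyed to the previous period's start,
# like "start_of_period - 300 as start_of_previous_period" in the Pig
snapped2 = {start - PERIOD: total for start, total in snapped}

# Join snapped by start_of_period with snapped2 by start_of_previous_period,
# then (later total - earlier total) / period gives the rate of change
rate_of_change = [
    (start, (snapped2[start] - total) / PERIOD)
    for start, total in snapped
    if start in snapped2
]

print(rate_of_change)
```

The last period (900) has no successor, so it drops out of the join, just as it would in the Pig inner JOIN.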
