The DateTime UDFs in PiggyBank may be helpful. See http://github.com/apache/pig/tree/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/datetime/truncate/
They feature date truncation, which can help to group logs into whole time
units: day/hour/minute/second, etc. It sounds like a proper rounding function,
by datetime unit (day, hour, minute, etc.) with a number of splits, would be a
good addition. I'll try to get that in there if people think it's a good idea.
I've been meaning to add DateTime support to Pig, but don't have time in the
near future to do so :/

Russ

On Fri, May 14, 2010 at 9:44 AM, Dan Di Spaltro <[email protected]> wrote:

> Thanks for the response, I appreciate the verbose example.
>
> On Wed, May 12, 2010 at 5:43 PM, hc busy <[email protected]> wrote:
> > Well, time series data usually has regular periods (the time between one
> > timestamp and the next is always 300 seconds), so you can just divide
> > the delta by the constant 300:
> >
> > analysis = FOREACH snapped GENERATE ((double)delta / 300.0) as per_period;
> >
> > I guess what you are suggesting would work:
> >
> > snapped = FOREACH X GENERATE start_of_period, total_so_far;
> > snapped2 = FOREACH X GENERATE start_of_period - 300 as
> >     start_of_previous_period, total_so_far;
> > diff_join = JOIN snapped BY start_of_period, snapped2
> >     BY start_of_previous_period;
> > diff = FOREACH diff_join GENERATE start_of_previous_period as
> >     start_of_period,
> >     (snapped2::total_so_far - snapped::total_so_far)/300 as
> >     rate_of_change;
> >
> > This seems to fit the M-R paradigm: trade efficiency for scalability.
> > You can use this to compute over arbitrarily large datasets just by
> > buying twice the computers you would otherwise need to compute it with
> > an iterator... But remember, you won't need to write *any* unit tests,
> > synchronization, file system, or operating system to make it happen,
> > just the above five lines of code.
>
> This is probably the best possible way of putting it. Thanks for
> the input =).
>
> > Does this sound right to everyone else?
> >
> > On Thu, May 6, 2010 at 10:13 PM, Dan Di Spaltro <[email protected]> wrote:
> >
> >> Right now I have a Pig script to roll up time series data.
> >>
> >> The current format of the data is the following tab-separated value
> >> list:
> >>
> >> ts    service-uuid    service-name    type    value
> >>
> >> So the first step is to take each timestamp and snap it to a period.
> >> For 5-minute rollups I use something like this:
> >>
> >> snapped = FOREACH X GENERATE SnapTs(300, ts) ....
> >>
> >> And then I group, average, and count over that group, which is great
> >> and easy. The next bit is to show the change from 0 -> 5 min, so
> >> basically I want to take point A's avg, subtract it from point B's
> >> avg, and divide by the timestamps to get the rate of change between
> >> the points, but I am not sure how to do that. For instance, one idea
> >> I had was to create another dataset like this:
> >>
> >> previous = FOREACH snapped GENERATE $0 + 300, ....
> >>
> >> GROUP previous BY (...), snapped BY (...)
> >>
> >> But that seems like a waste; I am just having a hard time modeling
> >> that. Any help would be appreciated.
> >>
> >> Best,
> >>
> >> --
> >> Dan Di Spaltro
>
> --
> Dan Di Spaltro
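For anyone following along, here's a small Python sketch of the snap-and-average step discussed in the thread. The SnapTs UDF isn't shown in the thread, so its snap-down semantics, and all the data below, are assumptions for illustration only:

```python
from collections import defaultdict

def snap_ts(period, ts):
    """Round a Unix timestamp down to the start of its period
    (assumed SnapTs-like behavior; the real UDF may differ)."""
    return ts - (ts % period)

# Made-up rows in the thread's format: (ts, service-uuid, service-name, type, value)
rows = [
    (1001, "u1", "web", "latency", 12.0),
    (1150, "u1", "web", "latency", 18.0),
    (1400, "u1", "web", "latency", 30.0),
]

# Group by the snapped timestamp, then average each group,
# like the GROUP / AVG step in the Pig script
groups = defaultdict(list)
for ts, uuid, name, typ, value in rows:
    groups[snap_ts(300, ts)].append(value)

averages = {start: sum(vals) / len(vals) for start, vals in groups.items()}
print(averages)
```

Rows at ts 1001 and 1150 both snap to period start 900, so they average together; the row at 1400 snaps to 1200.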
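And hc busy's self-join for rate of change can be sketched the same way: re-key each period's total to the previous period's start, join on timestamp, and divide the difference by the period length. The data is made up; the names mirror the Pig aliases (start_of_period, total_so_far):

```python
PERIOD = 300  # seconds per period, matching the constant in the Pig script

# (start_of_period, total_so_far) pairs for regular 300-second periods
snapped = [(0, 10.0), (300, 25.0), (600, 45.0), (900, 50.0)]

# snapped2: the same rows re-keyed to the previous period's start,
# like "start_of_period - 300 as start_of_previous_period" in the Pig
snapped2 = {start - PERIOD: total for start, total in snapped}

# Join snapped by start_of_period with snapped2 by start_of_previous_period,
# then (later total - earlier total) / period gives the rate of change
rate_of_change = [
    (start, (snapped2[start] - total) / PERIOD)
    for start, total in snapped
    if start in snapped2
]

print(rate_of_change)
```

The last period (900) has no successor, so it drops out of the join, just as it would in the Pig inner JOIN.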
