Asif,

You have to group tuples into a bag. (In fact, COUNT requires this too. Only DISTINCT doesn't, and that's because DISTINCT is not a built-in function but a whole separate operator; don't worry about it if that doesn't make sense.) Depending on how you define your periods, you may be able to avoid doing a "group all" by grouping on a coarser key instead, for example on time truncated to an hour.
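
To make the "time truncated to an hour" idea concrete, here is a minimal plain-Java sketch of how such a grouping key could be computed. It is not Pig code and not from the thread: the class name TimeBuckets, the methods bucketOf and groupByBucket, and the bucket width of 5.0 are all hypothetical, and the thread does not say what unit the times column uses. The sample times are the ones from the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig: it only illustrates the grouping key.
class TimeBuckets {

    // Truncate a timestamp to the index of the bucket that contains it.
    // Math.floor (rather than a cast) keeps negative times consistent.
    static long bucketOf(double time, double width) {
        if (!(width > 0.0)) {
            throw new IllegalArgumentException("width must be positive");
        }
        return (long) Math.floor(time / width);
    }

    // Group observation times by bucket: the plain-Java analogue of
    // grouping a relation on a truncated time column instead of "group all".
    static Map<Long, List<Double>> groupByBucket(double[] times, double width) {
        Map<Long, List<Double>> groups = new TreeMap<>();
        for (double t : times) {
            groups.computeIfAbsent(bucketOf(t, width), k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample times taken from the original question in this thread.
        double[] times = {2345.59777, 2347.59568, 2351.6075, 2354.59702, 2356.63223};
        // width = 5.0 is an arbitrary value chosen for illustration.
        groupByBucket(times, 5.0).forEach((bucket, ts) ->
            System.out.println(bucket + " -> " + ts));
    }
}
```

Each bucket would then play the role of one bag handed to the UDF, rather than a single bag holding the whole dataset. Whether that is acceptable depends on whether a period search can be run on each time window independently.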
-Dmitriy

On Mon, May 31, 2010 at 7:16 AM, Asif Jan <[email protected]> wrote:

> Thanks,
>
> I was confused about the input to the exec method, i.e. the Tuple. Now I
> understand that each object in the tuple can be of a simple or complex type.
>
> I have one more question though. The only way I was able to make my
> function work was:
>
> grunt> ds = LOAD 'data/timeseries' using PigStorage('\t') as
>        (times:double, mag1:double, err1:double, mag2:double, err2:double);
> grunt> A = group ds all;
> grunt> B = foreach A {result = PeriodSearchFunc(ds); generate
>        flatten(result);};
>
> i.e. I was forced to wrap it in a Bag and then use foreach. Is it possible
> to use it as follows:
>
> grunt> ds = LOAD 'data/timeseries' using PigStorage('\t') as
>        (times:double, mag1:double, err1:double, mag2:double, err2:double);
> grunt> B = PeriodSearchFunc(ds);
>
> (in the same manner as the DISTINCT or COUNT built-ins)
>
> thanks again
>
> On May 29, 2010, at 3:01 AM, Dmitriy Ryaboy wrote:
>
>> Sounds like you want an EvalFunc that returns a Bag of Tuples, with each
>> tuple having 2 fields. Pretty straightforward.
>> You don't have to implement the algebraic interface (or the accumulator
>> interface) -- those are optimizations for working with large datasets, and
>> not required for anything other than scalability.
>>
>> (hc -- chickens won't come out cause pig won't know how to serialize the
>> thing. You have to turn your chicken into a bytearray).
>>
>> -D
>>
>> On Fri, May 28, 2010 at 5:29 PM, hc busy <[email protected]> wrote:
>>
>>> Couldn't you give EvalFunc<any return type> any return type? So you can
>>> just return a Bag that contains tuples of tuples, right? And it's easy
>>> because Tuple is an unparameterized type (and so is Bag), so you'd declare
>>>
>>> class myUdf extends EvalFunc<Bag>{...}
>>>
>>> I haven't tried this, but sometimes I'm tempted to return something
>>> weird like
>>>
>>> EvalFunc<Chicken>
>>>
>>> and see chickens come out of pig. ;-) heheheheeee
>>>
>>> Anyways, in all seriousness, there is a UDF that converts data to a bag
>>> (well, currently a contrib UDF, but it may make it into builtin) that I
>>> wrote called ToBag. Here's the initial declaration for it:
>>>
>>> public class ToBag extends EvalFunc<DataBag>
>>>
>>> Your class would be declared similarly.
>>>
>>> On Fri, May 28, 2010 at 7:50 AM, Asif Jan <[email protected]> wrote:
>>>
>>>> Hello
>>>>
>>>> I need some help to get started with using Pig UDFs.
>>>>
>>>> I have time series data (time, magA, errA, magB, errB) e.g.
>>>>
>>>> (2345.59777,19.875,0.481,20.225,0.482)
>>>> (2347.59568,19.371,0.3,20.227,0.743)
>>>> (2351.6075,19.063,0.193,20.768,1.085)
>>>> (2354.59702,20.689,3.047,20.873,1.758)
>>>> (2356.63223,21.23,3.341,20.562,1.242)
>>>>
>>>> and I need to apply an algorithm that searches for periods in the data.
>>>> The input to the algorithm is the (time, magX, errX) arrays. The algo
>>>> returns a List of all periods found. Each entry in the List is a
>>>> (period_value, period_significance) pair.
>>>>
>>>> How can I wrap that algo as a UDF? Do I have to use algebraic functions
>>>> (I saw that they can only return scalar values)? What I need to
>>>> return from the function is something like
>>>>
>>>> (1000.0,0.57)
>>>> (234, .45)
>>>> (100, 0.023)
>>>> (6, 0.003)
>>>>
>>>> thanks a lot
