Hi Dmitriy,
When I find the start and end interval for each tuple, how do I store it
in memory, or in some file, so that when I read the next tuple I can check
it against my start and end intervals and, if it belongs to the same group,
add it to the bucket? Also, I didn't understand the meaning of "sidefile".
Can you please explain that? Sorry for the trouble.

Thanks,
Dhaval.
On Wed, Nov 25, 2009 at 3:59 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Hi Dhaval,
> First, I want to caution you against doing this :-). You are increasing the
> cardinality of your data quite a bit, as every record gets repeated as many
> times in the output as there are overlapping time periods. I imagine that
> can lead to multiplying the number of tuples by a fairly large factor.
>
> Assuming you ignore this warning, and proceed anyway...
>
> Here's an idea of how you could go about this:
>
> 0. determine needed degree of parallelism, P.
>
> 1. Find timestamps such that they divide your records into P roughly even
> chunks (they may be equidistant if your timestamps are uniformly
> distributed
> along the time range, or they may need to be adjusted to follow the actual
> distribution of the data).
>
> 2. For each tuple, emit it to every interval of the timeline, as described
> by the timestamps from the previous step, that it overlaps, i.e.
> (interval_start <= tuple_end && interval_end >= tuple_start) (this double
> counts if you have 0-width intervals, btw, so be careful there).
>
> 3. Spin up a reducer for each bucket. Perform the grouping in memory by
> streaming tuples sorted by start time; any time you see a tuple whose start
> is after some group's end, you can output that group.
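The streaming/grouping in step 3 can be sketched as follows (a minimal sketch, assuming the containment semantics of the example output further down the thread: each tuple's own interval is a group key, and a group collects every tuple whose interval falls inside it; input must be sorted by start time, longest interval first on ties):

```python
# Sketch of step 3: group keys are the tuples' own (start, end) intervals;
# a group holds every tuple whose interval is contained in the key's.
# Tuples must arrive sorted by (start, -end) for the early-flush rule.
def group_stream(tuples):
    open_groups = {}  # (start, end) -> list of member tuples
    for name, tid, start, end in tuples:
        # Any group whose end is before this start can take no more members.
        for key in [k for k in open_groups if k[1] < start]:
            yield key, open_groups.pop(key)
        open_groups.setdefault((start, end), [])
        # Add this tuple to every open group whose interval contains it.
        for (gs, ge), members in open_groups.items():
            if gs <= start and end <= ge:
                members.append((name, tid, start, end))
    # Flush whatever is still open at end of stream.
    for key, members in open_groups.items():
        yield key, members

rows = [("name1", 1, 1, 4), ("name2", 2, 1, 6), ("name3", 3, 2, 4),
        ("name4", 4, 3, 7), ("name5", 5, 4, 6)]
rows.sort(key=lambda t: (t[2], -t[3]))
for key, members in group_stream(rows):
    print(key, members)
```

With the sample rows from later in the thread, group (1, 6) collects name1, name2, name3, and name5, matching the intended output.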
>
> Approach for implementing this in Pig:
>
> Finding the boundaries can be done by finding the min, max, and count, and
> dividing accordingly. These values can be written into a side file; then
> you can stream the data through a custom binary that performs step 2 by
> reading the side file and outputting each record it sees with the bucket
> number added as a first field. Step 3 can be done by grouping on the first
> field (bucket #) of step 2's output and streaming once more through a
> custom aggregator.
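The step-2 custom binary could look roughly like this (a hypothetical Python STREAM script; the tab-separated field layout, the side-file format, and the half-open bucket convention are all assumptions, not from the original thread):

```python
#!/usr/bin/env python
# Hypothetical step-2 streaming script. Reads bucket boundary timestamps
# from a side file, then tags each stdin record with every bucket whose
# interval it overlaps, so Pig can GROUP on the added first field.
import sys

def load_boundaries(path):
    # One boundary timestamp per line, ascending; bucket i spans
    # [boundaries[i], boundaries[i + 1]).
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def overlapping_buckets(start, end, boundaries):
    # Yield the index of every bucket overlapping [start, end]. Closed
    # comparisons mean a tuple touching an exact boundary lands in both
    # neighbouring buckets -- the double-count caveat mentioned above.
    for i in range(len(boundaries) - 1):
        if boundaries[i] <= end and boundaries[i + 1] >= start:
            yield i

def main(sidefile):
    boundaries = load_boundaries(sidefile)
    for line in sys.stdin:
        name, rec_id, start, end = line.rstrip("\n").split("\t")
        for bucket in overlapping_buckets(int(start), int(end), boundaries):
            # Prepend the bucket number as the new first field.
            print("\t".join([str(bucket), name, rec_id, start, end]))

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```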
>
> This is a good use case that manages to expose a limitation of the UDF
> APIs -- it would be nice to be able to output multiple records per
> processed tuple in exec(), to allow the kind of processing actual Pig
> operators sometimes do, with buffering inputs and the like.
>
> -Dmitriy
>
> On Wed, Nov 25, 2009 at 5:10 PM, dhaval deshpande <
> [email protected]> wrote:
>
> > Hi Zaki,
> > My timestamp field is a chararray, so Pig will not recognize it as a
> > timestamp, right? And it will not take care of it. I did exactly what
> > you had told me before, then ran into the same problem, and decided to
> > write a UDF. But again I had the problem of handling only one tuple at
> > a time, as I mentioned above.
> >
> > Thanks,
> > Dhaval.
> >
> > On Wed, Nov 25, 2009 at 2:57 PM, zaki rahaman <[email protected]
> > >wrote:
> >
> > > It's a lot simpler than that... you simply have to tell Pig to group
> > > your data on the timestamp field... it takes care of the rest.
> > >
> > > something like..
> > >
> > > A = LOAD 'data' AS (name, id, timestamp);
> > > B = GROUP A BY timestamp;
> > >
> > > Yes, it's that easy.
> > >
> > > On Wed, Nov 25, 2009 at 4:53 PM, dhaval deshpande <
> > > [email protected]> wrote:
> > >
> > > > Hi,
> > > >       I am back with a question again :). This time I can explain
> > > > better because I have explored a little more than I did last time :).
> > > > I have three fields in my table: name, id, and the time from and
> > > > time to which the tuple existed in the database. So, for example,
> > > > 1-4 means the tuple existed from 1 sec to 4 sec in the database. So
> > > > my table looks like this.
> > > >
> > > > name1 , 1, 1-4
> > > > name2, 2, 1-6
> > > > name3, 3, 2-4
> > > > name4, 4, 3-7
> > > > name5, 5, 4-6
> > > >
> > > > Now I want to group this table using my timestamp field, and I want
> > > > my intended output to be like this:
> > > >
> > > > (1-4, {(name1,1,1-4),(name3,3,2-4)})
> > > > (1-6, {(name1,1,1-4),(name2,2,1-6),(name3,3,2-4),(name5,5,4-6)})
> > > > (2-4, {(name3, 3, 2-4)})
> > > > (3-7, {(name4,4,3-7),(name5,5,4-6)})
> > > > (4-6, {(name5,5,4-6)})
> > > >
> > > > But the problem I was facing was that when I extend the EvalFunc
> > > > class I work with a single tuple at a time, and I don't have the
> > > > other tuples' information, so I can't iterate through them and check
> > > > whether any other tuple exists between these timestamps and add it
> > > > to the group as well. Any ideas?
> > > >
> > >
> > >
> > >
> > > --
> > > Zaki Rahaman
> > >
> >
>
