heh, I want n*(n-1)/2 too... Maybe someone out there has a UDF that does
this after a group.

:)
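
For what it's worth, here is a minimal sketch of what such a UDF might compute on each grouped bag. It is plain Python, not written against the Pig UDF API, and the names `unordered_pairs` and `include_self` are made up for illustration. Pairs are picked by position in the bag, so no string comparison is needed and tuples with several fields work too. It assumes the rows in the bag are already distinct.

```python
from itertools import combinations, combinations_with_replacement


def unordered_pairs(bag, include_self=False):
    """Return each unordered pair of tuples from `bag` exactly once.

    Pairs are chosen by position in the bag, so no field comparison is
    needed and tuples with any number of fields are handled.
    """
    rows = list(bag)
    pick = combinations_with_replacement if include_self else combinations
    # Concatenate the two tuples, mirroring the flattened rows CROSS emits.
    return [left + right for left, right in pick(rows, 2)]


terms = [("A",), ("B",), ("C",)]
print(unordered_pairs(terms))        # 3 rows: n*(n-1)/2 for n = 3
print(unordered_pairs(terms, True))  # 6 rows: n*(n+1)/2, self-pairs kept
```

With `include_self=True` the two-field input `('A1', 'A2')`, `('B1', 'B2')` gives the three rows from the original question. The catch is that the whole bag has to fit in memory in one place, so this would only help when each group is reasonably small.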

On Wed, Jun 16, 2010 at 8:30 AM, Christian <[email protected]> wrote:

> Thanks hc busy,
>
> > E = foreach D generate (v1<v2?v1:v2) as v1, (v1<v2?v2:v1) as v2;
> >
> > F = distinct E;
> >
>
> Interesting, I didn't think about that.
>
> > However, I think I can see a problem with this as well. If all 'A's are not
> > distinct, then you might need to generate a unique id for each row.
> >
>
> Luckily this is not my case. I'm only correlating different terms.
>
> In any case, what I see is that the CROSS is far too expensive. I don't
> know whether the filter that follows the CROSS runs in the same MR job, or
> whether all the crossed tuples are generated first and then filtered in a
> separate MR job. This has big implications for performance: the CROSS
> generates n^2 tuples, but the permutations I want after filtering number
> n * (n - 1) / 2. I think I should describe this in more detail in another
> thread later on; I'm having some problems with JOINs too because of that.
>
> Thanks again.
>
>
> >
> > On Sat, Jun 12, 2010 at 6:20 AM, Christian <[email protected]> wrote:
> >
> > > Hello, this is my first contact with Pig and its community ;-)
> > >
> > > I need to generate all the possible permutations from a bag.
> > >
> > > Let me explain it with examples:
> > >
> > > A = LOAD 'data' AS f1:chararray;
> > >
> > > DUMP A;
> > > ('A')
> > > ('B')
> > > ('C')
> > >
> > > I can have all the possible combinations easily with CROSS:
> > >
> > > B = FOREACH A GENERATE $0 AS v1;
> > > C = FOREACH A GENERATE $0 AS v2;
> > >
> > > D = CROSS B, C;
> > > DUMP D;
> > > ('A', 'A')
> > > ('A', 'B')
> > > ('A', 'C')
> > > ('B', 'A')
> > > ('B', 'B')
> > > ('B', 'C')
> > > ('C', 'A')
> > > ('C', 'B')
> > > ('C', 'C')
> > >
> > > But what I need are the permutations. The result I want to obtain is
> > > something like:
> > >
> > > DUMP R;
> > > ('A', 'A')
> > > ('A', 'B')
> > > ('A', 'C')
> > > ('B', 'B')
> > > ('B', 'C')
> > > ('C', 'C')
> > >
> > > My first idea to solve that was to generate the CROSS and then FILTER
> > > like:
> > >
> > > R = FILTER D BY $0 < $1;
> > >
> > > It works, but I would like to know if there is a better way to do this
> > > that does not rely on string comparison or assume that only one field
> > > is used. For example, a real scenario would look like:
> > >
> > > DUMP A;
> > > ('A1', 'A2')
> > > ('B1', 'B2')
> > >
> > > DUMP R;
> > > ('A1', 'A2', 'A1', 'A2')
> > > ('A1', 'A2', 'B1', 'B2')
> > > ('B1', 'B2', 'B1', 'B2')
> > >
> > > Thank you in advance.
> > > Christian
> > >
> >
>
