heh, I want n*(n-1)/2 too... Maybe someone out there has an UDF that does this after a group.
:) On Wed, Jun 16, 2010 at 8:30 AM, Christian <[email protected]> wrote: > Thanks hc busy, > > E = foreach D generate (v1<v2?v1:v2) as v1, (v1<v2?v2:v1) as v2; > > F = distinct E; > > > Interesting, I didn't think about that. > > However, I think I can see a problem with this as well. If all 'A's are not > > distinct, then you might need to generate unique Id for each row > > > > Luckly this is not my case. I'm only correlating different terms. > > In any case, what I see is that the CROSS is too much expensive, I don't > know if the following filtering to the CROSS is coupled to the same MR or > first all the crosses are generated and after that in other MR the data is > filtered. It has big implications in the performance, the CROSS generates > n^2 tuples but the permutations that I want after the filtering are n * (n > - > 1) / 2. I think I should expose this in more detail in other thread later > on, I'm having some problems with JOIN's too because of that. > > Thanks again. > > > > > > On Sat, Jun 12, 2010 at 6:20 AM, Christian <[email protected]> wrote: > > > > > Hello, this is my first contact with Pig and its community ;-) > > > > > > I need to generate all the possible permutations from a bag. > > > > > > Let me explain it with examples: > > > > > > A = LOAD 'data' AS f1:chararray; > > > > > > DUMP A; > > > ('A') > > > ('B') > > > ('C') > > > > > > I can have all the possible combinations easily with CROSS: > > > > > > B = FOREACH A GENERATE $0 AS v1; > > > C = FOREACH A GENERATE $0 AS v2; > > > > > > D = CROSS B, C; > > > DUMP D; > > > ('A', 'A') > > > ('A', 'B') > > > ('A', 'C') > > > ('B', 'A') > > > ('B', 'B') > > > ('B', 'C') > > > ('C', 'A') > > > ('C', 'B') > > > ('C', 'C') > > > > > > But what I need are the permutations. The result I want to obtain is > > > something like: > > > > > > DUMP R; > > > ('A', 'A') > > > ('A', 'B') > > > ('A', 'C') > > > ('B', 'B') > > > ('B', 'C') > > > ('C', 'C') > > > > > > My first idea to solve that was to generate de CROSS and then FILTER > > like: > > > > > > R = FILTER D BY $0 < $1; > > > > > > It works but I would like to know if there is a better way to do this > > > without having to use string comparison and assume that only one field > is > > > used. For example a real scenario would look like: > > > > > > DUMP A; > > > ('A1', 'A2') > > > ('B1', 'B2') > > > > > > DUMP R; > > > ('A1', 'A2', 'A1', 'A2') > > > ('A1', 'A2', 'B1', 'B2') > > > ('B1', 'B2', 'B1', 'B2') > > > > > > Thank you in advance. > > > Christian > > > > > >
