Thanks hc busy, E = foreach D generate (v1<v2?v1:v2) as v1, (v1<v2?v2:v1) as v2;
F = distinct E; Interesting, I didn't think about that. However, I think I can see a problem with this as well. If all 'A's are not > distinct, then you might need to generate unique Id for each row > Luckly this is not my case. I'm only correlating different terms. In any case, what I see is that the CROSS is too much expensive, I don't know if the following filtering to the CROSS is coupled to the same MR or first all the crosses are generated and after that in other MR the data is filtered. It has big implications in the performance, the CROSS generates n^2 tuples but the permutations that I want after the filtering are n * (n - 1) / 2. I think I should expose this in more detail in other thread later on, I'm having some problems with JOIN's too because of that. Thanks again. > > On Sat, Jun 12, 2010 at 6:20 AM, Christian <[email protected]> wrote: > > > Hello, this is my first contact with Pig and its community ;-) > > > > I need to generate all the possible permutations from a bag. > > > > Let me explain it with examples: > > > > A = LOAD 'data' AS f1:chararray; > > > > DUMP A; > > ('A') > > ('B') > > ('C') > > > > I can have all the possible combinations easily with CROSS: > > > > B = FOREACH A GENERATE $0 AS v1; > > C = FOREACH A GENERATE $0 AS v2; > > > > D = CROSS B, C; > > DUMP D; > > ('A', 'A') > > ('A', 'B') > > ('A', 'C') > > ('B', 'A') > > ('B', 'B') > > ('B', 'C') > > ('C', 'A') > > ('C', 'B') > > ('C', 'C') > > > > But what I need are the permutations. The result I want to obtain is > > something like: > > > > DUMP R; > > ('A', 'A') > > ('A', 'B') > > ('A', 'C') > > ('B', 'B') > > ('B', 'C') > > ('C', 'C') > > > > My first idea to solve that was to generate de CROSS and then FILTER > like: > > > > R = FILTER D BY $0 < $1; > > > > It works but I would like to know if there is a better way to do this > > without having to use string comparison and assume that only one field is > > used. For example a real scenario would look like: > > > > DUMP A; > > ('A1', 'A2') > > ('B1', 'B2') > > > > DUMP R; > > ('A1', 'A2', 'A1', 'A2') > > ('A1', 'A2', 'B1', 'B2') > > ('B1', 'B2', 'B1', 'B2') > > > > Thank you in advance. > > Christian > > >
