To be specific, he's looking for connected components in the graph. It's not overly easy to do in an ETL tool (or in pig), but if you can wrap the script in a loop it is possible.
There are actually some really interesting parallel algorithms for finding connected components. If you know you are only going to have two keys, it is a bit simpler to code (but not any more efficient). Essentially, you can implement your union-find algorithm as a series of sorts and merges, which on a large parallel system is actually quite fast. Tanton On Thu, Jun 10, 2010 at 3:33 PM, hc busy <[email protected]> wrote: > What's a transitive closure? > > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <[email protected]> wrote: > >> I think what he wants is a transitive closure of the relation, which >> is not achievable in SQL-like languages alone (first order logic >> expressive power). >> I suppose Pig Latin falls in this category. >> >> >> -- Gianmarco >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <[email protected]> wrote: >> > Is this like a tricky interview question? I don't see the pattern between >> > those three numbers you listed and the sample of the table. >> > >> > 770011 770083 524 1e-120 89 12 >> > 770083 770011 494 1e-120 39 100 >> > >> > ahh, I guess these are related because id1=id2 an id2=id1... Here's a >> first >> > pass at the problem. Project: >> > >> > P1 = foreach table generate id1 as id1, id2 as id2, *; >> > P2 = foreach table generate id2 as id1, id1 as id2, *; >> > J = join P1 by (id1, id2), P2 by (id1,id2); >> > >> > and now J contains pairs of rows from original table where id1 and id2 >> are >> > reversed. >> > >> > is this what you want? >> > >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo < >> > [email protected]> wrote: >> > >> >> Hi everyone, today I came across with a particular query that I don't >> know >> >> how to model in PIG. Part of my data looks like this: >> >> >> >> Id1 Id2 Sc Va P1 P2 >> >> --------- --------- ----- --------- ----- ---- >> >> 770011 990201 401 1e-125 100 65 >> >> 990201 770011 440 1e-125 100 42 >> >> 770011 770083 524 1e-120 89 12 >> >> 770083 770011 494 1e-120 39 100 >> >> 990201 770083 341 1e-125 73 41 >> >> 770083 990201 421 1e-125 90 85 >> >> . >> >> . >> >> . >> >> >> >> what I would like to retrieve is something like >> >> this: 770011 990201 770083 >> >> because they are records actually related. >> >> Any kind of ideas are highly appreciated. Thanks in advanced. >> >> >> >> Renato M. >> >> >> > >> >
