It really depends on the depth of your graph. Are you only dealing with connected components of depth 3 or could they be deeper.
For instance, can you have x -> y, y -> z, and z -> a? If so, do you need your record to be x, y, z, a? Or, are you guaranteed that it will always be x -> z and y -> z? I really need more information about your data before I can offer too much advice. On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo <[email protected]> wrote: > Hi everybody, thanks a lot for your responses. > > I am actually not looking for a transitive closure, I am not trying to > "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is > in short terms a transitive closure ^^) because I have the data that proves > it. And yeah I am aware of first logic expressive power, but maybe I will > think about giving an inference engine a try some time in the future. > > I am actually looking at a connected graph-like problem. The sample records > resemble a bidirectional triangle. > > 990201 > / \ > / \ > / \ > / \ > 770011 ------- 770083 > > I tried using a smaller version of the problem by using SQL and got a > horribly huge query which is a non-scalable possibility for me. I have over > 500Gbs in structured files. I used a triple join but I had to retrieve all > existing possibilities. > So that is why I thought on using something like Hc.busy mentioned. > > Queries: idx1, idx2, sc, va, p1, p2 > possibilities: possibs = foreach queries generate idx1, flatten(a triple > self join of the data) > > and by using the flattening command, get all possible combinations, but I am > not sure that would be the correct approach. Anyhow I thought maybe > performing some breadth first search but that would give me an algorithm > O(n2) so )= > Hey Tanton how would you implement and use an union-find structure? Do you > think it is possible with PIG? > > Thanks again. > > > 2010/6/10 Tanton Gibbs <[email protected]> > >> To be specific, he's looking for connected components in the graph. >> >> It's not overly easy to do in an ETL tool (or in pig), but if you can >> wrap the script in a loop it is possible. >> >> There are actually some really interesting parallel algorithms for >> finding connected components. If you know you are only going to have >> two keys, it is a bit simpler to code (but not any more efficient). >> Essentially, you can implement your union-find algorithm as a series >> of sorts and merges, which on a large parallel system is actually >> quite fast. >> >> Tanton >> >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <[email protected]> wrote: >> > What's a transitive closure? >> > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <[email protected]> >> wrote: >> > >> >> I think what he wants is a transitive closure of the relation, which >> >> is not achievable in SQL-like languages alone (first order logic >> >> expressive power). >> >> I suppose Pig Latin falls in this category. >> >> >> >> >> >> -- Gianmarco >> >> >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <[email protected]> wrote: >> >> > Is this like a tricky interview question? I don't see the pattern >> between >> >> > those three numbers you listed and the sample of the table. >> >> > >> >> > 770011 770083 524 1e-120 89 12 >> >> > 770083 770011 494 1e-120 39 100 >> >> > >> >> > ahh, I guess these are related because id1=id2 an id2=id1... Here's a >> >> first >> >> > pass at the problem. Project: >> >> > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *; >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *; >> >> > J = join P1 by (id1, id2), P2 by (id1,id2); >> >> > >> >> > and now J contains pairs of rows from original table where id1 and id2 >> >> are >> >> > reversed. >> >> > >> >> > is this what you want? >> >> > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo < >> >> > [email protected]> wrote: >> >> > >> >> >> Hi everyone, today I came across with a particular query that I don't >> >> know >> >> >> how to model in PIG. Part of my data looks like this: >> >> >> >> >> >> Id1 Id2 Sc Va P1 P2 >> >> >> --------- --------- ----- --------- ----- ---- >> >> >> 770011 990201 401 1e-125 100 65 >> >> >> 990201 770011 440 1e-125 100 42 >> >> >> 770011 770083 524 1e-120 89 12 >> >> >> 770083 770011 494 1e-120 39 100 >> >> >> 990201 770083 341 1e-125 73 41 >> >> >> 770083 990201 421 1e-125 90 85 >> >> >> . >> >> >> . >> >> >> . >> >> >> >> >> >> what I would like to retrieve is something like >> >> >> this: 770011 990201 >> 770083 >> >> >> because they are records actually related. >> >> >> Any kind of ideas are highly appreciated. Thanks in advanced. >> >> >> >> >> >> Renato M. >> >> >> >> >> > >> >> >> > >> >
