Hi everybody, thanks again for the responses. I am really obtaining many ideas (:
About the depth of my data, yeah I am only looking for connected components of depth three which I think will be a great help because my BFS would be limited in the number of iterations. And about the example I must have these records x -> y, y -> z and z -> x in order to output the x, y, z result record. But there can probably be elements who do not close the graph. For example, I might have a record like x -> b, and b doesn't references nobody, so I would have to ignore it. Is that what you meant Tanton? What about a self-join three times?Would that be a bad idea? Thanks again. Renato M. 2010/6/11 Dmitriy Ryaboy <[email protected]> > hc, self-join should just work, but if it doesn't: > > split table into table_2 if 1==1, table3 if 1==1; > > OR > > table_2 = foreach table generate *; > table3 = foreach table generate *; > > AND THEN > > T = join table by id1, table_2 by id2, table_3 by id3 > > -D > > On Fri, Jun 11, 2010 at 10:59 AM, hc busy <[email protected]> wrote: > > > Yeah, that IS hard in pig. I'm not even sure how to do a self-join in > Pig. > > Like you can't really say > > > > T = join Table by id1, Table by id2, Table by id3; > > > > I think PigLatin will complain that it's confused which Table is and > which > > id1 goes with which table. > > > > I had been proposing that we allow PigLatin to allow recursive functions, > > that way we can do loops. But recursive functions doesn't fit in the > > data-flow language paradigm. > > > > But I think people have offered many alternatives in terms of scripting > > language that can wrap PigLatin that has loop and flow control > constructs. > > > > > > On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <[email protected] > > >wrote: > > > > > It really depends on the depth of your graph. Are you only dealing > > > with connected components of depth 3 or could they be deeper. > > > > > > For instance, can you have x -> y, y -> z, and z -> a? If so, do you > > > need your record to be x, y, z, a? Or, are you guaranteed that it > > > will always be x -> z and y -> z? > > > > > > I really need more information about your data before I can offer too > > > much advice. > > > > > > On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo > > > <[email protected]> wrote: > > > > Hi everybody, thanks a lot for your responses. > > > > > > > > I am actually not looking for a transitive closure, I am not trying > to > > > > "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: > that > > > is > > > > in short terms a transitive closure ^^) because I have the data that > > > proves > > > > it. And yeah I am aware of first logic expressive power, but maybe I > > will > > > > think about giving an inference engine a try some time in the future. > > > > > > > > I am actually looking at a connected graph-like problem. The sample > > > records > > > > resemble a bidirectional triangle. > > > > > > > > 990201 > > > > / \ > > > > / \ > > > > / \ > > > > / \ > > > > 770011 ------- 770083 > > > > > > > > I tried using a smaller version of the problem by using SQL and got a > > > > horribly huge query which is a non-scalable possibility for me. I > have > > > over > > > > 500Gbs in structured files. I used a triple join but I had to > retrieve > > > all > > > > existing possibilities. > > > > So that is why I thought on using something like Hc.busy mentioned. > > > > > > > > Queries: idx1, idx2, sc, va, p1, p2 > > > > possibilities: possibs = foreach queries generate idx1, flatten(a > > triple > > > > self join of the data) > > > > > > > > and by using the flattening command, get all possible combinations, > but > > I > > > am > > > > not sure that would be the correct approach. Anyhow I thought maybe > > > > performing some breadth first search but that would give me an > > algorithm > > > > O(n2) so )= > > > > Hey Tanton how would you implement and use an union-find structure? > Do > > > you > > > > think it is possible with PIG? > > > > > > > > Thanks again. > > > > > > > > > > > > 2010/6/10 Tanton Gibbs <[email protected]> > > > > > > > >> To be specific, he's looking for connected components in the graph. > > > >> > > > >> It's not overly easy to do in an ETL tool (or in pig), but if you > can > > > >> wrap the script in a loop it is possible. > > > >> > > > >> There are actually some really interesting parallel algorithms for > > > >> finding connected components. If you know you are only going to > have > > > >> two keys, it is a bit simpler to code (but not any more efficient). > > > >> Essentially, you can implement your union-find algorithm as a series > > > >> of sorts and merges, which on a large parallel system is actually > > > >> quite fast. > > > >> > > > >> Tanton > > > >> > > > >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <[email protected]> wrote: > > > >> > What's a transitive closure? > > > >> > > > > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco < > [email protected] > > > > > > >> wrote: > > > >> > > > > >> >> I think what he wants is a transitive closure of the relation, > > which > > > >> >> is not achievable in SQL-like languages alone (first order logic > > > >> >> expressive power). > > > >> >> I suppose Pig Latin falls in this category. > > > >> >> > > > >> >> > > > >> >> -- Gianmarco > > > >> >> > > > >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <[email protected]> > wrote: > > > >> >> > Is this like a tricky interview question? I don't see the > pattern > > > >> between > > > >> >> > those three numbers you listed and the sample of the table. > > > >> >> > > > > >> >> > 770011 770083 524 1e-120 89 12 > > > >> >> > 770083 770011 494 1e-120 39 100 > > > >> >> > > > > >> >> > ahh, I guess these are related because id1=id2 an id2=id1... > > Here's > > > a > > > >> >> first > > > >> >> > pass at the problem. Project: > > > >> >> > > > > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *; > > > >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *; > > > >> >> > J = join P1 by (id1, id2), P2 by (id1,id2); > > > >> >> > > > > >> >> > and now J contains pairs of rows from original table where id1 > > and > > > id2 > > > >> >> are > > > >> >> > reversed. > > > >> >> > > > > >> >> > is this what you want? > > > >> >> > > > > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo < > > > >> >> > [email protected]> wrote: > > > >> >> > > > > >> >> >> Hi everyone, today I came across with a particular query that > I > > > don't > > > >> >> know > > > >> >> >> how to model in PIG. Part of my data looks like this: > > > >> >> >> > > > >> >> >> Id1 Id2 Sc Va P1 P2 > > > >> >> >> --------- --------- ----- --------- ----- ---- > > > >> >> >> 770011 990201 401 1e-125 100 65 > > > >> >> >> 990201 770011 440 1e-125 100 42 > > > >> >> >> 770011 770083 524 1e-120 89 12 > > > >> >> >> 770083 770011 494 1e-120 39 100 > > > >> >> >> 990201 770083 341 1e-125 73 41 > > > >> >> >> 770083 990201 421 1e-125 90 85 > > > >> >> >> . > > > >> >> >> . > > > >> >> >> . > > > >> >> >> > > > >> >> >> what I would like to retrieve is something like > > > >> >> >> this: 770011 > 990201 > > > >> 770083 > > > >> >> >> because they are records actually related. > > > >> >> >> Any kind of ideas are highly appreciated. Thanks in advanced. > > > >> >> >> > > > >> >> >> Renato M. > > > >> >> >> > > > >> >> > > > > >> >> > > > >> > > > > >> > > > > > > > > > >
