Hi everybody, thanks a lot for your responses.

I am actually not looking for a transitive closure, I am not trying to
"infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that is
in short terms a transitive closure ^^)  because I have the data that proves
it. And yeah I am aware of first logic expressive power, but maybe I will
think about giving an inference engine a try some time in the future.

I am actually looking at a connected graph-like problem. The sample records
resemble a bidirectional triangle.

            990201
         /              \
        /                \
       /                  \
      /                    \
770011  ------- 770083

I tried using a smaller version of the problem by using SQL and got a
horribly huge query which is a non-scalable possibility for me. I have over
500Gbs in structured files. I used a triple join but I had to retrieve all
existing possibilities.
So that is why I thought on using something like Hc.busy mentioned.

Queries: idx1, idx2, sc, va, p1, p2
possibilities: possibs = foreach queries generate idx1, flatten(a triple
self join of the data)

and by using the flattening command, get all possible combinations, but I am
not sure that would be the correct approach. Anyhow I thought maybe
performing some breadth first search but that would give me an algorithm
O(n2)  so  )=
Hey Tanton how would you implement and use an union-find structure? Do you
think it is possible with PIG?

Thanks again.


2010/6/10 Tanton Gibbs <[email protected]>

> To be specific, he's looking for connected components in the graph.
>
> It's not overly easy to do in an ETL tool (or in pig), but if you can
> wrap the script in a loop it is possible.
>
> There are actually some really interesting parallel algorithms for
> finding connected components.  If you know you are only going to have
> two keys, it is a bit simpler to code (but not any more efficient).
> Essentially, you can implement your union-find algorithm as a series
> of sorts and merges, which on a large parallel system is actually
> quite fast.
>
> Tanton
>
> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <[email protected]> wrote:
> > What's a transitive closure?
> >
> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <[email protected]>
> wrote:
> >
> >> I think what he wants is a transitive closure of the relation, which
> >> is not achievable in SQL-like languages alone (first order logic
> >> expressive power).
> >> I suppose Pig Latin falls in this category.
> >>
> >>
> >> -- Gianmarco
> >>
> >> On Thu, Jun 10, 2010 at 19:54, hc busy <[email protected]> wrote:
> >> > Is this like a tricky interview question? I don't see the pattern
> between
> >> > those three numbers you listed and the sample of the table.
> >> >
> >> > 770011 770083 524 1e-120 89 12
> >> > 770083 770011 494 1e-120 39 100
> >> >
> >> > ahh, I guess these are related because id1=id2 an id2=id1... Here's a
> >> first
> >> > pass at the problem. Project:
> >> >
> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> >> >
> >> > and now J contains pairs of rows from original table where id1 and id2
> >> are
> >> > reversed.
> >> >
> >> > is this what you want?
> >> >
> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> >> > [email protected]> wrote:
> >> >
> >> >> Hi everyone, today I came across with a particular query that I don't
> >> know
> >> >> how to model in PIG. Part of my data looks like this:
> >> >>
> >> >> Id1 Id2 Sc Va P1 P2
> >> >> --------- --------- ----- --------- ----- ----
> >> >> 770011 990201 401 1e-125 100 65
> >> >> 990201 770011 440 1e-125 100 42
> >> >> 770011 770083 524 1e-120 89 12
> >> >> 770083 770011 494 1e-120 39 100
> >> >> 990201 770083 341 1e-125 73 41
> >> >> 770083 990201 421 1e-125 90 85
> >> >> .
> >> >> .
> >> >> .
> >> >>
> >> >> what I would like to retrieve is something like
> >> >> this:                                             770011 990201
> 770083
> >> >> because they are records actually related.
> >> >> Any kind of ideas are highly appreciated. Thanks in advanced.
> >> >>
> >> >> Renato M.
> >> >>
> >> >
> >>
> >
>

Reply via email to