hc, self-join should just work, but if it doesn't:

split table into table_2 if 1==1, table3 if 1==1;

OR

table_2 = foreach table generate *;
table3 = foreach table generate *;

AND THEN

T = join table by id1, table_2 by id2, table_3 by id3

-D

On Fri, Jun 11, 2010 at 10:59 AM, hc busy <[email protected]> wrote:

> Yeah, that IS hard in pig. I'm not even sure how to do a self-join in Pig.
> Like you can't really say
>
> T = join Table by id1, Table by id2, Table by id3;
>
> I think PigLatin will complain that it's confused which Table is and which
> id1 goes with which table.
>
> I had been proposing that we allow PigLatin to allow recursive functions,
> that way we can do loops. But recursive functions doesn't fit in the
> data-flow language paradigm.
>
> But I think people have offered many alternatives in terms of scripting
> language that can wrap PigLatin that has loop and flow control constructs.
>
>
> On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <[email protected]
> >wrote:
>
> > It really depends on the depth of your graph.  Are you only dealing
> > with connected components of depth 3 or could they be deeper.
> >
> > For instance, can you have x -> y, y -> z, and z -> a?  If so, do you
> > need your record to be x, y, z, a?  Or, are you guaranteed that it
> > will always be x -> z and y -> z?
> >
> > I really need more information about your data before I can offer too
> > much advice.
> >
> > On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
> > <[email protected]> wrote:
> > > Hi everybody, thanks a lot for your responses.
> > >
> > > I am actually not looking for a transitive closure, I am not trying to
> > > "infer" that given *x -> y* and *y -> z*, then *x -> z *(@hc.busy: that
> > is
> > > in short terms a transitive closure ^^)  because I have the data that
> > proves
> > > it. And yeah I am aware of first logic expressive power, but maybe I
> will
> > > think about giving an inference engine a try some time in the future.
> > >
> > > I am actually looking at a connected graph-like problem. The sample
> > records
> > > resemble a bidirectional triangle.
> > >
> > >            990201
> > >         /              \
> > >        /                \
> > >       /                  \
> > >      /                    \
> > > 770011  ------- 770083
> > >
> > > I tried using a smaller version of the problem by using SQL and got a
> > > horribly huge query which is a non-scalable possibility for me. I have
> > over
> > > 500Gbs in structured files. I used a triple join but I had to retrieve
> > all
> > > existing possibilities.
> > > So that is why I thought on using something like Hc.busy mentioned.
> > >
> > > Queries: idx1, idx2, sc, va, p1, p2
> > > possibilities: possibs = foreach queries generate idx1, flatten(a
> triple
> > > self join of the data)
> > >
> > > and by using the flattening command, get all possible combinations, but
> I
> > am
> > > not sure that would be the correct approach. Anyhow I thought maybe
> > > performing some breadth first search but that would give me an
> algorithm
> > > O(n2)  so  )=
> > > Hey Tanton how would you implement and use an union-find structure? Do
> > you
> > > think it is possible with PIG?
> > >
> > > Thanks again.
> > >
> > >
> > > 2010/6/10 Tanton Gibbs <[email protected]>
> > >
> > >> To be specific, he's looking for connected components in the graph.
> > >>
> > >> It's not overly easy to do in an ETL tool (or in pig), but if you can
> > >> wrap the script in a loop it is possible.
> > >>
> > >> There are actually some really interesting parallel algorithms for
> > >> finding connected components.  If you know you are only going to have
> > >> two keys, it is a bit simpler to code (but not any more efficient).
> > >> Essentially, you can implement your union-find algorithm as a series
> > >> of sorts and merges, which on a large parallel system is actually
> > >> quite fast.
> > >>
> > >> Tanton
> > >>
> > >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <[email protected]> wrote:
> > >> > What's a transitive closure?
> > >> >
> > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <[email protected]
> >
> > >> wrote:
> > >> >
> > >> >> I think what he wants is a transitive closure of the relation,
> which
> > >> >> is not achievable in SQL-like languages alone (first order logic
> > >> >> expressive power).
> > >> >> I suppose Pig Latin falls in this category.
> > >> >>
> > >> >>
> > >> >> -- Gianmarco
> > >> >>
> > >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <[email protected]> wrote:
> > >> >> > Is this like a tricky interview question? I don't see the pattern
> > >> between
> > >> >> > those three numbers you listed and the sample of the table.
> > >> >> >
> > >> >> > 770011 770083 524 1e-120 89 12
> > >> >> > 770083 770011 494 1e-120 39 100
> > >> >> >
> > >> >> > ahh, I guess these are related because id1=id2 an id2=id1...
> Here's
> > a
> > >> >> first
> > >> >> > pass at the problem. Project:
> > >> >> >
> > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> > >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> > >> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> > >> >> >
> > >> >> > and now J contains pairs of rows from original table where id1
> and
> > id2
> > >> >> are
> > >> >> > reversed.
> > >> >> >
> > >> >> > is this what you want?
> > >> >> >
> > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo <
> > >> >> > [email protected]> wrote:
> > >> >> >
> > >> >> >> Hi everyone, today I came across with a particular query that I
> > don't
> > >> >> know
> > >> >> >> how to model in PIG. Part of my data looks like this:
> > >> >> >>
> > >> >> >> Id1 Id2 Sc Va P1 P2
> > >> >> >> --------- --------- ----- --------- ----- ----
> > >> >> >> 770011 990201 401 1e-125 100 65
> > >> >> >> 990201 770011 440 1e-125 100 42
> > >> >> >> 770011 770083 524 1e-120 89 12
> > >> >> >> 770083 770011 494 1e-120 39 100
> > >> >> >> 990201 770083 341 1e-125 73 41
> > >> >> >> 770083 990201 421 1e-125 90 85
> > >> >> >> .
> > >> >> >> .
> > >> >> >> .
> > >> >> >>
> > >> >> >> what I would like to retrieve is something like
> > >> >> >> this:                                             770011 990201
> > >> 770083
> > >> >> >> because they are records actually related.
> > >> >> >> Any kind of ideas are highly appreciated. Thanks in advanced.
> > >> >> >>
> > >> >> >> Renato M.
> > >> >> >>
> > >> >> >
> > >> >>
> > >> >
> > >>
> > >
> >
>

Reply via email to