Re: Help with a tricky query

Renato Marroquín Mogrovejo Tue, 22 Jun 2010 21:11:33 -0700

Hi everybody, sorry for not being able to answer back before, but I have
just been way too busy lately.
Thanks a lot I think that will work, I haven't been able to test yet, but I
am getting there these week, I will post about it later. Thanks again.


Renato M.

2010/6/13 Dmitriy Ryaboy <[email protected]>

> I don't see why not.
>
> If you have (start, end) pairs: (a, b), (b, c), (c, a)
> And let's say there's something non-triangular also: (c, d), (d, e), (e, f)
>
> You want to self join on rel.start == rel.end, which will produce
>
> a, (a, b), (c, a)
> b, (b, c), (a, b)
> d, (d, e), (c, d)
>
> You will now want to pull out start, middle, and end. That's easy -- it's
> just ($2.$0, $0, $1.$1)
>
> So now, after a simple join and a foreach-generate, we have
>
> (c, a, b)
> (a, b, c)
> (c, d, e)
>
> Now we want to match, all segments where the start is equal to the end of
> another segment, and the middle is the beginning of that segment
>
> (c, a), (c, a, b), (a, b, c)
>
> Your triple is either $1 or $2 -- they are essentially the same.
>
> You might be able to do a group instead of a join for the last step, if
> certain conditions hold (for example, you must be guaranteed that this does
> not exist: a->b->c->b->a)
> In that case, you could sort the contents of the triples and group on the
> result, saving only those results that have > 1 entry in the group. This
> would be faster as you would need to shuffle only a single copy of the
> data.
>
> -D
>
> On Sat, Jun 12, 2010 at 10:39 PM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
> > Hi everybody, thanks again for the responses. I am really obtaining many
> > ideas (:
> >
> > About the depth of my data, yeah I am only looking for connected
> components
> > of depth three which I think will be a great help because my BFS would be
> > limited in the number of iterations.
> >
> > And about the example I must have these records x -> y, y -> z and z -> x
> > in
> > order to output the x, y, z result record. But there can probably be
> > elements who do not close the graph. For example, I might have a record
> > like
> > x -> b, and b doesn't references nobody, so I would have to ignore it. Is
> > that what you meant Tanton?
> > What about a self-join three times?Would that be a bad idea?
> > Thanks again.
> >
> > Renato M.
> >
> > 2010/6/11 Dmitriy Ryaboy <[email protected]>
> >
> > > hc, self-join should just work, but if it doesn't:
> > >
> > > split table into table_2 if 1==1, table3 if 1==1;
> > >
> > > OR
> > >
> > > table_2 = foreach table generate *;
> > > table3 = foreach table generate *;
> > >
> > > AND THEN
> > >
> > > T = join table by id1, table_2 by id2, table_3 by id3
> > >
> > > -D
> > >
> > > On Fri, Jun 11, 2010 at 10:59 AM, hc busy <[email protected]> wrote:
> > >
> > > > Yeah, that IS hard in pig. I'm not even sure how to do a self-join in
> > > Pig.
> > > > Like you can't really say
> > > >
> > > > T = join Table by id1, Table by id2, Table by id3;
> > > >
> > > > I think PigLatin will complain that it's confused which Table is and
> > > which
> > > > id1 goes with which table.
> > > >
> > > > I had been proposing that we allow PigLatin to allow recursive
> > functions,
> > > > that way we can do loops. But recursive functions doesn't fit in the
> > > > data-flow language paradigm.
> > > >
> > > > But I think people have offered many alternatives in terms of
> scripting
> > > > language that can wrap PigLatin that has loop and flow control
> > > constructs.
> > > >
> > > >
> > > > On Thu, Jun 10, 2010 at 9:55 PM, Tanton Gibbs <
> [email protected]
> > > > >wrote:
> > > >
> > > > > It really depends on the depth of your graph.  Are you only dealing
> > > > > with connected components of depth 3 or could they be deeper.
> > > > >
> > > > > For instance, can you have x -> y, y -> z, and z -> a?  If so, do
> you
> > > > > need your record to be x, y, z, a?  Or, are you guaranteed that it
> > > > > will always be x -> z and y -> z?
> > > > >
> > > > > I really need more information about your data before I can offer
> too
> > > > > much advice.
> > > > >
> > > > > On Thu, Jun 10, 2010 at 5:38 PM, Renato Marroquín Mogrovejo
> > > > > <[email protected]> wrote:
> > > > > > Hi everybody, thanks a lot for your responses.
> > > > > >
> > > > > > I am actually not looking for a transitive closure, I am not
> trying
> > > to
> > > > > > "infer" that given *x -> y* and *y -> z*, then *x -> z
> *(@hc.busy:
> > > that
> > > > > is
> > > > > > in short terms a transitive closure ^^)  because I have the data
> > that
> > > > > proves
> > > > > > it. And yeah I am aware of first logic expressive power, but
> maybe
> > I
> > > > will
> > > > > > think about giving an inference engine a try some time in the
> > future.
> > > > > >
> > > > > > I am actually looking at a connected graph-like problem. The
> sample
> > > > > records
> > > > > > resemble a bidirectional triangle.
> > > > > >
> > > > > >            990201
> > > > > >         /              \
> > > > > >        /                \
> > > > > >       /                  \
> > > > > >      /                    \
> > > > > > 770011  ------- 770083
> > > > > >
> > > > > > I tried using a smaller version of the problem by using SQL and
> got
> > a
> > > > > > horribly huge query which is a non-scalable possibility for me. I
> > > have
> > > > > over
> > > > > > 500Gbs in structured files. I used a triple join but I had to
> > > retrieve
> > > > > all
> > > > > > existing possibilities.
> > > > > > So that is why I thought on using something like Hc.busy
> mentioned.
> > > > > >
> > > > > > Queries: idx1, idx2, sc, va, p1, p2
> > > > > > possibilities: possibs = foreach queries generate idx1, flatten(a
> > > > triple
> > > > > > self join of the data)
> > > > > >
> > > > > > and by using the flattening command, get all possible
> combinations,
> > > but
> > > > I
> > > > > am
> > > > > > not sure that would be the correct approach. Anyhow I thought
> maybe
> > > > > > performing some breadth first search but that would give me an
> > > > algorithm
> > > > > > O(n2)  so  )=
> > > > > > Hey Tanton how would you implement and use an union-find
> structure?
> > > Do
> > > > > you
> > > > > > think it is possible with PIG?
> > > > > >
> > > > > > Thanks again.
> > > > > >
> > > > > >
> > > > > > 2010/6/10 Tanton Gibbs <[email protected]>
> > > > > >
> > > > > >> To be specific, he's looking for connected components in the
> > graph.
> > > > > >>
> > > > > >> It's not overly easy to do in an ETL tool (or in pig), but if
> you
> > > can
> > > > > >> wrap the script in a loop it is possible.
> > > > > >>
> > > > > >> There are actually some really interesting parallel algorithms
> for
> > > > > >> finding connected components.  If you know you are only going to
> > > have
> > > > > >> two keys, it is a bit simpler to code (but not any more
> > efficient).
> > > > > >> Essentially, you can implement your union-find algorithm as a
> > series
> > > > > >> of sorts and merges, which on a large parallel system is
> actually
> > > > > >> quite fast.
> > > > > >>
> > > > > >> Tanton
> > > > > >>
> > > > > >> On Thu, Jun 10, 2010 at 3:33 PM, hc busy <[email protected]>
> > wrote:
> > > > > >> > What's a transitive closure?
> > > > > >> >
> > > > > >> > On Thu, Jun 10, 2010 at 2:34 PM, Gianmarco <
> > > [email protected]
> > > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> >> I think what he wants is a transitive closure of the
> relation,
> > > > which
> > > > > >> >> is not achievable in SQL-like languages alone (first order
> > logic
> > > > > >> >> expressive power).
> > > > > >> >> I suppose Pig Latin falls in this category.
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> -- Gianmarco
> > > > > >> >>
> > > > > >> >> On Thu, Jun 10, 2010 at 19:54, hc busy <[email protected]>
> > > wrote:
> > > > > >> >> > Is this like a tricky interview question? I don't see the
> > > pattern
> > > > > >> between
> > > > > >> >> > those three numbers you listed and the sample of the table.
> > > > > >> >> >
> > > > > >> >> > 770011 770083 524 1e-120 89 12
> > > > > >> >> > 770083 770011 494 1e-120 39 100
> > > > > >> >> >
> > > > > >> >> > ahh, I guess these are related because id1=id2 an
> id2=id1...
> > > > Here's
> > > > > a
> > > > > >> >> first
> > > > > >> >> > pass at the problem. Project:
> > > > > >> >> >
> > > > > >> >> > P1 = foreach table generate id1 as id1, id2 as id2, *;
> > > > > >> >> > P2 = foreach table generate id2 as id1, id1 as id2, *;
> > > > > >> >> > J = join P1 by (id1, id2), P2 by (id1,id2);
> > > > > >> >> >
> > > > > >> >> > and now J contains pairs of rows from original table where
> > id1
> > > > and
> > > > > id2
> > > > > >> >> are
> > > > > >> >> > reversed.
> > > > > >> >> >
> > > > > >> >> > is this what you want?
> > > > > >> >> >
> > > > > >> >> > On Wed, Jun 9, 2010 at 6:54 PM, Renato Marroquín Mogrovejo
> <
> > > > > >> >> > [email protected]> wrote:
> > > > > >> >> >
> > > > > >> >> >> Hi everyone, today I came across with a particular query
> > that
> > > I
> > > > > don't
> > > > > >> >> know
> > > > > >> >> >> how to model in PIG. Part of my data looks like this:
> > > > > >> >> >>
> > > > > >> >> >> Id1 Id2 Sc Va P1 P2
> > > > > >> >> >> --------- --------- ----- --------- ----- ----
> > > > > >> >> >> 770011 990201 401 1e-125 100 65
> > > > > >> >> >> 990201 770011 440 1e-125 100 42
> > > > > >> >> >> 770011 770083 524 1e-120 89 12
> > > > > >> >> >> 770083 770011 494 1e-120 39 100
> > > > > >> >> >> 990201 770083 341 1e-125 73 41
> > > > > >> >> >> 770083 990201 421 1e-125 90 85
> > > > > >> >> >> .
> > > > > >> >> >> .
> > > > > >> >> >> .
> > > > > >> >> >>
> > > > > >> >> >> what I would like to retrieve is something like
> > > > > >> >> >> this:                                             770011
> > > 990201
> > > > > >> 770083
> > > > > >> >> >> because they are records actually related.
> > > > > >> >> >> Any kind of ideas are highly appreciated. Thanks in
> > advanced.
> > > > > >> >> >>
> > > > > >> >> >> Renato M.
> > > > > >> >> >>
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Help with a tricky query

Reply via email to