Rob, I sometimes get confused about how the fields are named too, especially after a series of joins and groups. The describe command helps, as does the illustrate command, though unfortunately illustrate doesn't support some operators.
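For what it's worth, here is a minimal sketch of how describe can be used to check the field names after the join. It reuses the load statements from the original question; the aliases joined, common_files and deduped are just names picked for illustration, and the exact schema text that describe prints may differ between Pig versions.

```pig
-- Sketch only: paths and schemas are taken from Rob's original snippet.
tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int, filename:chararray);
tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int, filename:chararray);

-- Inner join: only rows whose filename appears in both inputs survive.
joined = JOIN tableOne BY filename, tableTwo BY filename;

-- Print the schema of the joined relation. Fields that exist on both
-- sides are prefixed with the alias they came from, for example
-- tableOne::filename and tableTwo::filename.
DESCRIBE joined;

-- Keep a single filename column, using the prefix to disambiguate.
common_files = FOREACH joined GENERATE tableOne::filename AS filename;

-- Optional: drop duplicate filenames (this may add another MR job).
deduped = DISTINCT common_files;

STORE deduped INTO 'Output.pig' USING PigStorage();
```

Running describe on each intermediate alias is usually the quickest way to see which prefixed names are available before writing the foreach.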
On Tue, Jan 12, 2010 at 3:27 PM, Mridul Muralidharan <[email protected]> wrote:
>
> As a suffix to what Dmitriy described - just add a project to pick the
> columns you need.
> c = join a by filename, b by filename PARALLEL $MY_PARALLELISM;
> --- Please check this syntax though with pig latin docs.
> d = foreach c generate a::filename; --- Or anything else you want to pick.
>
> if you need, just do a distinct of d's output to remove duplicates ...
> though this might result in more MR jobs.
>
>
> - Mridul
>
>
>
> Rob Stewart wrote:
>>
>> Hi, yeah I thought so,
>>
>> the only slightly confusing issue is that the output would be:
>> bar.dat bar.dat
>>
>> ? (i.e. - showing you a.filename b.filename ) ?
>>
>> Rob.
>>
>>
>>
>> 2010/1/12 Dmitriy Ryaboy <[email protected]>
>>
>>> Rob, it's just a join.
>>>
>>> a = load 'rel1' using FooStorage() as (id, filename);
>>> b = load 'rel2' using FooStorage() as (id, filename);
>>> c = join a by filename, b by filename;
>>>
>>> Rows that don't match won't make it.
>>> If you DO want them to make it in, you need to use "outer" for the
>>> relations whose non-matching rows you want retained (the rest of the
>>> fields in the resulting relation will be filled in with nulls).
>>>
>>> Naturally, since Pig can do it, MR can do it.
>>>
>>> -D
>>>
>>> On Tue, Jan 12, 2010 at 2:57 PM, Rob Stewart
>>> <[email protected]> wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> I have a somewhat obvious question, that needs asking (for my sakes).
>>>>
>>>> Pig can do Joins, I realise that. But take for example:
>>>> Table_1
>>>> ----------------------
>>>> | ID |  fileName   |
>>>>    1    foo.dat
>>>>    2    bar.dat
>>>>    3    harry.dat
>>>>
>>>> Table_2
>>>> ----------------------
>>>> | ID |  fileName   |
>>>>    1    tom.dat
>>>>    2    bar.dat
>>>>    3    gamma.dat
>>>>
>>>>
>>>> SQL Syntax for conditional select:
>>>> "select t1.fileName from Table_1 t1, Table_2 t2 where t1.fileName =
>>>> t2.fileName"
>>>>
>>>> Result
>>>> --------
>>>> bar.dat
>>>>
>>>> How is such a query represented in Pig?
>>>> tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int,
>>>> filename:chararray);
>>>> tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int,
>>>> filename:chararray);
>>>> [Now what??]
>>>> STORE query INTO 'Output.pig' USING PigStorage();
>>>>
>>>>
>>>> As a bonus question, can anybody tell me if this sort of conditional select
>>>> query is possible writing in Java MapReduce?
>>>>
>>>> thanks,
>>>>
>>>> Rob Stewart
>>>>
