Hahaha, Thanks for your help, I am not at all confused and when I am, I will come back to this rocking "pig-user" group ;)
On 5/7/10 9:46 AM, "Dmitriy Ryaboy" <[email protected]> wrote: > "clear as mud" generally means totally unclear (mud being, well, > muddy). So now I am confused as to whether I confused you or not :) > > Regarding the relations -- only relations that are stored are actually > materialized; the rest are computed and discarded on the fly. If you > use multi-store, you can write a long script with multiple stores, and > the execution plan will be constructed in such a way that when > possible, work is reused. The grunt shell doesn't give you this, as it > is interactive, and executes dumps and stores as soon as you enter > them -- so it can't delay work until it knows all the things it will > need to compute. > > -Dmitriy > > On Fri, May 7, 2010 at 9:26 AM, Syed Wasti <[email protected]> wrote: >> Clear as mud, nice explanation. Thanks. >> >> Mean while I want to throw another basic question on how relations work, >> After you DUMP or STORE the results, what happens to all the relations >> before DUMP or STORE. In a grunt shell I can still use the relations and do >> something else with them. Are all the relations saved for that session ? >> OR is the data for relations saved under tmp ? >> Need some clarity on this. >> >> >> On 5/6/10 10:05 PM, "Dmitriy Ryaboy" <[email protected]> wrote: >> >>> It's id => group, org_type => research.org_type and dept_type => >>> research.dept_type >>> But that still doesn't get you where you want to go.. read on. >>> >>> Here's a quick explanation. >>> When you group a relation, the result is a new relation with two >>> columns: "group" and the name of the original relation. >>> The group column has the schema of what you grouped by. If you grouped >>> by an integer column, for example, the type will be int. >>> If you grouped by a tuple of several columns -- "foo = group bar by >>> (a, b);" -- the "group" column will be a tuple with two fields, a and >>> b. >>> They can be retrieved by flattening "group", or by directly accessing >>> them: "group.a, group.b". >>> >>> The second column will be named after the original relation, and >>> contain a *bag* of all the rows in the original relation that match >>> the corresponding group. The rows are unaltered -- so they still >>> contain columns you grouped by, for example. The way bags work is, >>> referring to mybag.some_field essentially means "for each tuple in the >>> bag, give me some_field in that tuple". So you can do things like >>> >>> group_counts = foreach grouped_data generate >>> group as key, -- the key you grouped on >>> COUNT(original_data), -- the number of rows that have that key >>> MAX(original_data.cost); -- the maximum cost of items that have that key >>> >>> Note that all the functions in this example are aggregates. That's >>> because they are things we can do to a *collection of values*. It >>> doesn't make sense to do something like what you are trying -- check >>> if some field is equal to S, for example. That's because there may be >>> multiple rows in your original data that have the same id, but >>> different org_type. What can pig give you when it encounters a row >>> like below (I represent it with field names for clarity): >>> >>> ( group=1, { ( id=1, org_type='S', cnt=3 ), ( id=1, org_type='L", cnt=4)} >>> >>> Should it interpret org_type as S, or L? >>> >>> I suspect that what you actually want is something like this: >>> >>> -- load >>> research = LOAD 'count' >>> USING PigStorage('\t') >>> as (id: long, org_type: chararray, dept_type: int, >>> cnt: int, cnt_distinct: int); >>> >>> -- break out the counts >>> research = FOREACH research >>> GENERATE id, >>> (org_type == 'S' AND dept_type == 1?cnt:0) AS sale, >>> (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx, >>> (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp, >>> (org_type == 'C' AND dept_type == 4?cnt:0) AS >>> acc_grp_manual, >>> (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS >>> sale_distinct, >>> (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS >>> saleEx_distinct, >>> (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS >>> acc_grp_distinct, >>> (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS >>> accM_grp_distinct; >>> >>> -- group and generate >>> grouped = group research by id; >>> final_data = FOREACH grouped GENERATE group as id, SUM(research.sale) >>> as total_sales, SUM(research.saleEx) as total_saleEx, .....; >>> >>> Or perhaps you want the pivot of that: >>> >>> research = LOAD.... >>> grouped = group research by (id, org_type, dept_type); >>> final_data = FOREACH grouped GENERATE >>> FLATTEN(group) as (id, org_type, dept_type), >>> (group.org_type == 'S' AND group.dept_type == 1? SUM(research.cnt) :0) AS >>> sale, >>> ..... >>> >>> Hope this helps. >>> >>> -Dmitriy >>> >>> On Thu, May 6, 2010 at 9:33 PM, Syed Wasti <[email protected]> wrote: >>>> Hi, >>>> Trying to make my script execute happily, it wont stop throwing errors. >>>> It works if I don¹t group my data. But once I group, >>>> It starts with; >>>> ERROR 1000: Error during parsing. Invalid alias: id in {group: >>>> long,edu_affl: {id >>>> I got rid of it, then it complains about Invalid alias: org_type, >>>> It keeps going and at the end it complains about mismatch in comparison, as >>>> the left side is a bag and right side is chararray. How ? >>>> >>>> Not sure, where I am going wrong, need help; >>>> >>>> >>>> research = LOAD 'count' >>>> USING PigStorage('\t') >>>> as (id: long, org_type: chararray, dept_type: int, >>>> cnt: int, cnt_distinct: int); >>>> >>>> final_grp = GROUP research BY id; >>>> >>>> Final_table = FOREACH final_grp >>>> GENERATE id, >>>> (org_type == 'S' AND dept_type == 1?cnt:0) AS sale, >>>> (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx, >>>> (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp, >>>> (org_type == 'C' AND dept_type == 4?cnt:0) AS >>>> acc_grp_manual, >>>> (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS >>>> sale_distinct, >>>> (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS >>>> saleEx_distinct, >>>> (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS >>>> acc_grp_distinct, >>>> (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS >>>> accM_grp_distinct; >>>> >>>> STORE final_table INTO 'final_count' USING PigStorage(); >>>> >>>> Regards >>>> Syed >>>> >>> >> >> >> >
