Clear as mud, nice explanation. Thanks. Mean while I want to throw another basic question on how relations work, After you DUMP or STORE the results, what happens to all the relations before DUMP or STORE. In a grunt shell I can still use the relations and do something else with them. Are all the relations saved for that session ? OR is the data for relations saved under tmp ? Need some clarity on this.
On 5/6/10 10:05 PM, "Dmitriy Ryaboy" <[email protected]> wrote: > It's id => group, org_type => research.org_type and dept_type => > research.dept_type > But that still doesn't get you where you want to go.. read on. > > Here's a quick explanation. > When you group a relation, the result is a new relation with two > columns: "group" and the name of the original relation. > The group column has the schema of what you grouped by. If you grouped > by an integer column, for example, the type will be int. > If you grouped by a tuple of several columns -- "foo = group bar by > (a, b);" -- the "group" column will be a tuple with two fields, a and > b. > They can be retrieved by flattening "group", or by directly accessing > them: "group.a, group.b". > > The second column will be named after the original relation, and > contain a *bag* of all the rows in the original relation that match > the corresponding group. The rows are unaltered -- so they still > contain columns you grouped by, for example. The way bags work is, > referring to mybag.some_field essentially means "for each tuple in the > bag, give me some_field in that tuple". So you can do things like > > group_counts = foreach grouped_data generate > group as key, -- the key you grouped on > COUNT(original_data), -- the number of rows that have that key > MAX(original_data.cost); -- the maximum cost of items that have that key > > Note that all the functions in this example are aggregates. That's > because they are things we can do to a *collection of values*. It > doesn't make sense to do something like what you are trying -- check > if some field is equal to S, for example. That's because there may be > multiple rows in your original data that have the same id, but > different org_type. What can pig give you when it encounters a row > like below (I represent it with field names for clarity): > > ( group=1, { ( id=1, org_type='S', cnt=3 ), ( id=1, org_type='L", cnt=4)} > > Should it interpret org_type as S, or L? > > I suspect that what you actually want is something like this: > > -- load > research = LOAD 'count' > USING PigStorage('\t') > as (id: long, org_type: chararray, dept_type: int, > cnt: int, cnt_distinct: int); > > -- break out the counts > research = FOREACH research > GENERATE id, > (org_type == 'S' AND dept_type == 1?cnt:0) AS sale, > (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx, > (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp, > (org_type == 'C' AND dept_type == 4?cnt:0) AS acc_grp_manual, > (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS > sale_distinct, > (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS > saleEx_distinct, > (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS > acc_grp_distinct, > (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS > accM_grp_distinct; > > -- group and generate > grouped = group research by id; > final_data = FOREACH grouped GENERATE group as id, SUM(research.sale) > as total_sales, SUM(research.saleEx) as total_saleEx, .....; > > Or perhaps you want the pivot of that: > > research = LOAD.... > grouped = group research by (id, org_type, dept_type); > final_data = FOREACH grouped GENERATE > FLATTEN(group) as (id, org_type, dept_type), > (group.org_type == 'S' AND group.dept_type == 1? SUM(research.cnt) :0) AS > sale, > ..... > > Hope this helps. > > -Dmitriy > > On Thu, May 6, 2010 at 9:33 PM, Syed Wasti <[email protected]> wrote: >> Hi, >> Trying to make my script execute happily, it wont stop throwing errors. >> It works if I don¹t group my data. But once I group, >> It starts with; >> ERROR 1000: Error during parsing. Invalid alias: id in {group: >> long,edu_affl: {id >> I got rid of it, then it complains about Invalid alias: org_type, >> It keeps going and at the end it complains about mismatch in comparison, as >> the left side is a bag and right side is chararray. How ? >> >> Not sure, where I am going wrong, need help; >> >> >> research = LOAD 'count' >> USING PigStorage('\t') >> as (id: long, org_type: chararray, dept_type: int, >> cnt: int, cnt_distinct: int); >> >> final_grp = GROUP research BY id; >> >> Final_table = FOREACH final_grp >> GENERATE id, >> (org_type == 'S' AND dept_type == 1?cnt:0) AS sale, >> (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx, >> (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp, >> (org_type == 'C' AND dept_type == 4?cnt:0) AS >> acc_grp_manual, >> (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS >> sale_distinct, >> (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS >> saleEx_distinct, >> (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS >> acc_grp_distinct, >> (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS >> accM_grp_distinct; >> >> STORE final_table INTO 'final_count' USING PigStorage(); >> >> Regards >> Syed >> >
