"clear as mud" generally means totally unclear (mud being, well, muddy). So now I am confused as to whether I confused you or not :)
Regarding the relations -- only relations that are stored are actually materialized; the rest are computed and discarded on the fly. If you use multi-store, you can write a long script with multiple stores, and the execution plan will be constructed in such a way that when possible, work is reused. The grunt shell doesn't give you this, as it is interactive, and executes dumps and stores as soon as you enter them -- so it can't delay work until it knows all the things it will need to compute. -Dmitriy On Fri, May 7, 2010 at 9:26 AM, Syed Wasti <[email protected]> wrote: > Clear as mud, nice explanation. Thanks. > > Mean while I want to throw another basic question on how relations work, > After you DUMP or STORE the results, what happens to all the relations > before DUMP or STORE. In a grunt shell I can still use the relations and do > something else with them. Are all the relations saved for that session ? > OR is the data for relations saved under tmp ? > Need some clarity on this. > > > On 5/6/10 10:05 PM, "Dmitriy Ryaboy" <[email protected]> wrote: > >> It's id => group, org_type => research.org_type and dept_type => >> research.dept_type >> But that still doesn't get you where you want to go.. read on. >> >> Here's a quick explanation. >> When you group a relation, the result is a new relation with two >> columns: "group" and the name of the original relation. >> The group column has the schema of what you grouped by. If you grouped >> by an integer column, for example, the type will be int. >> If you grouped by a tuple of several columns -- "foo = group bar by >> (a, b);" -- the "group" column will be a tuple with two fields, a and >> b. >> They can be retrieved by flattening "group", or by directly accessing >> them: "group.a, group.b". >> >> The second column will be named after the original relation, and >> contain a *bag* of all the rows in the original relation that match >> the corresponding group. The rows are unaltered -- so they still >> contain columns you grouped by, for example. The way bags work is, >> referring to mybag.some_field essentially means "for each tuple in the >> bag, give me some_field in that tuple". So you can do things like >> >> group_counts = foreach grouped_data generate >> group as key, -- the key you grouped on >> COUNT(original_data), -- the number of rows that have that key >> MAX(original_data.cost); -- the maximum cost of items that have that key >> >> Note that all the functions in this example are aggregates. That's >> because they are things we can do to a *collection of values*. It >> doesn't make sense to do something like what you are trying -- check >> if some field is equal to S, for example. That's because there may be >> multiple rows in your original data that have the same id, but >> different org_type. What can pig give you when it encounters a row >> like below (I represent it with field names for clarity): >> >> ( group=1, { ( id=1, org_type='S', cnt=3 ), ( id=1, org_type='L", cnt=4)} >> >> Should it interpret org_type as S, or L? >> >> I suspect that what you actually want is something like this: >> >> -- load >> research = LOAD 'count' >> USING PigStorage('\t') >> as (id: long, org_type: chararray, dept_type: int, >> cnt: int, cnt_distinct: int); >> >> -- break out the counts >> research = FOREACH research >> GENERATE id, >> (org_type == 'S' AND dept_type == 1?cnt:0) AS sale, >> (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx, >> (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp, >> (org_type == 'C' AND dept_type == 4?cnt:0) AS acc_grp_manual, >> (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS >> sale_distinct, >> (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS >> saleEx_distinct, >> (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS >> acc_grp_distinct, >> (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS >> accM_grp_distinct; >> >> -- group and generate >> grouped = group research by id; >> final_data = FOREACH grouped GENERATE group as id, SUM(research.sale) >> as total_sales, SUM(research.saleEx) as total_saleEx, .....; >> >> Or perhaps you want the pivot of that: >> >> research = LOAD.... >> grouped = group research by (id, org_type, dept_type); >> final_data = FOREACH grouped GENERATE >> FLATTEN(group) as (id, org_type, dept_type), >> (group.org_type == 'S' AND group.dept_type == 1? SUM(research.cnt) :0) AS >> sale, >> ..... >> >> Hope this helps. >> >> -Dmitriy >> >> On Thu, May 6, 2010 at 9:33 PM, Syed Wasti <[email protected]> wrote: >>> Hi, >>> Trying to make my script execute happily, it wont stop throwing errors. >>> It works if I don¹t group my data. But once I group, >>> It starts with; >>> ERROR 1000: Error during parsing. Invalid alias: id in {group: >>> long,edu_affl: {id >>> I got rid of it, then it complains about Invalid alias: org_type, >>> It keeps going and at the end it complains about mismatch in comparison, as >>> the left side is a bag and right side is chararray. How ? >>> >>> Not sure, where I am going wrong, need help; >>> >>> >>> research = LOAD 'count' >>> USING PigStorage('\t') >>> as (id: long, org_type: chararray, dept_type: int, >>> cnt: int, cnt_distinct: int); >>> >>> final_grp = GROUP research BY id; >>> >>> Final_table = FOREACH final_grp >>> GENERATE id, >>> (org_type == 'S' AND dept_type == 1?cnt:0) AS sale, >>> (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx, >>> (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp, >>> (org_type == 'C' AND dept_type == 4?cnt:0) AS >>> acc_grp_manual, >>> (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS >>> sale_distinct, >>> (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS >>> saleEx_distinct, >>> (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS >>> acc_grp_distinct, >>> (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS >>> accM_grp_distinct; >>> >>> STORE final_table INTO 'final_count' USING PigStorage(); >>> >>> Regards >>> Syed >>> >> > > >
