Clear as mud, nice explanation. Thanks.

Mean while I want to throw another basic question on how relations work,
After you DUMP or STORE the results, what happens to all the relations
before DUMP or STORE. In a grunt shell I can still use the relations and do
something else with them. Are all the relations saved for that session ?
OR is the data for relations saved under tmp ?
Need some clarity on this.


On 5/6/10 10:05 PM, "Dmitriy Ryaboy" <[email protected]> wrote:

> It's id => group, org_type => research.org_type and dept_type =>
> research.dept_type
> But that still doesn't get you where you want to go.. read on.
> 
> Here's a quick explanation.
> When you group a relation, the result is a new relation with two
> columns: "group" and the name of the original relation.
> The group column has the schema of what you grouped by. If you grouped
> by an integer column, for example, the type will be int.
> If you grouped by a tuple of several columns -- "foo = group bar by
> (a, b);" -- the "group" column will be a tuple with two fields, a and
> b.
> They can be retrieved by flattening "group", or by directly accessing
> them: "group.a, group.b".
> 
> The second column will be named after the original relation, and
> contain a *bag* of all the rows in the original relation that match
> the corresponding group. The rows are unaltered -- so they still
> contain columns you grouped by, for example.  The way bags work is,
> referring to mybag.some_field essentially means "for each tuple in the
> bag, give me some_field in that tuple".  So you can do things like
> 
> group_counts = foreach grouped_data generate
>   group as key,  -- the key you grouped on
>   COUNT(original_data), -- the number of rows that have that key
>   MAX(original_data.cost); -- the maximum cost of items that have that key
> 
> Note that all the functions in this example are aggregates. That's
> because they are things we can do to a *collection of values*. It
> doesn't make sense to do something like what you are trying -- check
> if some field is equal to S, for example.  That's because there may be
> multiple rows in your original data that have the same id, but
> different org_type. What can pig give you when it encounters a row
> like below (I represent it with field names for clarity):
> 
> ( group=1, { (  id=1, org_type='S', cnt=3 ), ( id=1, org_type='L", cnt=4)}
> 
> Should it interpret org_type as S, or L?
> 
> I suspect that what you actually want is something like this:
> 
> -- load
> research =      LOAD 'count'
>                 USING PigStorage('\t')
>                 as (id: long, org_type: chararray, dept_type: int,
>                 cnt: int, cnt_distinct: int);
> 
> -- break out the counts
> research = FOREACH research
>                 GENERATE id,
>                 (org_type == 'S' AND dept_type == 1?cnt:0) AS sale,
>                 (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx,
>                 (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp,
>                 (org_type == 'C' AND dept_type == 4?cnt:0) AS acc_grp_manual,
>                 (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS
> sale_distinct,
>                 (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS
> saleEx_distinct,
>                 (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS
> acc_grp_distinct,
>                 (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS
> accM_grp_distinct;
> 
> -- group and generate
> grouped = group research by id;
> final_data = FOREACH grouped GENERATE group as id, SUM(research.sale)
> as total_sales, SUM(research.saleEx) as total_saleEx, .....;
> 
> Or perhaps you want the pivot of that:
> 
> research = LOAD....
> grouped = group research by (id, org_type, dept_type);
> final_data = FOREACH grouped GENERATE
>   FLATTEN(group) as (id, org_type, dept_type),
>  (group.org_type == 'S' AND group.dept_type == 1? SUM(research.cnt) :0) AS
> sale,
> .....
> 
> Hope this helps.
> 
> -Dmitriy
> 
> On Thu, May 6, 2010 at 9:33 PM, Syed Wasti <[email protected]> wrote:
>> Hi,
>> Trying to make my script execute happily, it wont stop throwing errors.
>> It works if I don¹t group my data. But once I group,
>> It starts with;
>> ERROR 1000: Error during parsing. Invalid alias: id  in {group:
>> long,edu_affl: {id
>> I got rid of it,  then it complains about Invalid alias: org_type,
>> It keeps going and at the end it complains about mismatch in comparison, as
>> the left side is a bag and right side is chararray. How ?
>> 
>> Not sure, where I am going wrong, need help;
>> 
>> 
>> research =      LOAD 'count'
>>                USING PigStorage('\t')
>>                as (id: long, org_type: chararray, dept_type: int,
>>                cnt: int, cnt_distinct: int);
>> 
>> final_grp =     GROUP research BY id;
>> 
>> Final_table =   FOREACH final_grp
>>                GENERATE id,
>>                (org_type == 'S' AND dept_type == 1?cnt:0) AS sale,
>>                (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx,
>>                (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp,
>>                (org_type == 'C' AND dept_type == 4?cnt:0) AS
>> acc_grp_manual,
>>                (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS
>> sale_distinct,
>>                (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS
>> saleEx_distinct,
>>                (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS
>> acc_grp_distinct,
>>                (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS
>> accM_grp_distinct;
>> 
>> STORE final_table INTO 'final_count' USING PigStorage();
>> 
>> Regards
>> Syed
>> 
> 


Reply via email to