Re: GROUP Error

Syed Wasti Fri, 07 May 2010 11:05:32 -0700

Hahaha, Thanks for your help, I am not at all confused and when I am, I will
come back to this rocking "pig-user" group ;)



On 5/7/10 9:46 AM, "Dmitriy Ryaboy" <[email protected]> wrote:

> "clear as mud" generally means totally unclear (mud being, well,
> muddy). So now I am confused as to whether I confused you or not :)
> 
> Regarding the relations -- only relations that are stored are actually
> materialized; the rest are computed and discarded on the fly. If you
> use multi-store, you can write a long script with multiple stores, and
> the execution plan will be constructed in such a way that when
> possible, work is reused. The grunt shell doesn't give you this, as it
> is interactive, and executes dumps and stores as soon as you enter
> them -- so it can't delay work until it knows all the things it will
> need to compute.
> 
> -Dmitriy
> 
> On Fri, May 7, 2010 at 9:26 AM, Syed Wasti <[email protected]> wrote:
>> Clear as mud, nice explanation. Thanks.
>> 
>> Mean while I want to throw another basic question on how relations work,
>> After you DUMP or STORE the results, what happens to all the relations
>> before DUMP or STORE. In a grunt shell I can still use the relations and do
>> something else with them. Are all the relations saved for that session ?
>> OR is the data for relations saved under tmp ?
>> Need some clarity on this.
>> 
>> 
>> On 5/6/10 10:05 PM, "Dmitriy Ryaboy" <[email protected]> wrote:
>> 
>>> It's id => group, org_type => research.org_type and dept_type =>
>>> research.dept_type
>>> But that still doesn't get you where you want to go.. read on.
>>> 
>>> Here's a quick explanation.
>>> When you group a relation, the result is a new relation with two
>>> columns: "group" and the name of the original relation.
>>> The group column has the schema of what you grouped by. If you grouped
>>> by an integer column, for example, the type will be int.
>>> If you grouped by a tuple of several columns -- "foo = group bar by
>>> (a, b);" -- the "group" column will be a tuple with two fields, a and
>>> b.
>>> They can be retrieved by flattening "group", or by directly accessing
>>> them: "group.a, group.b".
>>> 
>>> The second column will be named after the original relation, and
>>> contain a *bag* of all the rows in the original relation that match
>>> the corresponding group. The rows are unaltered -- so they still
>>> contain columns you grouped by, for example.  The way bags work is,
>>> referring to mybag.some_field essentially means "for each tuple in the
>>> bag, give me some_field in that tuple".  So you can do things like
>>> 
>>> group_counts = foreach grouped_data generate
>>>   group as key,  -- the key you grouped on
>>>   COUNT(original_data), -- the number of rows that have that key
>>>   MAX(original_data.cost); -- the maximum cost of items that have that key
>>> 
>>> Note that all the functions in this example are aggregates. That's
>>> because they are things we can do to a *collection of values*. It
>>> doesn't make sense to do something like what you are trying -- check
>>> if some field is equal to S, for example.  That's because there may be
>>> multiple rows in your original data that have the same id, but
>>> different org_type. What can pig give you when it encounters a row
>>> like below (I represent it with field names for clarity):
>>> 
>>> ( group=1, { (  id=1, org_type='S', cnt=3 ), ( id=1, org_type='L", cnt=4)}
>>> 
>>> Should it interpret org_type as S, or L?
>>> 
>>> I suspect that what you actually want is something like this:
>>> 
>>> -- load
>>> research =      LOAD 'count'
>>>                 USING PigStorage('\t')
>>>                 as (id: long, org_type: chararray, dept_type: int,
>>>                 cnt: int, cnt_distinct: int);
>>> 
>>> -- break out the counts
>>> research = FOREACH research
>>>                 GENERATE id,
>>>                 (org_type == 'S' AND dept_type == 1?cnt:0) AS sale,
>>>                 (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx,
>>>                 (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp,
>>>                 (org_type == 'C' AND dept_type == 4?cnt:0) AS
>>> acc_grp_manual,
>>>                 (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS
>>> sale_distinct,
>>>                 (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS
>>> saleEx_distinct,
>>>                 (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS
>>> acc_grp_distinct,
>>>                 (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS
>>> accM_grp_distinct;
>>> 
>>> -- group and generate
>>> grouped = group research by id;
>>> final_data = FOREACH grouped GENERATE group as id, SUM(research.sale)
>>> as total_sales, SUM(research.saleEx) as total_saleEx, .....;
>>> 
>>> Or perhaps you want the pivot of that:
>>> 
>>> research = LOAD....
>>> grouped = group research by (id, org_type, dept_type);
>>> final_data = FOREACH grouped GENERATE
>>>   FLATTEN(group) as (id, org_type, dept_type),
>>>  (group.org_type == 'S' AND group.dept_type == 1? SUM(research.cnt) :0) AS
>>> sale,
>>> .....
>>> 
>>> Hope this helps.
>>> 
>>> -Dmitriy
>>> 
>>> On Thu, May 6, 2010 at 9:33 PM, Syed Wasti <[email protected]> wrote:
>>>> Hi,
>>>> Trying to make my script execute happily, it wont stop throwing errors.
>>>> It works if I don¹t group my data. But once I group,
>>>> It starts with;
>>>> ERROR 1000: Error during parsing. Invalid alias: id  in {group:
>>>> long,edu_affl: {id
>>>> I got rid of it,  then it complains about Invalid alias: org_type,
>>>> It keeps going and at the end it complains about mismatch in comparison, as
>>>> the left side is a bag and right side is chararray. How ?
>>>> 
>>>> Not sure, where I am going wrong, need help;
>>>> 
>>>> 
>>>> research =      LOAD 'count'
>>>>                USING PigStorage('\t')
>>>>                as (id: long, org_type: chararray, dept_type: int,
>>>>                cnt: int, cnt_distinct: int);
>>>> 
>>>> final_grp =     GROUP research BY id;
>>>> 
>>>> Final_table =   FOREACH final_grp
>>>>                GENERATE id,
>>>>                (org_type == 'S' AND dept_type == 1?cnt:0) AS sale,
>>>>                (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx,
>>>>                (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp,
>>>>                (org_type == 'C' AND dept_type == 4?cnt:0) AS
>>>> acc_grp_manual,
>>>>                (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS
>>>> sale_distinct,
>>>>                (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS
>>>> saleEx_distinct,
>>>>                (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS
>>>> acc_grp_distinct,
>>>>                (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS
>>>> accM_grp_distinct;
>>>> 
>>>> STORE final_table INTO 'final_count' USING PigStorage();
>>>> 
>>>> Regards
>>>> Syed
>>>> 
>>> 
>> 
>> 
>> 
>

Re: GROUP Error

Reply via email to