Re: GROUP Error

Dmitriy Ryaboy Fri, 07 May 2010 09:47:07 -0700

"clear as mud" generally means totally unclear (mud being, well,
muddy). So now I am confused as to whether I confused you or not :)


Regarding the relations -- only relations that are stored are actually
materialized; the rest are computed and discarded on the fly. If you
use multi-store, you can write a long script with multiple stores, and
the execution plan will be constructed in such a way that when
possible, work is reused. The grunt shell doesn't give you this, as it
is interactive, and executes dumps and stores as soon as you enter
them -- so it can't delay work until it knows all the things it will
need to compute.

-Dmitriy

On Fri, May 7, 2010 at 9:26 AM, Syed Wasti <[email protected]> wrote:
> Clear as mud, nice explanation. Thanks.
>
> Mean while I want to throw another basic question on how relations work,
> After you DUMP or STORE the results, what happens to all the relations
> before DUMP or STORE. In a grunt shell I can still use the relations and do
> something else with them. Are all the relations saved for that session ?
> OR is the data for relations saved under tmp ?
> Need some clarity on this.
>
>
> On 5/6/10 10:05 PM, "Dmitriy Ryaboy" <[email protected]> wrote:
>
>> It's id => group, org_type => research.org_type and dept_type =>
>> research.dept_type
>> But that still doesn't get you where you want to go.. read on.
>>
>> Here's a quick explanation.
>> When you group a relation, the result is a new relation with two
>> columns: "group" and the name of the original relation.
>> The group column has the schema of what you grouped by. If you grouped
>> by an integer column, for example, the type will be int.
>> If you grouped by a tuple of several columns -- "foo = group bar by
>> (a, b);" -- the "group" column will be a tuple with two fields, a and
>> b.
>> They can be retrieved by flattening "group", or by directly accessing
>> them: "group.a, group.b".
>>
>> The second column will be named after the original relation, and
>> contain a *bag* of all the rows in the original relation that match
>> the corresponding group. The rows are unaltered -- so they still
>> contain columns you grouped by, for example.  The way bags work is,
>> referring to mybag.some_field essentially means "for each tuple in the
>> bag, give me some_field in that tuple".  So you can do things like
>>
>> group_counts = foreach grouped_data generate
>>   group as key,  -- the key you grouped on
>>   COUNT(original_data), -- the number of rows that have that key
>>   MAX(original_data.cost); -- the maximum cost of items that have that key
>>
>> Note that all the functions in this example are aggregates. That's
>> because they are things we can do to a *collection of values*. It
>> doesn't make sense to do something like what you are trying -- check
>> if some field is equal to S, for example.  That's because there may be
>> multiple rows in your original data that have the same id, but
>> different org_type. What can pig give you when it encounters a row
>> like below (I represent it with field names for clarity):
>>
>> ( group=1, { (  id=1, org_type='S', cnt=3 ), ( id=1, org_type='L", cnt=4)}
>>
>> Should it interpret org_type as S, or L?
>>
>> I suspect that what you actually want is something like this:
>>
>> -- load
>> research =      LOAD 'count'
>>                 USING PigStorage('\t')
>>                 as (id: long, org_type: chararray, dept_type: int,
>>                 cnt: int, cnt_distinct: int);
>>
>> -- break out the counts
>> research = FOREACH research
>>                 GENERATE id,
>>                 (org_type == 'S' AND dept_type == 1?cnt:0) AS sale,
>>                 (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx,
>>                 (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp,
>>                 (org_type == 'C' AND dept_type == 4?cnt:0) AS acc_grp_manual,
>>                 (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS
>> sale_distinct,
>>                 (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS
>> saleEx_distinct,
>>                 (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS
>> acc_grp_distinct,
>>                 (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS
>> accM_grp_distinct;
>>
>> -- group and generate
>> grouped = group research by id;
>> final_data = FOREACH grouped GENERATE group as id, SUM(research.sale)
>> as total_sales, SUM(research.saleEx) as total_saleEx, .....;
>>
>> Or perhaps you want the pivot of that:
>>
>> research = LOAD....
>> grouped = group research by (id, org_type, dept_type);
>> final_data = FOREACH grouped GENERATE
>>   FLATTEN(group) as (id, org_type, dept_type),
>>  (group.org_type == 'S' AND group.dept_type == 1? SUM(research.cnt) :0) AS
>> sale,
>> .....
>>
>> Hope this helps.
>>
>> -Dmitriy
>>
>> On Thu, May 6, 2010 at 9:33 PM, Syed Wasti <[email protected]> wrote:
>>> Hi,
>>> Trying to make my script execute happily, it wont stop throwing errors.
>>> It works if I don¹t group my data. But once I group,
>>> It starts with;
>>> ERROR 1000: Error during parsing. Invalid alias: id  in {group:
>>> long,edu_affl: {id
>>> I got rid of it,  then it complains about Invalid alias: org_type,
>>> It keeps going and at the end it complains about mismatch in comparison, as
>>> the left side is a bag and right side is chararray. How ?
>>>
>>> Not sure, where I am going wrong, need help;
>>>
>>>
>>> research =      LOAD 'count'
>>>                USING PigStorage('\t')
>>>                as (id: long, org_type: chararray, dept_type: int,
>>>                cnt: int, cnt_distinct: int);
>>>
>>> final_grp =     GROUP research BY id;
>>>
>>> Final_table =   FOREACH final_grp
>>>                GENERATE id,
>>>                (org_type == 'S' AND dept_type == 1?cnt:0) AS sale,
>>>                (org_type == 'S' AND dept_type == 2?cnt:0) AS saleEx,
>>>                (org_type == 'C' AND dept_type == 2?cnt:0) AS acc_grp,
>>>                (org_type == 'C' AND dept_type == 4?cnt:0) AS
>>> acc_grp_manual,
>>>                (org_type == 'S' AND dept_type == 1?cnt_distinct:0) AS
>>> sale_distinct,
>>>                (org_type == 'S' AND dept_type == 2?cnt_distinct:0) AS
>>> saleEx_distinct,
>>>                (org_type == 'C' AND dept_type == 2?cnt_distinct:0) AS
>>> acc_grp_distinct,
>>>                (org_type == 'C' AND dept_type == 4?cnt_distinct:0) AS
>>> accM_grp_distinct;
>>>
>>> STORE final_table INTO 'final_count' USING PigStorage();
>>>
>>> Regards
>>> Syed
>>>
>>
>
>
>

Re: GROUP Error

Reply via email to