[ 
https://issues.apache.org/jira/browse/DATAFU-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616338#comment-15616338
 ] 

Eyal Allweil commented on DATAFU-41:
------------------------------------

I'm trying to see if I understand what's happening here. Do you mean replacing 
the line

{noformat}
data3 = FOREACH data2 GENERATE group as id, BagGroup(data,data.key) as grouped;
{noformat}

with

{noformat}
data3 = FOREACH data2 GENERATE group as id, BagGroup(data.(key,val),data.key) as grouped;
{noformat}

When I do this, I do get the schema for data3 described above, in which the 
inner bag of each group that BagGroup returns has no name. But is this really a 
bug? BagGroup is receiving a bag without a name as input, so what name could it 
give? The name _data_ isn't passed to the UDF at all in this case. (I checked 
this by debugging and inspecting the input schema in 
_BagGroup.getOutputSchema()_.)
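
To make the reasoning above concrete, here is a toy model in plain Java. It is 
only a sketch and not DataFu's or Pig's actual code: the class BagAliasSketch, 
its Field type, and the outputTuple and describe helpers are all hypothetical 
names, and the sketch assumes the UDF simply copies whatever alias the input 
bag carries. Under that assumption, a bag that arrives without an alias 
necessarily comes out without one.

```java
import java.util.ArrayList;
import java.util.List;

public class BagAliasSketch {
    /** Toy stand-in for a schema field: an alias (null if unnamed) plus a type string. */
    static final class Field {
        final String alias; // null models a field that has no name
        final String type;

        Field(String alias, String type) {
            this.alias = alias;
            this.type = type;
        }

        String describe() {
            return alias == null ? type : alias + ": " + type;
        }
    }

    /**
     * Hypothetical model of the output tuple (group, input bag).
     * ASSUMPTION: the input bag's alias is copied as-is, so a null alias stays null.
     */
    static List<Field> outputTuple(Field inputBag, String keyType) {
        List<Field> out = new ArrayList<>();
        out.add(new Field("group", keyType));
        out.add(new Field(inputBag.alias, inputBag.type));
        return out;
    }

    /** Renders the fields roughly the way DESCRIBE prints a tuple schema. */
    static String describe(List<Field> fields) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append(fields.get(i).describe());
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // Models passing "data": the bag carries its alias.
        Field named = new Field("data", "{(id: int,key: chararray,val: int)}");
        // Models passing "data.(key,val)": the projected bag has no alias.
        Field projected = new Field(null, "{(key: chararray,val: int)}");

        System.out.println(describe(outputTuple(named, "chararray")));
        System.out.println(describe(outputTuple(projected, "chararray")));
    }
}
```

In this model the first call keeps the name data on the inner bag and the 
second leaves it unnamed, which matches the two schemas quoted in the issue.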


> BagGroup does not name bag field in some cases
> ----------------------------------------------
>
>                 Key: DATAFU-41
>                 URL: https://issues.apache.org/jira/browse/DATAFU-41
>             Project: DataFu
>          Issue Type: Bug
>            Reporter: Matthew Hayes
>
> For this test:
> {code}
> /**
>   define BagSum datafu.pig.bags.BagSum();
>   define BagGroup datafu.pig.bags.BagGroup();
>   
>   data = LOAD 'input' USING PigStorage(',') AS (id:int, key:chararray, val:int);
>   describe data;
>   
>   data2 = GROUP data BY id;
>   
>   describe data2;
>   
>   data3 = FOREACH data2 GENERATE group as id, BagGroup(data,data.key) as grouped;
>   
>   describe data3;
>   
>   data4 = FOREACH data3 {
>     summed = FOREACH grouped GENERATE group as key, SUM($1.val) as total;
>     ordered = ORDER summed BY key;
>     GENERATE id, ordered;
>   }
>   
>   describe data4;
>   
>   STORE data4 INTO 'output';
>    */
>   @Multiline
>   private String bagSumTest;
>   
>   @Test
>   public void bagSumTest() throws Exception
>   {
>     PigTest test = createPigTestFromString(bagSumTest);
>     writeLinesToFile("input", "1,A,1","1,B,2","2,A,3","3,A,4","1,C,5","1,C,6",
>                      "3,A,7","2,B,8","1,A,9","2,A,10");
>     test.runScript();
>     assertOutput(test, "data4", 
>                  "(1,{(A,10),(B,2),(C,11)})",
>                  "(2,{(A,13),(B,8)})",
>                  "(3,{(A,11)})");
>   }
> {code}
> {{data3}} is described as:
> {code}
> data3: {id: int,grouped: {(group: chararray,data: {(id: int,key: chararray,val: int)})}}
> {code}
> However, if we change {{data}} to {{data.(key,val)}} then {{data3}} is 
> described as:
> {code}
> data3: {id: int,grouped: {(group: chararray,{(key: chararray,val: int)})}}
> {code}
> Note that there is no name, so you have to reference it by {{$1}}.  There is 
> a separate issue, DATAFU-40, where even when it has the name {{data}} you 
> can run into problems later.



