[ 
https://issues.apache.org/jira/browse/PIG-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-665:
-------------------------------

    Attachment: PIG-665.patch

> Map key type not correctly set (for use when key is null) when map plan does 
> not have localrearrange
> ----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-665
>                 URL: https://issues.apache.org/jira/browse/PIG-665
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>
>         Attachments: PIG-665.patch
>
>
> KeyTypeDiscoveryVisitor visits the map plan to determine the data type of the 
> map key. This is required so that when the map key is null, we can still 
> construct a valid NullableXXXWritable object to pass to Hadoop in the 
> collect() call (Hadoop needs a valid object even when the key is null). 
> Currently KeyTypeDiscoveryVisitor looks only at POPackage and 
> POLocalRearrange to determine the key type. In a Pig script that results in 
> multiple MapReduce jobs, one of the jobs could have a map plan containing 
> only POLoads. In that case the map key type is not discovered, and 
> HDataType.getWritableComparableTypes() returns null. This in turn results 
> in a NullPointerException in collect().
> Here is a script that reproduces this behavior:
> {code}
> a = load 'a.txt' as (x:int, y:int, z:int);
> b = load 'b.txt' as (x:int, y:int);
> b_group = group b by x;
> b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
> a_group = group a by (x, y);
> a_aggs = foreach a_group {
>             generate 
>                 flatten(group) as (x, y),
>                 SUM(a.z) as zs;
>                 };
> -- The map plan for this join will only have two POLoads, which will result
> -- in the NullPointerException at runtime in collect().
> join_a_b = join b_sum by x, a_aggs by x;
> dump join_a_b;
> {code} 
> Contents of a.txt (columns are tab separated). The first column of the first 
> two rows is null (represented by an empty column):
> {noformat}
>         7       8
>         8       9
> 1       20      30
> 1       20      40
> {noformat}
> Contents of b.txt (columns are tab separated):
> {noformat}
> 7       2
> 1       5
> 1       10
> {noformat}
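
To make the failure mode in the description concrete, here is a minimal Java sketch. It is not Pig or Hadoop code: KeyTypeSketch, NullableIntKey, wrapNullKey(), the type codes, and this collect() are all hypothetical stand-ins chosen for illustration. It only assumes what the description states, namely that the type lookup returns null when the key type was never discovered and that collect() then dereferences that null.

```java
final class KeyTypeSketch {
    // Hypothetical type codes for this sketch; not Pig's real DataType constants.
    static final byte UNKNOWN = 0;
    static final byte INTEGER = 1;

    /** Simplified stand-in for a NullableXXXWritable: a typed wrapper that can carry a null key. */
    static final class NullableIntKey {
        final Integer value;

        NullableIntKey(Integer value) {
            this.value = value;
        }

        boolean isNull() {
            return value == null;
        }
    }

    /** Stand-in for the type lookup: returns null when the key type was never discovered. */
    static NullableIntKey wrapNullKey(byte keyType) {
        if (keyType == INTEGER) {
            return new NullableIntKey(null); // valid object representing a null key
        }
        return null; // undiscovered type: no wrapper can be built
    }

    /** Stand-in for collect(): it dereferences the key object, so a null wrapper throws. */
    static String collect(NullableIntKey key) {
        return key.isNull() ? "null-key" : "key=" + key.value;
    }

    public static void main(String[] args) {
        // Key type known: the null key is wrapped and collected without error.
        System.out.println(collect(wrapNullKey(INTEGER)));

        // Key type never discovered: the lookup yields null and collect() fails.
        try {
            collect(wrapNullKey(UNKNOWN));
        } catch (NullPointerException e) {
            System.out.println("NullPointerException for undiscovered key type");
        }
    }
}
```

In this sketch the first call succeeds because a typed wrapper exists even though the key itself is null; the second call fails only because the key type was left undiscovered, which mirrors the map plan with only POLoads described above.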

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
