Pradeep Kamath updated PIG-665:

    Status: Patch Available  (was: Open)

As described in the issue, when the map plan contains only POLoad operators,
KeyTypeDiscoveryVisitor fails to find the map key type, because it only
visits POPackage and POLocalRearrange. The patch makes the following changes
to address this:
1) The visitor no longer visits POPackage, since map key type information
should only come from a POLocalRearrange.
2) The visitor first visits the POLocalRearranges in the map plan. If it does
not find the key type there, it visits the reduce plans of the predecessor
MapReduceOpers. If there are no predecessors, this could be a simple
load-store script, so the visitor terminates normally. If it discovers the
same key type from all its predecessors, it succeeds; otherwise it aborts,
since it should get an identical key type from every predecessor (each of
which should have the corresponding POLocalRearranges). If the visitor is
unable to discover the key type in the map plan or in the predecessors, it
checks whether the current MapReduceOper has a reduce phase. The visitor
aborts with a failure only if there is a reduce phase and the key type could
not be discovered.
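
The lookup order described above can be sketched roughly as follows. This is
a simplified Python model written for illustration only, not the code in the
patch: the Job class, its fields, discover_map_key_type and
KeyTypeDiscoveryError are hypothetical stand-ins for MapReduceOper and the
visitor, and key types are shown as plain strings. It also assumes that the
"no predecessors" case simply falls through to the reduce-phase check.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class KeyTypeDiscoveryError(Exception):
    """Raised when the key type is inconsistent or missing for a job that needs it."""


@dataclass
class Job:
    # Hypothetical stand-in for a MapReduceOper (not Pig's actual class).
    # Key types exposed by POLocalRearranges in the map plan.
    map_rearrange_key_types: List[str] = field(default_factory=list)
    # Key types exposed by POLocalRearranges in the reduce plan.
    reduce_rearrange_key_types: List[str] = field(default_factory=list)
    has_reduce_phase: bool = False
    predecessors: List["Job"] = field(default_factory=list)


def discover_map_key_type(job: Job) -> Optional[str]:
    # Step 1: look only at local rearranges in the map plan.
    if job.map_rearrange_key_types:
        return job.map_rearrange_key_types[0]

    # Step 2: fall back to the reduce plans of the predecessor jobs.
    found = set()
    for pred in job.predecessors:
        found.update(pred.reduce_rearrange_key_types)
    if len(found) > 1:
        # All predecessors are expected to agree on the key type.
        raise KeyTypeDiscoveryError("predecessors disagree: %s" % sorted(found))
    if found:
        return next(iter(found))

    # Step 3: nothing found. This is only an error if a reduce phase exists,
    # because only then must the map emit a valid key object.
    if job.has_reduce_phase:
        raise KeyTypeDiscoveryError("no key type found for job with reduce phase")
    return None  # e.g. a simple load-store script


# Shape of the script in this issue: two group jobs feed a join job whose
# map plan has only loads, so the key type comes from the predecessors.
group_b = Job(map_rearrange_key_types=["int"],
              reduce_rearrange_key_types=["int"], has_reduce_phase=True)
group_a = Job(map_rearrange_key_types=["tuple"],
              reduce_rearrange_key_types=["int"], has_reduce_phase=True)
join_job = Job(has_reduce_phase=True, predecessors=[group_b, group_a])
print(discover_map_key_type(join_job))
```

In this model the join job resolves its key type from the predecessors'
reduce plans, while a job with no rearranges, no predecessors and no reduce
phase returns None without failing.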

I have added a unit test case for the script in this issue. As part of this
patch, I have also added the visit method that was missing from
POCombinerPackage.

> Map key type not correctly set (for use when key is null) when map plan does 
> not have localrearrange
> ----------------------------------------------------------------------------------------------------
>                 Key: PIG-665
>                 URL: https://issues.apache.org/jira/browse/PIG-665
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>         Attachments: PIG-665.patch
> KeyTypeDiscoveryVisitor visits the map plan to figure out the datatype of the 
> map key. This is required so that when the map key is null, we can still 
> construct a valid NullableXXXWritable object to pass on to hadoop in the 
> collect() call (hadoop needs a valid object even for null objects). Currently 
> the KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange to 
> figure out the key type. In a pig script which results in multiple Map reduce 
> jobs, one of the jobs could have a map plan with only POLoads in it. In such 
> a case, the map key type is not discovered and this results in a null being 
> returned from HDataType.getWritableComparableTypes() method. This in turn 
> will result in a NullPointerException in the collect().
> Here is a script which can prompt this behavior:
> {code}
> a = load 'a.txt' as (x:int, y:int, z:int);
> b = load 'b.txt' as (x:int, y:int);
> b_group = group b by x;
> b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks;
> a_group = group a by (x, y);
> a_aggs = foreach a_group {
>             generate 
>                 flatten(group) as (x, y),
>                 SUM(a.z) as zs;
>                 };
> join_a_b = join b_sum by x, a_aggs by x;
> -- the map plan for this join will only have two POLoads, which will result
> -- in the NullPointerException at runtime in collect()
> dump join_a_b;
> {code} 
> Contents of a.txt (columns are tab separated). The first column of the
> first two rows is null (represented by an empty column):
> {noformat}
>         7       8
>         8       9
> 1       20      30
> 1       20      40
> {noformat}
> Contents of b.txt (columns are tab separated):
> {noformat}
> 7       2
> 1       5
> 1       10
> {noformat}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
