[jira] Commented: (PIG-496) project of bags from complex data causes failures

Daniel Dai (JIRA) Mon, 10 Jan 2011 17:45:09 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979904#action_12979904
 ]


Daniel Dai commented on PIG-496:
--------------------------------

We need to decide how to load an empty bag (a bag with no inner schema), e.g.:
{code}
A = load 'data.txt' as (x: bag{});
{code}
Currently, we load x as a bag but do not interpret anything inside it. So what 
we load is a bag of bytearrays.

This, however, causes problems when we process the bag further. Assume that 
in data.txt the bag actually contains three-field tuples:
{code}
B = foreach A generate x.($1, $2); 
{code}
We expect this to project the 2nd and 3rd fields of each tuple. But in the 
current code, x is a bag with a single bytearray field, so this results in an 
error. Similarly:
{code}
B = foreach A generate flatten(x);
{code}
We expect this to flatten x into 3 fields. But in the current code, we cannot 
even flatten x, since x does not contain tuples.

The problem stems from two sources:
1. Currently a bag requires tuples in some cases but not in others. This is 
inconsistent. We should make it a rule: loading a bag always means loading a 
bag of tuples.

2. When we load a tuple with an unknown number of fields (the tuple's inner 
schema is unknown), we assume it contains only one bytearray field. However, 
it is not possible to cast one bytearray field to multiple fields later. 
Recall what happens when we load a file with unknown schema:
{code}
A = load 'data.txt';
{code}
We actually load multiple fields separated by the delimiter, and each field is 
of type bytearray. When we load an empty bag, we can mimic this behavior.
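
To make the comparison concrete, here is a rough sketch of the schema-less load described above, written in Python rather than Pig. The function name `load_without_schema` and the tab delimiter are assumptions for illustration only; this is not the actual loader code.

```python
def load_without_schema(line: bytes, delimiter: bytes = b"\t") -> tuple:
    """Split one input line into a tuple of raw byte fields.

    Hypothetical helper: models "load with unknown schema", where every
    field is kept as an uninterpreted bytearray.
    """
    # Drop the trailing newline, then split on the delimiter.
    return tuple(line.rstrip(b"\r\n").split(delimiter))


record = load_without_schema(b"alice\t42\t3.14\n")
# Three fields come back, and each one is still raw bytes (no casting yet).
print(len(record), [type(f).__name__ for f in record])
```

The point is only that the number of fields is discovered from the data, not assumed to be one.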

So I propose two changes:
1. Loading a bag implies loading a bag of tuples, even when the bag's inner 
schema is empty.
2. When we convert a bytearray to a tuple with no inner schema, we no longer 
assume one field. We will take the comma as the delimiter (in the case of 
UTF8StorageConverter) and produce a tuple of multiple bytearray fields.
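
A minimal sketch of change 2, again in Python and only as a model of the idea. The names `bytes_to_tuple_old` and `bytes_to_tuple_new` are hypothetical, and the sketch assumes flat tuples with no nested bags, tuples, or maps; it is not the UTF8StorageConverter code.

```python
def bytes_to_tuple_old(raw: bytes) -> tuple:
    """Current behaviour as described: the whole value is one bytearray field."""
    return (raw,)


def bytes_to_tuple_new(raw: bytes, delimiter: bytes = b",") -> tuple:
    """Proposed behaviour: split on the delimiter into several bytearray fields.

    Assumes a flat tuple; nested complex types are not handled here.
    """
    text = raw.strip()
    # Strip the enclosing parentheses if they are present.
    if text.startswith(b"(") and text.endswith(b")"):
        text = text[1:-1]
    if not text:
        return ()
    return tuple(field.strip() for field in text.split(delimiter))


old = bytes_to_tuple_old(b"(1,2,3)")
new = bytes_to_tuple_new(b"(1,2,3)")
print(len(old), len(new))
```

With the old behaviour there is only one field, so a projection such as `$1` has nothing to pick; with the new behaviour the fields exist and can be cast individually later.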

Assume data.txt is:
{(1,2,3),(4,5,6)}
After this change:
A = load 'data.txt' as (x: bag{});
describe A;
We get: bag{}
dump A;
We get: {(1,2,3),(4,5,6)}, which is not a bag of bytearrays but a bag of 
three-field tuples.
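
The example above can be modelled end to end with a short Python sketch. `parse_bag` and `project` are hypothetical helpers, the regular expression only handles flat tuples, and none of this is Pig's real parser; it just shows why the projection `x.($1, $2)` becomes possible once the bag holds multi-field tuples.

```python
import re


def parse_bag(raw: bytes) -> list:
    """Parse a flat bag literal such as b"{(1,2,3),(4,5,6)}" into tuples of bytes."""
    text = raw.strip()
    if not (text.startswith(b"{") and text.endswith(b"}")):
        raise ValueError("not a bag literal")
    # Flat tuples only: take the text between each pair of parentheses.
    bodies = re.findall(rb"\(([^()]*)\)", text[1:-1])
    return [
        tuple(f.strip() for f in body.split(b",")) if body else ()
        for body in bodies
    ]


def project(bag: list, *positions: int) -> list:
    """Model of x.($i, $j): keep the given 0-based positions of every tuple."""
    return [tuple(t[p] for p in positions) for t in bag]


x = parse_bag(b"{(1,2,3),(4,5,6)}")
projected = project(x, 1, 2)  # corresponds to x.($1, $2)
print(x)
print(projected)
```

Every field is still bytes here, which matches the proposal: the structure is recovered at load time, while the field types stay bytearray until cast.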

> project of bags from complex data causes failures
> -------------------------------------------------
>
>                 Key: PIG-496
>                 URL: https://issues.apache.org/jira/browse/PIG-496
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> A = load 'complex data' as (x: bag{});
> B = foreach A generate x.($1, $2);
> produces stack trace:
> 2008-10-14 15:11:07,639 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200809241441_9923_r_000000
> java.lang.NullPointerException
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:215)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:166)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:252)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:222)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:134)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> Pradeep suspects that the problem is in 
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POProject.java;
>  line 374

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
