[
https://issues.apache.org/jira/browse/PIG-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979904#action_12979904
]
Daniel Dai commented on PIG-496:
--------------------------------
We need to decide how to load an empty bag, e.g.:
{code}
A = load 'data.txt' as (x: bag{});
{code}
Currently, we load x as a bag and do not interpret anything inside it, so what
we load is a bag of bytearrays.
This, however, causes problems when we process the bag further. Assume that
in data.txt the bag actually contains three-field tuples:
{code}
B = foreach A generate x.($1, $2);
{code}
We expect this to project the 2nd and 3rd fields of each tuple. But in the
current code, x is a bag whose elements are a single bytearray field, so this
results in an error. Similarly:
{code}
B = foreach A generate flatten(x);
{code}
We expect this to flatten x into 3 fields. But in the current code we cannot
even flatten x, since x does not contain tuples.
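To make the projection failure concrete, here is a toy model in Python. It is not Pig's implementation: representing a bag as a list of tuples and a bytearray as a `bytes` object, the `project` helper, and the exact shape of `current_x` are all assumptions made for illustration.

```python
# Toy model, not Pig's implementation: a bag is a list of tuples,
# and a bytearray field is a Python bytes object.

def project(bag, columns):
    """Mimic x.($1, $2): keep the given zero-based columns of every tuple."""
    return [tuple(t[c] for c in columns) for t in bag]

# Assumed shape of the current behavior: each element is one opaque field.
current_x = [(b"(1,2,3)",), (b"(4,5,6)",)]

# What the script author expects: three-field tuples.
expected_x = [(b"1", b"2", b"3"), (b"4", b"5", b"6")]

# Works on three-field tuples: keeps the 2nd and 3rd fields.
print(project(expected_x, [1, 2]))

# Fails on one-field elements: there is no column 1 or 2 to project.
try:
    project(current_x, [1, 2])
except IndexError as exc:
    print("projection failed:", exc)
```

In this model the failure is an IndexError; in Pig it shows up as the error described above.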
The problem stems from two sources:
1. Currently a bag requires tuples in some cases but not in others. This is
inconsistent. We should make it a rule: loading a bag always means loading a
bag of tuples.
2. When we load a tuple with an unknown number of fields (the tuple's inner
schema is unknown), we assume it contains only one bytearray field. However, it
is not possible to cast one bytearray field to multiple fields later. Recall
what happens when we load a file with an unknown schema:
{code}
A = load 'data.txt';
{code}
We actually load multiple fields separated by a delimiter, and each field is of
type bytearray. When we load an empty bag, we can mimic this behavior.
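The schema-less load described above can be sketched in a few lines of Python. This is only an illustration: `load_line` is a hypothetical helper, and the tab default is assumed to match PigStorage's default delimiter.

```python
def load_line(line: bytes, delimiter: bytes = b"\t") -> tuple:
    """Split one input line into fields without interpreting any of them.

    Every field stays a raw bytes value, standing in for Pig's bytearray.
    """
    return tuple(line.rstrip(b"\n").split(delimiter))

# Hypothetical input line with three tab-separated fields.
row = load_line(b"alice\t42\t3.5\n")
print(len(row), row)
```

The number of fields comes from the data rather than from a declared schema, and this is the behavior the proposal wants to mimic for tuples inside a bag.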
So I propose two changes:
1. Loading a bag implies loading a bag of tuples, even when the bag's inner
schema is empty.
2. When we convert a bytearray to a tuple with no inner schema, we no longer
assume one field. We will take comma as the delimiter (in the case of
UTF8StorageConverter) and produce a tuple of multiple bytearray fields.
Assume data.txt is:
{(1,2,3),(4,5,6)}
After this change,
A = load 'data.txt' as (x: bag{});
describe A:
We get: bag{}
dump A:
We get: {(1,2,3),(4,5,6)}, which is not a bag of bytearrays, but a bag of
three-field tuples.
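A rough Python sketch of the proposed conversion, applied to the sample data above. It is not the UTF8StorageConverter code: `parse_tuple` and `parse_bag` are hypothetical names, and the sketch assumes flat tuples only (no nested bags or tuples, no escaped commas).

```python
import re

def parse_tuple(text: str, delimiter: str = ",") -> tuple:
    """Convert '(1,2,3)' into a tuple of uninterpreted bytearray fields."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"not a tuple: {text!r}")
    inner = text[1:-1]
    if inner == "":
        return ()
    # Proposed change 2: split on the delimiter instead of assuming one field.
    return tuple(field.encode("utf-8") for field in inner.split(delimiter))

def parse_bag(text: str) -> list:
    """Convert '{(1,2,3),(4,5,6)}' into a list of tuples (flat tuples only)."""
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not a bag: {text!r}")
    # Proposed change 1: every element of a bag is a tuple.
    return [parse_tuple(m) for m in re.findall(r"\([^()]*\)", text[1:-1])]

bag = parse_bag("{(1,2,3),(4,5,6)}")
print(bag)

# With real tuples in the bag, a projection like x.($1, $2) has fields to pick.
print([(t[1], t[2]) for t in bag])
```

Each field remains a bytearray, so any later cast to a concrete type still happens per field.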
> project of bags from complex data causes failures
> -------------------------------------------------
>
> Key: PIG-496
> URL: https://issues.apache.org/jira/browse/PIG-496
> Project: Pig
> Issue Type: Bug
> Reporter: Olga Natkovich
> Assignee: Daniel Dai
> Priority: Minor
> Fix For: 0.9.0
>
>
> A = load 'complex data' as (x: bag{});
> B = foreach A generate x.($1, $2);
> produces stack trace:
> 2008-10-14 15:11:07,639 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200809241441_9923_r_000000java.lang.NullPointerException
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:215)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:166)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:252)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:222)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:134)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> Pradeep suspects that the problem is in
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POProject.java;
> line 374