[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Alan Gates (JIRA) Tue, 02 Feb 2010 18:06:43 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828896#action_12828896
 ]


Alan Gates commented on PIG-1188:
---------------------------------

After further thought I want to change my position on this.

There are two cases to consider: when a schema is present and when it isn't.  The 
problem is that by the time Pig tries to access the missing field (in the 
backend), it has no idea whether a schema exists or not.  So at runtime, Pig 
should just return a null if it gets an ArrayIndexOutOfBoundsException.
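
As a rough illustration of that runtime behavior, here is a minimal sketch. It 
is not Pig's actual backend code: the tuple is modeled as a plain List<Object>, 
and the class and method names (FieldAccess, fieldOrNull) are hypothetical. It 
catches IndexOutOfBoundsException, the superclass of 
ArrayIndexOutOfBoundsException, so it covers both array-backed and list-backed 
tuples.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Pig.
class FieldAccess {
    /** Returns the field at the given position, or null if the tuple is too short. */
    static Object fieldOrNull(List<Object> tuple, int index) {
        try {
            return tuple.get(index);
        } catch (IndexOutOfBoundsException e) {
            // The backend cannot tell whether a schema was declared,
            // so a missing field is simply treated as null.
            return null;
        }
    }

    public static void main(String[] args) {
        // Models the third input row from the issue: a single field.
        List<Object> shortTuple = Arrays.<Object>asList("1");
        System.out.println(fieldOrNull(shortTuple, 0)); // existing field
        System.out.println(fieldOrNull(shortTuple, 1)); // missing field
    }
}
```

Because the lookup only reacts when a field is actually requested, a short 
tuple behaves as if it had been padded with nulls at the end, with no per-tuple 
check on the common path.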

How to pad missing data should be left up to the load function.  Perhaps 
certain load functions do know how to pad missing data, or are ok with the 
pad-at-the-end scheme proposed here.  If the load function does not check, then 
Pig would effectively pad at the end, given the proposal above.  If the load 
function implementer does not want this to happen, s/he can check each tuple 
being read from the input to ensure it matches the schema, and then decide to 
pad the tuple with nulls, reject the tuple, or return a tuple full of nulls.
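
The three choices open to a load function could be sketched roughly as below. 
This is an assumption-laden illustration, not Pig's LoadFunc API: the tuple is 
again a List<Object>, the schema is reduced to an expected field count, and 
SchemaCheck, Policy and conform are hypothetical names. Truncating extra fields 
under the PAD policy is also an assumption, taken from the "desired result" in 
the issue description rather than from this comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch for illustration only; not part of Pig.
class SchemaCheck {
    enum Policy { PAD, REJECT, ALL_NULLS }

    /**
     * Conforms a tuple to the expected number of fields.
     * Returns null when the tuple is rejected.
     */
    static List<Object> conform(List<Object> tuple, int expected, Policy policy) {
        if (tuple.size() == expected) {
            return tuple; // already matches the schema
        }
        switch (policy) {
            case PAD: {
                // Keep at most `expected` fields, then pad with nulls at the end.
                int keep = Math.min(tuple.size(), expected);
                List<Object> out = new ArrayList<>(tuple.subList(0, keep));
                while (out.size() < expected) {
                    out.add(null);
                }
                return out;
            }
            case ALL_NULLS:
                // Replace the mismatched tuple with one made entirely of nulls.
                return new ArrayList<>(Collections.<Object>nCopies(expected, null));
            case REJECT:
            default:
                return null; // caller drops the tuple
        }
    }

    public static void main(String[] args) {
        // The three input rows from the issue, checked against (a0, a1).
        List<List<Object>> rows = Arrays.asList(
                Arrays.<Object>asList("1", "2"),
                Arrays.<Object>asList("1", "2", "3"),
                Arrays.<Object>asList("1"));
        for (List<Object> row : rows) {
            System.out.println(conform(row, 2, Policy.PAD));
        }
    }
}
```

Note that every policy pays for the size comparison on each tuple, which is 
the per-tuple cost at issue for PigStorage.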

In the case of PigStorage, checking each tuple for a match against the schema 
is too expensive.  Ideally I would like it to, because I think that when the 
user gives a schema it's an error if the data doesn't match.  But I don't want 
to pay the performance penalty in this case.  

> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>             Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data. 
> When we have a schema, we should generate input data according to the schema, 
> padding with nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1       2
> 1       2       3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
