[ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835944#action_12835944
 ] 

Richard Ding commented on PIG-1188:
-----------------------------------

To summarize where we are:

Right now the Pig project operator pads with null if the value to be projected 
doesn't exist. As a consequence, the desired result is achieved if PigStorage 
is used and a schema with data types is specified, since in this case Pig 
inserts a project+cast operator for each field in the schema.
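
For the 1.txt example in the issue description, a typed load might look like 
the sketch below. The int types are an assumption for illustration only; the 
issue itself does not declare any types.

{code}
-- assumed types; PigStorage is the default loader when no 'using' clause is given
a = load '1.txt' as (a0:int, a1:int);
dump a;
{code}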

In the case where no schema is specified in the load statement, Pig does a 
good job of adhering to its philosophy and lets the program run without 
throwing a runtime exception.

That leaves the case where a schema is specified without data types. There are 
several options:

   * Pig automatically inserts a project operator for each field in the schema 
to ensure the input data matches the schema. The trade-off is a performance 
penalty. Is it worthwhile if most user data is well-behaved?

   * Users can explicitly add a foreach statement after the load statement 
that projects all the fields in the schema. This is similar to the common 
practice of running a map job first to clean up the data.

   * Pig can also delegate the padding work to the loaders. The problem is that 
the schema currently isn't passed to the loaders.
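
As a rough sketch of the second option, using the script from the issue 
description (the alias b is only illustrative):

{code}
a = load '1.txt' as (a0, a1);
-- explicitly project every field named in the schema
b = foreach a generate a0, a1;
dump b;
{code}

If the project operator behaves as described above, each tuple in b should 
have exactly two fields, with null in place of a missing a1.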





> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data. 
> When we have a schema, we should generate input data according to the schema, 
> padding with nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1       2
> 1       2       3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
