[ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835944#action_12835944 ]
Richard Ding commented on PIG-1188:
-----------------------------------

To summarize where we are:

Right now Pig's project operator pads a null if the value to be projected doesn't exist. As a consequence, the desired result is already achieved when PigStorage is used and a schema with data types is specified, since in this case Pig inserts a project+cast operator for each field in the schema.

In the case where no schema is specified in the load statement, Pig does a good job of adhering to Pig's philosophy and lets the program run without throwing a runtime exception.

That leaves the case where a schema is specified without data types. There are several options:

* Pig automatically inserts a project operator for each field in the schema to ensure the input data matches the schema. The trade-off is a performance penalty. Is it worthwhile if most user data is well-behaved?

* Users can explicitly add a foreach statement after the load statement that projects all the fields in the schema. This is similar to users' practice of running a map job first to clean up the data.

* Pig can also delegate the padding work to the loaders. The problem is that currently the schema isn't passed to the loaders.

> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>            Assignee: Richard Ding
>             Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data.
> When we have a schema, we should generate input data according to the schema,
> padding nulls if necessary.
> Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1 2
> 1 2 3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
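
For illustration only, the padding behavior this issue asks for can be sketched in a few lines of Python. This is not Pig's implementation; the function name conform_to_schema and the tab delimiter are assumptions made for the example. It projects exactly as many fields as the schema declares, which drops extra fields and pads missing ones with a null (None here), and it reproduces the "Desired result" from the issue description.

```python
from typing import List, Optional, Sequence, Tuple


def conform_to_schema(
    fields: Sequence[str], num_schema_fields: int
) -> Tuple[Optional[str], ...]:
    """Project one position per schema field (hypothetical helper, not a Pig API).

    A position that does not exist in the input row is padded with None,
    and input fields beyond the schema width are dropped.
    """
    projected: List[Optional[str]] = []
    for i in range(num_schema_fields):
        projected.append(fields[i] if i < len(fields) else None)
    return tuple(projected)


# The three input rows from the issue, assumed here to be tab-delimited.
rows = ["1\t2", "1\t2\t3", "1"]

# Schema (a0, a1) declares two fields.
result = [conform_to_schema(row.split("\t"), 2) for row in rows]

for t in result:
    print(t)
```

The values stay as strings because the schema in the example declares no data types; only the number of fields is adjusted.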