[ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775961#action_12775961
 ] 

Alan Gates commented on PIG-1065:
---------------------------------

As originally defined, UNION does allow its two inputs to have different 
schemas, in which case the result should have no schema.  So the error here is 
really that the runtime system doesn't adapt to the unexpected type.  I'm not 
sure how useful this functionality is, but it is in keeping with Pig's original 
"no schema required" spirit, so I'm not inclined to change it.  At some point 
in the future we should consider how general we want to be in these cases, as 
this generality carries a significant cost.  But let's decide that for the 
language as a whole rather than piecemeal.

I would support adding a conforming union (I have no idea what to call it) 
that requires its inputs to have either the same schema or schemas that are 
somehow compatible.  It would then handle type promotion and add nulls for 
missing columns.
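
As a rough illustration of what such a conforming union might do, here is a 
minimal Python sketch.  It is a model, not Pig code, and every name in it is 
made up for illustration.  It assumes columns are matched by name and that 
numeric types promote in the order int, long, float, double; neither rule is 
settled above.  It promotes only the declared types and does not cast values.

```python
# Hypothetical model of a "conforming union"; not Pig's implementation.
# A schema is an ordered list of (name, type) pairs; rows are tuples.

# Assumed promotion order for numeric types, narrowest to widest.
NUMERIC_ORDER = ["int", "long", "float", "double"]


def promote(t1, t2):
    """Return the wider of two types, or raise if they are incompatible."""
    if t1 == t2:
        return t1
    if t1 in NUMERIC_ORDER and t2 in NUMERIC_ORDER:
        return max(t1, t2, key=NUMERIC_ORDER.index)
    raise TypeError(f"incompatible types: {t1} and {t2}")


def merge_schemas(s1, s2):
    """Merge two schemas by column name, promoting types where needed."""
    merged = dict(s1)  # keeps the column order of the first input
    for name, typ in s2:
        merged[name] = promote(merged[name], typ) if name in merged else typ
    return list(merged.items())


def conform(rows, schema, merged):
    """Lay each row out in merged-schema order, None for missing columns."""
    index = {name: i for i, (name, _) in enumerate(schema)}
    return [
        tuple(row[index[n]] if n in index else None for n, _ in merged)
        for row in rows
    ]


def conforming_union(rows1, s1, rows2, s2):
    """Return the merged schema and the rows of both inputs under it."""
    merged = merge_schemas(s1, s2)
    return merged, conform(rows1, s1, merged) + conform(rows2, s2, merged)
```

Applied to the two inputs from the issue, (key:chararray, v:chararray) and 
(key:chararray), this model would keep both columns and pad the second 
input's rows with a null for v, so the result has a known schema.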

Finally, in response to Santhosh's comment on AS, I think whether we extend AS 
to statements beyond LOAD depends on the choices we make about schemas, or the 
lack thereof, as discussed above.  But removing it from FOREACH isn't easy.  
Consider the following:

{code}
B = foreach A generate $0, flatten($1), $2, flatten($3) AS (a:int, b:int, c:float, d:long, e:chararray);
{code}

If we don't have a schema for $1 and $3, we don't know how many fields each 
flatten will produce, so we can't tell whether $2 should end up being c:float 
or d:long.
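
To make the ambiguity concrete, here is a small Python sketch (again a model, 
not Pig code; the function name and its parameters are hypothetical).  Given 
the number of fields that flatten($1) and flatten($3) each produce, it 
computes which name from the AS clause above lands on $2.

```python
# Names from the AS clause in the example above.
NAMES = ["a", "b", "c", "d", "e"]


def name_for_dollar2(width_of_1, width_of_3):
    """Return the AS name that lands on $2, given how many fields
    flatten($1) and flatten($3) each produce (hypothetical inputs)."""
    # Output layout: $0, flatten($1)..., $2, flatten($3)...
    total = 1 + width_of_1 + 1 + width_of_3
    if total != len(NAMES):
        raise ValueError("AS clause does not match the output field count")
    # $2 comes right after $0 and the fields flattened out of $1.
    return NAMES[1 + width_of_1]
```

With one field from $1 and two from $3, $2 is named c; with two fields from 
$1 and one from $3, it is named d.  Both layouts fill all five names, so the 
AS clause alone cannot distinguish them.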


> In-determinate behaviour of Union when there are 2 non-matching schema's
> ------------------------------------------------------------------------
>
>                 Key: PIG-1065
>                 URL: https://issues.apache.org/jira/browse/PIG-1065
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Viraj Bhat
>             Fix For: 0.6.0
>
>
> I have a script which first does a union of these schemas and then does an 
> ORDER BY of this result.
> {code}
> f1 = LOAD '1.txt' as (key:chararray, v:chararray);
> f2 = LOAD '2.txt' as (key:chararray);
> u0 = UNION f1, f2;
> describe u0;
> dump u0;
> u1 = ORDER u0 BY $0;
> dump u1;
> {code}
> When I run in Map Reduce mode I get the following result:
> $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
> ====================
> Schema for u0 unknown.
> ====================
> (1,2)
> (2,3)
> (1)
> (2)
> ====================
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias u1
>         at org.apache.pig.PigServer.openIterator(PigServer.java:475)
>         at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
>         at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>         at org.apache.pig.Main.main(Main.java:397)
> ====================
> Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> ====================
> When I run the same script in local mode I get a different result, since 
> local mode does not use any Hadoop classes.
> $java -cp pig.jar org.apache.pig.Main -x local broken.pig
> ====================
> Schema for u0 unknown
> ====================
> (1,2)
> (1)
> (2,3)
> (2)
> ====================
> (1,2)
> (1)
> (2,3)
> (2)
> ====================
> Here are some questions:
> 1) Why do we allow union if the schemas do not match?
> 2) Should we not print an error message/warning so that the user knows that 
> this is not allowed or that it can produce unexpected results?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
