[ https://issues.apache.org/jira/browse/PIG-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018527#comment-18018527 ]
Rohini Palaniswamy commented on PIG-5474:
-----------------------------------------

Attached [^PIG-5474-2.patch]. It contains a change to compare output after sorting in the unit tests, as they fail in Spark mode because the output order differs.

> Casting error or empty output when as clause is used on a bag with schema not defined
> -------------------------------------------------------------------------------------
>
>                 Key: PIG-5474
>                 URL: https://issues.apache.org/jira/browse/PIG-5474
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Major
>             Fix For: 0.18.0
>
>         Attachments: PIG-5474-1.patch, PIG-5474-2.patch
>
>
> Ran into an issue where a script that worked with an older version of Pig
> failed on Pig 0.17. It was a regression caused by PIG-2315 adding additional
> POCast operators when there is an AS clause.
> A script with the lines below
> {code}
> G = FOREACH F GENERATE a0, org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as bag2;
> H = FOREACH G GENERATE a0, FLATTEN(bag2) as (x1:chararray, x2:double, x3:chararray, x4:long);
> {code}
> ran into this error
> {code}
> ERROR 1075: Received a bytearray from the UDF or Union from two different Loaders. Cannot determine how to convert the bytearray to string for [x1[-1,-1]]
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNextString(POCast.java:1125)
> {code}
> It was not easily reproducible with a simple script and required a sequence
> of steps for the CastLineageSetter to not be able to set the LoadFunc that
> will provide the caster on POCast
> (https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java#L108-L112),
> causing the error in
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1123-L1125.
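The failure mode described above (a bytearray reaching a cast that has no caster attached) can be pictured with a small standalone sketch. This is not Pig code: `NullCasterSketch`, `BytesCaster`, and `castToString` are hypothetical names, and a plain `byte[]` stands in for Pig's bytearray type. It only assumes what the description states, namely that the cast needs a caster supplied by a LoadFunc and raises an error when none was set.

```java
import java.nio.charset.StandardCharsets;

public class NullCasterSketch {
    /** Hypothetical stand-in for the caster that a LoadFunc would supply. */
    interface BytesCaster {
        String bytesToString(byte[] raw);
    }

    /** Simplified stand-in for a bytearray-to-chararray cast. */
    static String castToString(byte[] raw, BytesCaster caster) {
        if (raw == null) {
            return null;
        }
        if (caster == null) {
            // No loader lineage was resolved, so nothing knows how to decode the bytes.
            throw new IllegalStateException(
                "Cannot determine how to convert the bytearray to string");
        }
        return caster.bytesToString(raw);
    }

    public static void main(String[] args) {
        byte[] raw = "abc".getBytes(StandardCharsets.UTF_8);
        BytesCaster utf8 = b -> new String(b, StandardCharsets.UTF_8);

        // With a caster attached, the cast succeeds.
        System.out.println(castToString(raw, utf8));

        // Without one, the cast fails, analogous to the error reported above.
        try {
            castToString(raw, null);
        } catch (IllegalStateException e) {
            System.out.println("cast failed: " + e.getMessage());
        }
    }
}
```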
>
> A user trying to rewrite the script by moving the as clause to the UDF
> statement, instead of after FLATTEN, made the script pass. But all the bags
> produced were empty because casting of the bag
> (https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1824-L1828)
> swallowed the underlying exception and returned null, unlike the primitive
> fields, which throw an error.
> {code}
> G = FOREACH F GENERATE a0, org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as bag2:{t:(a1:chararray, a2:double, a3:chararray, a4:long)};
> H = FOREACH G GENERATE a0, FLATTEN(bag2);
> {code}
> Also realized that this additional POCast has made the processing inefficient
> in general, as it tries to cast everything from bytearray to the type
> specified in the as clause. If the UDF returned the correct type, let's say
> Integer, the code will still try to typecast to DataByteArray, hit a
> ClassCastException, and then cast based on the realType
> (https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L503).
> This is going to add a lot of overhead to processing when there are millions
> of rows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
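To illustrate the overhead described in the last paragraph of the issue, here is a minimal standalone sketch of the "cast, catch ClassCastException, then fall back to the real type" pattern next to a variant that checks the runtime type first. It is not the POCast implementation and not necessarily what the attached patches do: `CastFallbackSketch` and its methods are hypothetical, and `byte[]` stands in for `DataByteArray`.

```java
import java.nio.charset.StandardCharsets;

public class CastFallbackSketch {
    /** Counts how often the exception path was taken (for illustration only). */
    static int exceptionsThrown = 0;

    /** Pattern from the description: assume bytes, fall back after a ClassCastException. */
    static Integer castCatchingException(Object value) {
        if (value == null) {
            return null;
        }
        try {
            byte[] raw = (byte[]) value; // fails when the UDF already returned an Integer
            return Integer.valueOf(new String(raw, StandardCharsets.UTF_8).trim());
        } catch (ClassCastException e) {
            exceptionsThrown++;
            return castByRealType(value);
        }
    }

    /** Assumed alternative: inspect the runtime type before attempting the bytes path. */
    static Integer castCheckingTypeFirst(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof byte[]) {
            return Integer.valueOf(new String((byte[]) value, StandardCharsets.UTF_8).trim());
        }
        return castByRealType(value);
    }

    /** Converts based on the value's actual runtime type. */
    static Integer castByRealType(Object value) {
        if (value instanceof Integer) {
            return (Integer) value;
        }
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        if (value instanceof String) {
            return Integer.valueOf(((String) value).trim());
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    public static void main(String[] args) {
        Object[] rows = {1, 2, 3}; // values that already have the declared type
        for (Object row : rows) {
            castCatchingException(row);
            castCheckingTypeFirst(row);
        }
        // One exception is constructed and caught per row on the first path only.
        System.out.println("exceptions on the try/catch path: " + exceptionsThrown);
    }
}
```

Both methods return the same values; the difference is that the first one constructs and catches an exception for every row whose value already has the declared type, which is the per-row cost the description is concerned about.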