[ https://issues.apache.org/jira/browse/PIG-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-5474:
------------------------------------
    Description:
Ran into an issue where a script that worked with an older version of Pig failed on Pig 0.17. It was a regression caused by PIG-2315 adding additional POCast operators when there is an AS clause.

A script with the lines below
{code}
G = FOREACH F GENERATE a0, org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as bag2;
H = FOREACH G GENERATE a0, FLATTEN(bag2) as (x1:chararray, x2:double, x3:chararray, x4:long);
{code}
ran into this error
{code}
ERROR 1075: Received a bytearray from the UDF or Union from two different Loaders. Cannot determine how to convert the bytearray to string for [x1[-1,-1]]
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNextString(POCast.java:1125)
{code}
It was not easily reproducible with a simple script. It required a sequence of steps that left the CastLineageSetter unable to set, on the POCast, the LoadFunc that provides the caster - https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java#L108-L112 - causing the error in https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1123-L1125.

The user tried rewriting the script by moving the AS clause to the UDF statement instead of after FLATTEN, which made the script pass. But all the bags produced were empty, because the casting of the bag ( https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1824-L1828 ) swallowed the underlying exception and returned null, unlike the primitive fields, which throw an error.
{code}
G = FOREACH F GENERATE a0, org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as bag2:{t:(a1:chararray, a2:double, a3:chararray, a4:long)};
H = FOREACH G GENERATE a0, FLATTEN(bag2);
{code}
Also realized that this additional POCast has made processing inefficient in general, as it tries to cast everything from bytearray to the type specified in the AS clause. Even if the UDF returned the correct type, let's say Integer, the code will still try to typecast to DataByteArray, hit a ClassCastException, and only then cast based on the realType: https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L503. This is going to add a lot of overhead to processing when there are millions of rows.

> Casting error or empty output when as clause is used on a bag with schema not defined
> -------------------------------------------------------------------------------------
>
>                 Key: PIG-5474
>                 URL: https://issues.apache.org/jira/browse/PIG-5474
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Major
>             Fix For: 0.18.0
>
>         Attachments: PIG-5474-1.patch
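To make the overhead described above concrete, here is a minimal standalone Java sketch. It is not Pig's POCast code and not the attached patch: the class and method names (CastOverheadSketch, castViaException, castViaTypeCheck, castByRealType) are hypothetical, and a plain byte[] stands in for DataByteArray. It only contrasts the two control flows: assuming a bytearray and recovering through ClassCastException, versus checking the runtime type first.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; byte[] stands in for Pig's DataByteArray.
final class CastOverheadSketch {

    // Pattern described in the report: assume a bytearray first and fall
    // back to the real type only after a ClassCastException is thrown.
    static Integer castViaException(Object value) {
        if (value == null) {
            return null;
        }
        try {
            byte[] raw = (byte[]) value; // throws ClassCastException for an Integer
            return Integer.valueOf(new String(raw, StandardCharsets.UTF_8));
        } catch (ClassCastException e) {
            // An exception is created and unwound for every row that
            // already has the right type.
            return castByRealType(value);
        }
    }

    // Alternative: inspect the runtime type first, so values that already
    // have the target type never go through the exception path.
    static Integer castViaTypeCheck(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof byte[]) {
            return Integer.valueOf(new String((byte[]) value, StandardCharsets.UTF_8));
        }
        return castByRealType(value);
    }

    // Conversion based on the value's actual (non-null) runtime type.
    static Integer castByRealType(Object value) {
        if (value instanceof Integer) {
            return (Integer) value;
        }
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        throw new IllegalArgumentException(
                "Unsupported type: " + value.getClass().getName());
    }

    public static void main(String[] args) {
        Object alreadyInteger = 42;
        Object asBytes = "42".getBytes(StandardCharsets.UTF_8);
        // Both paths return the same values; only the cost per row differs.
        System.out.println(castViaException(alreadyInteger));
        System.out.println(castViaTypeCheck(alreadyInteger));
        System.out.println(castViaException(asBytes));
        System.out.println(castViaTypeCheck(asBytes));
    }
}
```

Both methods return the same values. The difference is that castViaException pays for constructing and catching an exception on every value that already has the target type, which is the per-row cost the description is concerned about.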
-- This message was sent by Atlassian Jira (v8.20.10#820010)