[ 
https://issues.apache.org/jira/browse/PIG-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18018527#comment-18018527
 ] 

Rohini Palaniswamy commented on PIG-5474:
-----------------------------------------

Attached [^PIG-5474-2.patch]. It changes the unit tests to sort the output 
before comparing, because they fail in Spark mode, where the row order differs.
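
For readers unfamiliar with this kind of test fix, here is a minimal sketch of an order-insensitive comparison. It is not the code from the patch: the class name SortedCompareSketch, the helper sameRowsIgnoringOrder, and the sample rows are all hypothetical, and rows are represented as plain strings for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Pig test utilities.
final class SortedCompareSketch {

    // Returns true when both lists contain the same rows (including duplicates),
    // regardless of the order in which the execution engine emitted them.
    static boolean sameRowsIgnoringOrder(List<String> expected, List<String> actual) {
        // Copy first so the caller's lists are not reordered.
        List<String> e = new ArrayList<>(expected);
        List<String> a = new ArrayList<>(actual);
        Collections.sort(e);
        Collections.sort(a);
        return e.equals(a);
    }

    public static void main(String[] args) {
        // Made-up rows: same content, different order.
        List<String> expected = Arrays.asList("(a,1)", "(b,2)", "(c,3)");
        List<String> otherOrder = Arrays.asList("(c,3)", "(a,1)", "(b,2)");

        System.out.println(expected.equals(otherOrder));                  // false
        System.out.println(sameRowsIgnoringOrder(expected, otherOrder));  // true
    }
}
```

Sorting copies of both lists keeps duplicate rows significant, which a set-based comparison would not.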

> Casting error or empty output when as clause is used on a bag with schema not 
> defined
> -------------------------------------------------------------------------------------
>
>                 Key: PIG-5474
>                 URL: https://issues.apache.org/jira/browse/PIG-5474
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Major
>             Fix For: 0.18.0
>
>         Attachments: PIG-5474-1.patch, PIG-5474-2.patch
>
>
> Ran into an issue where a script that worked with an older version of Pig 
> failed on Pig 0.17. It was a regression caused by PIG-2315 adding additional 
> POCast operators when there is an AS clause.
> A script with the lines below 
> {code}
> G = FOREACH F GENERATE a0, 
> org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as 
> bag2;
> H = FOREACH G GENERATE a0, FLATTEN(bag2) as (x1:chararray, x2:double, 
> x3:chararray, x4:long);
> {code}
> ran into this error
> {code}
> ERROR 1075: Received a bytearray from the UDF or Union from two different 
> Loaders. Cannot determine how to convert the bytearray to string for 
> [x1[-1,-1]]
>         at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNextString(POCast.java:1125)
> {code}
> It was not easily reproducible with a simple script and required a sequence 
> of steps for the CastLineageSetter to not be able to set the LoadFunc that 
> will provide the caster on POCast - 
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java#L108-L112
>  causing the error in 
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1123-L1125.
>  
> When the user rewrote the script to move the AS clause to the UDF statement 
> instead of after FLATTEN, the script passed. But all the bags produced were 
> empty because casting of the bag ( 
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1824-L1828
>  ) swallowed the underlying exception and returned null, unlike the primitive 
> fields, which throw an error.
>  {code}
> G = FOREACH F GENERATE a0, 
> org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as 
> bag2:{t:(a1:chararray, a2:double, a3:chararray, a4:long)};
> H = FOREACH G GENERATE a0, FLATTEN(bag2);
> {code}
> Also realized that this additional POCast has made the processing inefficient 
> in general, as it tries to cast everything from bytearray to the type 
> specified in the AS clause. If the UDF returned the correct type, let's say 
> Integer, the code will still try to typecast to DataByteArray, hit a 
> ClassCastException, and only then cast based on the realType  
> https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L503.
>  This is going to add a lot of overhead to processing when there are millions 
> of rows.
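
The overhead described in the last paragraph of the description can be illustrated with a small sketch. This is neither the POCast code nor the fix in the attached patches: CastSketch, Bytes (a stand-in for DataByteArray), and both method names are hypothetical. It only contrasts an exception-driven cast, which pays for a thrown ClassCastException on every value that is already typed, with a type check done up front.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Pig's POCast implementation.
final class CastSketch {

    // Stand-in for Pig's DataByteArray, simplified for this sketch.
    static final class Bytes {
        final byte[] data;
        Bytes(byte[] data) { this.data = data; }
    }

    // Exception-driven pattern: assume the value is a bytearray, and fall back
    // to the real type only after a ClassCastException has been thrown.
    static String toStringViaException(Object value) {
        if (value == null) {
            return null;
        }
        try {
            Bytes b = (Bytes) value; // throws for Integer, String, ...
            return new String(b.data, StandardCharsets.UTF_8);
        } catch (ClassCastException e) {
            return String.valueOf(value); // fallback based on the real type
        }
    }

    // Alternative pattern: inspect the runtime type first, so values that are
    // already typed never go through exception handling.
    static String toStringViaTypeCheck(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Bytes) {
            return new String(((Bytes) value).data, StandardCharsets.UTF_8);
        }
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        Object typed = Integer.valueOf(42);
        Object raw = new Bytes("42".getBytes(StandardCharsets.UTF_8));
        System.out.println(toStringViaException(typed)); // 42, after a thrown exception
        System.out.println(toStringViaTypeCheck(typed)); // 42, no exception involved
        System.out.println(toStringViaTypeCheck(raw));   // 42, decoded from bytes
    }
}
```

Both methods return the same results; the difference is that the first one constructs and catches an exception for every non-bytearray value, which is the per-row cost the description is concerned about.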



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
