[ 
https://issues.apache.org/jira/browse/PIG-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5474:
------------------------------------
    Description: 
Ran into an issue where a script that worked with an older version of Pig 
failed on Pig 0.17. It was a regression caused by PIG-2315 adding additional 
POCast operators when there is an AS clause.

A script with the lines below 
{code}
G = FOREACH F GENERATE a0, 
org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as bag2;
H = FOREACH G GENERATE a0, FLATTEN(bag2) as (x1:chararray, x2:double, 
x3:chararray, x4:long);
{code}

ran into this error
{code}
ERROR 1075: Received a bytearray from the UDF or Union from two different 
Loaders. Cannot determine how to convert the bytearray to string for [x1[-1,-1]]
        at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNextString(POCast.java:1125)
{code}

It was not easily reproducible with a simple script; it required a sequence of 
steps that left the CastLineageSetter unable to set the LoadFunc that provides 
the caster on POCast ( 
https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java#L108-L112
 ), which causes the error in 
https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1123-L1125.
 

The user tried rewriting the script by moving the AS clause to the UDF statement 
instead of after FLATTEN, which made the script pass. But all the bags produced 
were empty because the cast of the bag ( 
https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L1824-L1828
 ) swallowed the underlying exception and returned null, unlike the primitive 
fields, which throw an error.

 {code}
G = FOREACH F GENERATE a0, 
org.apache.pig.test.TestFlatten$UDFWithNoOutputSchema(a0, a1, b1, bag1) as 
bag2:{t:(a1:chararray, a2:double, a3:chararray, a4:long)};
H = FOREACH G GENERATE a0, FLATTEN(bag2);
{code}
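
To make the difference in failure behaviour concrete, here is a minimal, self-contained Java sketch. It is not Pig's POCast code: the class SwallowedCastSketch, the CastFailure exception, and the casterAvailable flag are hypothetical stand-ins that only mirror the pattern described above, where a primitive cast surfaces the failure while a bag cast swallows it and returns null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SwallowedCastSketch {
    /** Hypothetical failure type standing in for a cast error; not a Pig class. */
    static final class CastFailure extends Exception {
        CastFailure(String message) {
            super(message);
        }
    }

    /** Converts one field; fails when no converter (caster) is available. */
    static String convertField(Object raw, boolean casterAvailable) throws CastFailure {
        if (raw instanceof String) {
            return (String) raw;
        }
        if (!casterAvailable) {
            throw new CastFailure("cannot determine how to convert " + raw);
        }
        return String.valueOf(raw);
    }

    /** Primitive-style cast: the failure propagates to the caller. */
    static String castPrimitive(Object raw, boolean casterAvailable) throws CastFailure {
        return convertField(raw, casterAvailable);
    }

    /** Bag-style cast: the failure is swallowed and null is returned instead. */
    static List<String> castBag(List<Object> rawBag, boolean casterAvailable) {
        try {
            List<String> converted = new ArrayList<>();
            for (Object raw : rawBag) {
                converted.add(convertField(raw, casterAvailable));
            }
            return converted;
        } catch (CastFailure e) {
            // The caller cannot tell a failed cast from a genuinely null bag.
            return null;
        }
    }

    public static void main(String[] args) {
        List<Object> bag = Arrays.<Object>asList(1, 2);
        System.out.println("bag cast result: " + castBag(bag, false));
        try {
            castPrimitive(1, false);
        } catch (CastFailure e) {
            System.out.println("primitive cast failed: " + e.getMessage());
        }
    }
}
```

In this sketch the script "passes" on the bag path only because the error is hidden; the data is still lost.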

Also realized that this additional POCast has made processing less efficient 
in general, as it tries to cast everything from bytearray to the type specified 
in the AS clause. If the UDF returned the correct type, let's say Integer, the 
code will still try to typecast to DataByteArray, hit a ClassCastException, and 
only then cast based on the realType  
https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POCast.java#L503.
 This is going to add a lot of overhead to processing when there are millions 
of rows.
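
The overhead pattern described above can be sketched in plain Java. This is an illustration under stated assumptions, not the POCast implementation: Bytes is a hypothetical stand-in for DataByteArray, and both method names are invented for the example. The first method assumes a bytearray and relies on a ClassCastException to discover the real type; the second checks the runtime type first, so a value that already has the right type never pays for an exception.

```java
import java.nio.charset.StandardCharsets;

final class CastOverheadSketch {
    /** Hypothetical stand-in for Pig's DataByteArray; not the real class. */
    static final class Bytes {
        final byte[] data;

        Bytes(byte[] data) {
            this.data = data;
        }
    }

    /** Exception-driven: assume a bytearray, fall back when the cast fails. */
    static Integer castViaException(Object value) {
        if (value == null) {
            return null;
        }
        try {
            Bytes raw = (Bytes) value; // throws ClassCastException for an Integer
            return Integer.valueOf(new String(raw.data, StandardCharsets.UTF_8));
        } catch (ClassCastException e) {
            // One exception is created and unwound per row of the wrong shape.
            return (Integer) value;
        }
    }

    /** Type check first: values that already have the right type pass through. */
    static Integer castViaTypeCheck(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Integer) {
            return (Integer) value;
        }
        Bytes raw = (Bytes) value;
        return Integer.valueOf(new String(raw.data, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Object alreadyTyped = Integer.valueOf(42);
        Object rawBytes = new Bytes("7".getBytes(StandardCharsets.UTF_8));
        System.out.println(castViaException(alreadyTyped));
        System.out.println(castViaTypeCheck(alreadyTyped));
        System.out.println(castViaException(rawBytes));
        System.out.println(castViaTypeCheck(rawBytes));
    }
}
```

Both methods return the same values; the difference is only in how much work is done per row when the UDF already returns the declared type.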

> Casting error or empty output when as clause is used on a bag with schema not 
> defined
> -------------------------------------------------------------------------------------
>
>                 Key: PIG-5474
>                 URL: https://issues.apache.org/jira/browse/PIG-5474
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>            Priority: Major
>             Fix For: 0.18.0
>
>         Attachments: PIG-5474-1.patch
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)