[
https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800609#action_12800609
]
Ankur commented on PIG-1191:
----------------------------
Listed below are the identified cases.
CASE 1: LOAD -> FILTER -> FOREACH -> LIMIT -> STORE
===================================================
SCRIPT
-----------
sds = LOAD '/my/data/location'
USING my.org.MyMapLoader()
AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries = FILTER sds BY mapFields#'page_params'#'query' is NOT NULL;
queries_rand = FOREACH queries
GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS
query_string;
queries_limit = LIMIT queries_rand 100;
STORE queries_limit INTO 'out';
RESULT
------------
FAILS in reduce stage with the following exception
org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:423)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:391)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:371)
CASE 2: LOAD -> FOREACH -> FILTER -> LIMIT -> STORE
===================================================
Note that the order of FILTER and FOREACH is reversed relative to CASE 1.
SCRIPT
-----------
sds = LOAD '/my/data/location'
USING my.org.MyMapLoader()
AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries_rand = FOREACH sds
GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS
query_string;
queries = FILTER queries_rand BY query_string IS NOT null;
queries_limit = LIMIT queries 100;
STORE queries_limit INTO 'out';
RESULT
-----------
SUCCESS - Results are correctly stored. So if the projection is done before
the FILTER, the POCast operator receives the LoadFunc and everything is cool.
CASE 3: LOAD -> FOREACH -> FOREACH -> FILTER -> LIMIT -> STORE
==============================================================
SCRIPT
-----------
sds = LOAD '/my/data/location'
USING my.org.MyMapLoader()
AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE
(map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
GENERATE (CHARARRAY) (params#'query') AS query_string;
queries_filtered = FILTER queries
BY query_string IS NOT null;
queries_limit = LIMIT queries_filtered 100;
STORE queries_limit INTO 'out';
RESULT
-----------
FAILS in map stage. It looks like the second FOREACH did not get the loadFunc
and bailed out with the following stack trace
org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
    at
CASE 4: LOAD -> FOREACH -> FOREACH -> LIMIT -> STORE
====================================================
SCRIPT
-----------
sds = LOAD '/my/data/location'
USING my.org.MyMapLoader()
AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE
(map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
GENERATE (CHARARRAY) (params#'query') AS query_string;
queries_limit = LIMIT queries 100;
STORE queries_limit INTO 'out';
RESULT
-----------
SUCCESS. The two FOREACH seem to be getting the loadFunc.
CASE 5: LOAD -> FOREACH -> FOREACH -> FOREACH -> LIMIT -> STORE
================================================================
SCRIPT
-----------
sds = LOAD '/my/data/location'
USING my.org.MyMapLoader()
AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE
(map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
GENERATE (CHARARRAY) (params#'query') AS query_string;
rand_queries = FOREACH queries GENERATE query_string as query;
queries_limit = LIMIT rand_queries 100;
STORE rand_queries INTO 'out';
RESULT
-----------
FAILS in map stage. Again the poor second FOREACH seems to be bailing out with
the following stack trace
org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
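All the failing cases above die in a cast that still sees a bytearray and has no load function to convert it. To make that suspected failure mode concrete, here is a minimal Java sketch. It is a toy model and not Pig's POCast code: the names CastSketch, ToyCast and BytesConverter are made up for illustration, and the null converter is only an assumption about what the failing plans look like.

```java
import java.nio.charset.StandardCharsets;

// Toy model only: none of these types exist in Pig.
class CastSketch {

    /** Hypothetical stand-in for the loader's bytes-to-string conversion. */
    interface BytesConverter {
        String bytesToCharArray(byte[] bytes);
    }

    /** Toy cast operator that needs a converter only when the value is still raw bytes. */
    static final class ToyCast {
        // May be null, modelling a plan in which the cast was never given a loader.
        private final BytesConverter converter;

        ToyCast(BytesConverter converter) {
            this.converter = converter;
        }

        String toCharArray(Object value) {
            if (value == null) {
                return null;
            }
            if (value instanceof String) {
                return (String) value; // already typed, no loader needed
            }
            if (value instanceof byte[]) {
                if (converter == null) {
                    // Same wording as the error in the traces above, raised by the toy.
                    throw new IllegalStateException(
                        "Received a bytearray from the UDF. Cannot determine "
                        + "how to convert the bytearray to string.");
                }
                return converter.bytesToCharArray((byte[]) value);
            }
            return String.valueOf(value);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "some query".getBytes(StandardCharsets.UTF_8);

        // Cast wired to a converter: the conversion goes through.
        ToyCast wired = new ToyCast(b -> new String(b, StandardCharsets.UTF_8));
        System.out.println("wired cast: " + wired.toCharArray(raw));

        // Cast without a converter: the same input fails.
        ToyCast unwired = new ToyCast(null);
        try {
            unwired.toCharArray(raw);
        } catch (IllegalStateException e) {
            System.out.println("unwired cast failed: " + e.getMessage());
        }
    }
}
```

In this model the same input either converts or fails depending only on whether the cast was handed a converter when it was built, which is the kind of behaviour the cases above suggest.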
> POCast throws exception for certain sequences of LOAD, FILTER, FOREACH
> ----------------------------------------------------------------------
>
> Key: PIG-1191
> URL: https://issues.apache.org/jira/browse/PIG-1191
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Ankur
> Priority: Blocker
> Attachments: PIG-1191-1.patch
>
>
> When using a custom load/store function that returns complex data (map of
> maps, list of maps), a Pig script throws an exception of the following form
> for certain sequences of LOAD, FILTER and FOREACH:
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to <actual-type>
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
> ...
> Looking through the code of POCast, the operator was apparently unable to
> find the right load function for the conversion and consequently bailed out
> with the exception, failing the entire Pig script.