Re: [jira] Created: (PIG-723) Pig generates incorrect schema for generated bags after FOREACH.

Mridul Muralidharan Wed, 18 Mar 2009 04:25:40 -0700

Hi,


  I think the schema generated is right.

It might be a problem with declaring complex types (bags only?) in terms of scalars in the script which you might be hitting (iirc there is a bug for it already).

From what I recall, I have also faced issues with declarations like "foreach A generate $0 as f1, {(1.0, 'str')} as f1;" - pig seems unable to handle them currently.


If this is the issue, the rest of this mail might help you.

The workaround we have is a dummy udf which returns this value; we use that udf instead of declaring the bag inline.



An example snippet which we use:


--- A custom udf which generates a bag with tuples (0, '')
define GENERATE_EMPTY_BAG1 myudf.GenerateEmptyBag('long, chararray');
--- A custom udf which generates a bag with tuples (0, '', 0.0f)
define GENERATE_EMPTY_BAG2 myudf.GenerateEmptyBag('long, chararray, float');

grp_op = COGROUP inp1 by eid, inp2 by eid6;

res = FOREACH grp_op {
    GENERATE
        FLATTEN((COUNT(inp1) != 0 ? inp1 : GENERATE_EMPTY_BAG1(0L)))
            AS (eid:long, query:chararray),
        FLATTEN((COUNT(inp2) != 0 ? inp2 : GENERATE_EMPTY_BAG2(0L)))
            AS (eid6:long, candidate_url6:chararray, rank:float);
};

---

The idea above is to have some default values when the bags are empty (that is, no data for a particular eid/eid6). Note the exact syntax in the foreach (it defends against parser bugs in pig), and the use of the udf to generate the bag.
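
For illustration only, here is a rough sketch of the core logic such a udf might contain: turning a type spec like 'long, chararray' into one tuple of default values wrapped in a bag. The class and method names are hypothetical, the set of supported types is an assumption, and plain Java lists stand in for Pig's Tuple and DataBag so the sketch runs without Pig on the classpath. A real udf would extend Pig's EvalFunc and build Pig data types instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the body of a GenerateEmptyBag-style udf.
// Lists are used in place of Pig's Tuple/DataBag to keep this self-contained.
class EmptyBagSketch {

    // Returns the placeholder value for a single Pig type name.
    static Object defaultFor(String typeName) {
        switch (typeName.trim().toLowerCase()) {
            case "int":       return 0;
            case "long":      return 0L;
            case "float":     return 0.0f;
            case "double":    return 0.0d;
            case "chararray": return "";
            default:
                throw new IllegalArgumentException("unsupported type: " + typeName);
        }
    }

    // Builds one tuple of defaults from a spec such as "long, chararray".
    static List<Object> defaultTuple(String typeSpec) {
        List<Object> fields = new ArrayList<>();
        for (String typeName : typeSpec.split(",")) {
            fields.add(defaultFor(typeName));
        }
        return fields;
    }

    // Wraps the single default tuple in a bag (a list of tuples here).
    static List<List<Object>> defaultBag(String typeSpec) {
        List<List<Object>> bag = new ArrayList<>();
        bag.add(defaultTuple(typeSpec));
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(defaultBag("long, chararray"));
        System.out.println(defaultBag("long, chararray, float"));
    }
}
```

Because the bag always holds exactly one tuple, FLATTEN emits one row of defaults for an eid that has no data on that side of the COGROUP.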




Regards,
Mridul

Dhruv M (JIRA) wrote:

Pig generates incorrect schema for generated bags after FOREACH.
----------------------------------------------------------------

                 Key: PIG-723
                 URL: https://issues.apache.org/jira/browse/PIG-723
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.1.0
         Environment: Linux
$pig --version
Apache Pig version 0.1.0-dev (r750430)
compiled Mar 07 2009, 09:20:13

            Reporter: Dhruv M
            Priority: Critical
grunt> rf_src = LOAD 'rf_test.txt' USING PigStorage(',') AS (lhs:chararray, rhs:chararray, r:float, p:float, c:float);
grunt> rf_grouped = GROUP rf_src BY rhs;
grunt> lhs_grouped = FOREACH rf_grouped GENERATE group as rhs, rf_src.(lhs, r) as lhs, MAX(rf_src.p) as p, MAX(rf_src.c) AS c;
grunt> describe lhs_grouped;
lhs_grouped: {rhs: chararray,lhs: {lhs: chararray,r: float},p: float,c: float}

I think it should be:
lhs_grouped: {rhs: chararray,lhs: {(lhs: chararray,r: float)},p: float,c: float}

Because of this, we are not able to perform UNION on 2 sets because union on 
incompatible schemas is causing a complete loss of schema information, making 
further processing impossible.

This is what we want to UNION with:
grunt> asrc = LOAD 'atest.txt' USING PigStorage(',') AS (rhs:chararray, a:int);
grunt> aa = FOREACH asrc GENERATE rhs, (bag{tuple(chararray,float)}) null as lhs, -10F as p, -10F as c;
grunt> describe aa;
aa: {rhs: chararray,lhs: {(chararray,float)},p: float,c: float}

If there is something wrong with what I am trying to do, please let me know.
