[ 
https://issues.apache.org/jira/browse/PIG-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egil Sorensen updated PIG-3320:
-------------------------------

    Description: 
Somewhat different use case than PIG-3318:

When loading with AvroStorage, I supplied a loader (reader) schema that, 
relative to the schema in the Avro file, has an extra field without a default. 
I expected to see an extra empty column, but the resulting schema is the same 
as in the Avro file, without the extra column.

E.g. see the e2e style test, which fails on this:

{code}
                        {
                        'num' => 2,
                        # storing using writer schema
                        # loading using reader schema with extra field that has no default
                        'notmq' => 1,
                        'pig' => q\
a = load ':INPATH:/types/numbers.txt' using PigStorage(':') as (intnum1000: 
int,id: int,intnum5: int,intnum100: int,intnum: int,longnum: long,floatnum: 
float,doublenum: double);

-- Store Avro file w. schema
b1 = foreach a generate id, intnum5;
c1 = filter b1 by 10 <= id and id < 20;
describe c1;
dump c1;
store c1 into ':OUTPATH:.intermediate_1' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage('
{
   "schema" : {  
      "name" : "schema_writing",
      "type" : "record",
      "fields" : [
         {  
            "name" : "id",
            "type" : [
               "null",
               "int"
            ]
         },
         {  
            "name" : "intnum5",
            "type" : [
               "null",
               "int"
            ]
         }
      ]
   }
}
');

exec;


-- Read back what was stored with Avro adding extra field to reader schema
u = load ':OUTPATH:.intermediate_1' USING 
org.apache.pig.piggybank.storage.avro.AvroStorage('
{
   "debug" : 5,
   "schema" : {  
      "name" : "schema_reading",
      "type" : "record",
      "fields" : [
         {  
            "name" : "id",
            "type" : [
               "null",
               "int"
            ]
         },
         {  
            "name" : "intnum5",
            "type" : [
               "null",
               "string"
            ]
         },
         {
            "name" : "intnum100",
            "type" : [
               "null",
               "int"
            ]
         }
      ]
   }
}
');
describe u;
dump u;
store u into ':OUTPATH:';
\,

                        'verify_pig_script' => q\
a = load ':INPATH:/types/numbers.txt' using PigStorage(':') as (intnum1000: 
int,id: int,intnum5: int,intnum100: int,intnum: int,longnum: long,floatnum: 
float,doublenum: double);
b = filter a by (10 <= id and id < 20);
c = foreach b generate id, intnum5, '';
store c into ':OUTPATH:';
\,
                        },
{code}
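
To make the expectation concrete, here is a minimal Python sketch of the projection that the verify script encodes: each stored record is read through the reader's field list, and a field that is absent from the stored data comes back empty. This only models the expected output. It does not use AvroStorage or the Avro library, the helper name `project_onto_reader` and the sample rows are hypothetical, and the int-to-string change of intnum5 in the reader schema is ignored.

```python
# Hypothetical model of the expected load result; this is not AvroStorage code.
WRITER_FIELDS = ["id", "intnum5"]               # fields in the stored Avro file
READER_FIELDS = ["id", "intnum5", "intnum100"]  # reader schema adds intnum100


def project_onto_reader(record, reader_fields):
    """Return a tuple in reader-field order; absent fields become None."""
    return tuple(record.get(name) for name in reader_fields)


# Sample rows are made up; they only respect the filter 10 <= id < 20.
stored = [dict(zip(WRITER_FIELDS, row)) for row in [(10, 3), (15, 1)]]
expected = [project_onto_reader(rec, READER_FIELDS) for rec in stored]

for row in expected:
    print(row)
```

Each printed tuple has three positions, the last one empty, which corresponds to the `generate id, intnum5, ''` line in the verify script.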





  was:
Piggybank AvroStorage: when merging multiple schemas in which default values 
have been specified in the Avro schema, AvroStorage puts nulls in the merged 
data set instead of the defaults. 

==> Employee3.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
        {"name" : "name", "type" : "string", "default" : "NU"},
        {"name" : "age", "type" : "int", "default" : 0 },
        {"name" : "dept", "type": "string", "default" : "DU"} ] }

==> Employee4.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
        {"name" : "name", "type" : "string", "default" : "NU"},
        {"name" : "age", "type" : "int", "default" : 0},
        {"name" : "dept", "type": "string", "default" : "DU"},
        {"name" : "office", "type": "string", "default" : "OU"} ] }

==> Employee6.avro <==
{
"type" : "record",
"name" : "employee",
"fields":[
        {"name" : "name", "type" : "string", "default" : "NU"},
        {"name" : "lastname", "type": "string", "default" : "LNU"},
        {"name" : "age", "type" : "int","default" : 0},
        {"name" : "salary", "type": "int", "default" : 0},
        {"name" : "dept", "type": "string","default" : "DU"},
        {"name" : "office", "type": "string","default" : "OU"} ] }

The pig script:
employee = load 'employee{3,4,6}.ser' using 
org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
describe employee;
dump employee;

Output Schema:
employee: {name: chararray,age: int,dept: chararray,lastname: chararray,salary: int,office: chararray}

(Milo,30,DH,,,)
(Asmya,34,PQ,,,)
(Baljit,23,RS,,,)
(Pune,60,Astrophysics,Warriors,5466,UTA)
(Rajsathan,20,Biochemistry,Royals,1378,Stanford)
(Chennai,50,Microbiology,Superkings,7338,Hopkins)
(Mumbai,20,Applied Math,Indians,4468,UAH)
(Praj,54,RMX,,,Champaign)
(Buba,767,HD,,,Sunnyvale)
(Manku,375,MS,,,New York)
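
For comparison, a small Python sketch of what the rows might look like if the declared defaults were applied rather than nulls. It is an illustration only and not AvroStorage code: `fill_defaults` is a hypothetical helper, the defaults are copied from the Employee6 schema above, and it assumes Milo's record came from the three-field schema and Praj's from the four-field one, as the empty positions in the output suggest.

```python
# Defaults copied from the Employee6 schema; field order from the output schema.
DEFAULTS = {
    "name": "NU", "lastname": "LNU", "age": 0,
    "salary": 0, "dept": "DU", "office": "OU",
}
MERGED_ORDER = ["name", "age", "dept", "lastname", "salary", "office"]


def fill_defaults(record):
    """Hypothetical helper: use the schema default for any missing field."""
    return tuple(record.get(field, DEFAULTS[field]) for field in MERGED_ORDER)


milo = {"name": "Milo", "age": 30, "dept": "DH"}
praj = {"name": "Praj", "age": 54, "dept": "RMX", "office": "Champaign"}

print(fill_defaults(milo))  # ('Milo', 30, 'DH', 'LNU', 0, 'OU')
print(fill_defaults(praj))  # ('Praj', 54, 'RMX', 'LNU', 0, 'Champaign')
```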


Regards
Viraj

    
> AVRO: no empty field expressed when loading with AvroStorage using reader 
> schema with extra field that has no default
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-3320
>                 URL: https://issues.apache.org/jira/browse/PIG-3320
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.11.2
>            Reporter: Egil Sorensen
>            Assignee: Viraj Bhat
>              Labels: patch
>             Fix For: 0.12, 0.11.2
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
