Tilak Patidar created GOBBLIN-571:
-------------------------------------

             Summary: JsonIntermediateToParquetGroupConverter generates wrong 
parquet schema for complex types such as enums, arrays and maps
                 Key: GOBBLIN-571
                 URL: https://issues.apache.org/jira/browse/GOBBLIN-571
             Project: Apache Gobblin
          Issue Type: Bug
            Reporter: Tilak Patidar


For complex types such as arrays, maps and enums, 
JsonIntermediateToParquetGroupConverter generates the wrong schema: the 
OPTIONAL and REQUIRED attributes of the SchemaField are not set correctly.

 

Because of this, Spark throws the following error when reading Parquet files 
generated by JsonIntermediateToParquetGroupConverter:
{code:java}
Caused by: parquet.io.ParquetDecodingException: Can not read value at 0
{code}
An example of a wrong generated schema is below. Note that the field 
payload.action is marked as required:
{code:java}
message EventData {
  optional int64 id;
  optional binary type (UTF8);
  required group actor {
    optional int64 id;
    optional binary login (UTF8);
    optional binary gravatar_id (UTF8);
    optional binary url (UTF8);
    optional binary avatar_url (UTF8);
  }
  required group repo {
    optional int64 id;
    optional binary name (UTF8);
    optional binary url (UTF8);
    optional binary urlid (UTF8);
  }
  required group payload {
    optional int64 id;
    optional binary ref (UTF8);
    optional binary ref_type (UTF8);
    optional binary master_branch (UTF8);
    optional binary description (UTF8);
    optional binary pusher_type (UTF8);
    optional binary before (UTF8);
    required binary action (UTF8);
  }
  optional boolean public;
  optional binary created_at (UTF8);
  optional binary created_at_id (UTF8);
}
{code}
However, payload.action is defined in the source.schema property with 
isNullable: true:
{code:java}
[ ....
    {
    "columnName": "payload",
    "dataType": {
      "type": "record",
      "name": "payloadDetails",
      "values": [
        ....
        {
          "columnName": "action",
          "isNullable": true,
          "dataType": {
            "type": "enum",
            "name": "actionType",
            "symbols": [
              "started",
              "published",
              "opened",
              "closed",
              "created",
              "reopened",
              "added"
            ]
          }
        }
      ]
    }
  }....
]

{code}
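
To make the expected behaviour concrete, here is a minimal sketch of the 
mapping I would expect from isNullable to the Parquet repetition. This is not 
the converter's actual code: the class and method names below are hypothetical, 
and it assumes (as in the generated schema above) that an enum is written as a 
UTF8 binary field.

```java
// Hypothetical illustration only; not taken from the Gobblin code base.
final class NullableToRepetitionSketch {

    // Parquet repetition keyword derived purely from the isNullable flag.
    static String repetitionFor(boolean isNullable) {
        return isNullable ? "optional" : "required";
    }

    // Renders one UTF8 binary field in Parquet message-type syntax
    // (assumption: enums are stored as UTF8 binary, as in the schema above).
    static String utf8Field(String columnName, boolean isNullable) {
        return repetitionFor(isNullable) + " binary " + columnName + " (UTF8);";
    }

    public static void main(String[] args) {
        // payload.action is declared with isNullable: true in source.schema
        System.out.println(utf8Field("action", true));
    }
}
```

Under these assumptions payload.action would come out as 
"optional binary action (UTF8);" rather than the "required" field shown in the 
generated schema.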



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
