Tilak Patidar created GOBBLIN-571:
-------------------------------------
Summary: JsonIntermediateToParquetGroupConverter generates wrong
parquet schema for complex types such as enums, arrays and maps
Key: GOBBLIN-571
URL: https://issues.apache.org/jira/browse/GOBBLIN-571
Project: Apache Gobblin
Issue Type: Bug
Reporter: Tilak Patidar
For complex types such as arrays, maps and enums,
JsonIntermediateToParquetGroupConverter generates an incorrect schema. For
these types the OPTIONAL and REQUIRED attribute of the SchemaField is set
incorrectly; for example, a nullable field is emitted as REQUIRED.
As a result, Spark throws the following error when reading Parquet files
generated by JsonIntermediateToParquetGroupConverter:
{code:java}
Caused by: parquet.io.ParquetDecodingException: Can not read value at 0
{code}
An example of an incorrectly generated schema is shown below. Note that the
field payload.action is marked as required:
{code:java}
message EventData {
  optional int64 id;
  optional binary type (UTF8);
  required group actor {
    optional int64 id;
    optional binary login (UTF8);
    optional binary gravatar_id (UTF8);
    optional binary url (UTF8);
    optional binary avatar_url (UTF8);
  }
  required group repo {
    optional int64 id;
    optional binary name (UTF8);
    optional binary url (UTF8);
    optional binary urlid (UTF8);
  }
  required group payload {
    optional int64 id;
    optional binary ref (UTF8);
    optional binary ref_type (UTF8);
    optional binary master_branch (UTF8);
    optional binary description (UTF8);
    optional binary pusher_type (UTF8);
    optional binary before (UTF8);
    required binary action (UTF8);
  }
  optional boolean public;
  optional binary created_at (UTF8);
  optional binary created_at_id (UTF8);
}
{code}
However, the field payload.action is defined in the source.schema property
with isNullable: true, so it should have been generated as optional:
{code:java}
[ ....
  {
    "columnName": "payload",
    "dataType": {
      "type": "record",
      "name": "payloadDetails",
      "values": [
        ....
        {
          "columnName": "action",
          "isNullable": true,
          "dataType": {
            "type": "enum",
            "name": "actionType",
            "symbols": [
              "started",
              "published",
              "opened",
              "closed",
              "created",
              "reopened",
              "added"
            ]
          }
        }
      ]
    }
  }....
]
{code}
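To make the expected behaviour concrete, here is a minimal standalone sketch of
the mapping I would expect from isNullable to the Parquet repetition. It is not
the converter's actual code: the class NullabilitySketch and the helpers
repetitionFor and renderStringField are hypothetical names used only for
illustration, and the enum is assumed to be written as a UTF8 binary, as in the
schema above.

```java
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Gobblin.
final class NullabilitySketch {

    // The two repetition levels relevant to this bug.
    enum Repetition { REQUIRED, OPTIONAL }

    // Expected rule: a nullable column is OPTIONAL, otherwise REQUIRED.
    static Repetition repetitionFor(boolean isNullable) {
        return isNullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    // Renders one string-like field (assumed UTF8 binary) as a schema line.
    static String renderStringField(String columnName, boolean isNullable) {
        String repetition = repetitionFor(isNullable).name().toLowerCase(Locale.ROOT);
        return repetition + " binary " + columnName + " (UTF8);";
    }

    public static void main(String[] args) {
        // payload.action is declared with isNullable: true in source.schema.
        System.out.println(renderStringField("action", true));
        // prints "optional binary action (UTF8);"
    }
}
```

With this rule, payload.action would come out as optional rather than required,
which matches the isNullable: true declaration in source.schema.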