[
https://issues.apache.org/jira/browse/GOBBLIN-571?focusedWorklogId=569932&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-569932
]
ASF GitHub Bot logged work on GOBBLIN-571:
------------------------------------------
Author: ASF GitHub Bot
Created on: 22/Mar/21 18:52
Start Date: 22/Mar/21 18:52
Worklog Time Spent: 10m
Work Description: vikrambohra closed pull request #3248:
URL: https://github.com/apache/gobblin/pull/3248
Issue Time Tracking
-------------------
Worklog Id: (was: 569932)
Time Spent: 40m (was: 0.5h)
> JsonIntermediateToParquetGroupConverter generates wrong parquet schema for
> complex types such as enums, arrays and maps
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: GOBBLIN-571
> URL: https://issues.apache.org/jira/browse/GOBBLIN-571
> Project: Apache Gobblin
> Issue Type: Bug
> Reporter: Tilak Patidar
> Assignee: Shirshanka Das
> Priority: Critical
> Fix For: 0.15.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> For complex types such as arrays, maps and enums,
> JsonIntermediateToParquetGroupConverter generates an incorrect Parquet
> schema: the OPTIONAL or REQUIRED attribute of the SchemaField is set
> incorrectly for these types.
>
> Because of this, Spark throws the following error when reading Parquet files
> generated by JsonIntermediateToParquetGroupConverter:
> {code:java}
> Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 {code}
> An example of an incorrectly generated schema is shown below. Note that the
> field payload.action is marked as required:
> {code:java}
> message EventData {
>   optional int64 id;
>   optional binary type (UTF8);
>   required group actor {
>     optional int64 id;
>     optional binary login (UTF8);
>     optional binary gravatar_id (UTF8);
>     optional binary url (UTF8);
>     optional binary avatar_url (UTF8);
>   }
>   required group repo {
>     optional int64 id;
>     optional binary name (UTF8);
>     optional binary url (UTF8);
>     optional binary urlid (UTF8);
>   }
>   required group payload {
>     optional int64 id;
>     optional binary ref (UTF8);
>     optional binary ref_type (UTF8);
>     optional binary master_branch (UTF8);
>     optional binary description (UTF8);
>     optional binary pusher_type (UTF8);
>     optional binary before (UTF8);
>     required binary action (UTF8);
>   }
>   optional boolean public;
>   optional binary created_at (UTF8);
>   optional binary created_at_id (UTF8);
> }
> {code}
> However, the payload.action field is defined in the source.schema property
> with isNullable: true:
> {code:java}
> [ ....
>   {
>     "columnName": "payload",
>     "dataType": {
>       "type": "record",
>       "name": "payloadDetails",
>       "values": [
>         ....
>         {
>           "columnName": "action",
>           "isNullable": true,
>           "dataType": {
>             "type": "enum",
>             "name": "actionType",
>             "symbols": [
>               "started",
>               "published",
>               "opened",
>               "closed",
>               "created",
>               "reopened",
>               "added"
>             ]
>           }
>         }
>       ]
>     }
>   }....
> ]
> {code}
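
The expected behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use Gobblin or Parquet classes; the names RepetitionSketch, Repetition, repetitionFor and renderField are hypothetical and exist only for this illustration. It assumes that a field with isNullable: true should map to an optional Parquet field and any other field to a required one, and that an enum is written as binary (UTF8), as in the schema quoted above.

```java
import java.util.Locale;

// Hypothetical illustration only; this is not the Gobblin converter code.
public class RepetitionSketch {

    // Stand-in for the Parquet repetition levels discussed in the issue.
    enum Repetition { OPTIONAL, REQUIRED }

    // Assumption: a nullable column becomes OPTIONAL, anything else REQUIRED.
    static Repetition repetitionFor(boolean isNullable) {
        return isNullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    // Renders one primitive field in the textual schema form used in the issue.
    static String renderField(String columnName, String primitive,
                              String annotation, boolean isNullable) {
        String repetition = repetitionFor(isNullable).name().toLowerCase(Locale.ROOT);
        return repetition + " " + primitive + " " + columnName + " (" + annotation + ");";
    }

    public static void main(String[] args) {
        // payload.action is declared with isNullable: true in source.schema.
        System.out.println(renderField("action", "binary", "UTF8", true));
        // prints "optional binary action (UTF8);"
    }
}
```

With this mapping, payload.action would be emitted as optional rather than required, which matches the isNullable: true declaration in source.schema.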