[ 
https://issues.apache.org/jira/browse/PARQUET-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183015#comment-17183015
 ] 

ASF GitHub Bot commented on PARQUET-1455:
-----------------------------------------

Fokko commented on pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#issuecomment-678960902


   @qinghui-xu The CI is experiencing some connectivity issues. Could you 
rebase against master to retrigger the build?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf
> --------------------------------------------------------------------
>
>                 Key: PARQUET-1455
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1455
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Qinghui Xu
>            Assignee: Qinghui Xu
>            Priority: Major
>              Labels: pull-request-available
>
> Background - 
> In protobuf, an enum behaves more like an integer than a string, and is 
> encoded as an integer on the wire.
> Each enum value is associated with a number (integer), and people can set 
> an enum field using a number directly, regardless of whether the number is 
> associated with an enum value or not. When an enum field is set with a number 
> that does not match any enum value defined in the schema, reading the field 
> through the protobuf reflection API (as parquet-protobuf does) returns a 
> label "UNKNOWN_ENUM_<enumName>_<number>" generated by protobuf reflection. 
> Thus parquet-protobuf will write the string 
> "UNKNOWN_ENUM_<enumName>_<number>" into the enum column whenever its 
> protobuf schema does not recognize the number.
>  
> Problem -
> There are two cases of unknown enum values when using parquet-protobuf:
>  1. The protobuf message already contains an unknown enum value when we 
> write it to parquet (sometimes people manipulate enums using numbers), so 
> the label "UNKNOWN_ENUM_*" is written as a string in parquet. When we read 
> it back into protobuf, we find this "true" unknown value.
>  2. The protobuf message contains a valid value when written to parquet, but 
> the reader uses an outdated proto schema that is missing some enum values. 
> So the enum values absent from the old schema are "unknown" to the reader.
> The current behavior of the parquet-proto reader is to reject both cases with 
> a runtime exception. This does not make sense in case 1: the write path 
> respects protobuf enum behavior while the read path does not. Case 2 should 
> also be handled when the protobuf user is interested in the number rather 
> than the label.
>  
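
The label round trip described above can be illustrated with a minimal,
self-contained sketch. It does not use the protobuf or parquet-protobuf APIs
and is not the change proposed in the pull request. The class
UnknownEnumSketch, its methods, and the map-based "schema" are hypothetical
names introduced only for illustration, and the label shape is assumed to be
exactly the one quoted in the issue text.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of protobuf or parquet-protobuf. */
final class UnknownEnumSketch {
    // Label shape as quoted in the issue: UNKNOWN_ENUM_<enumName>_<number> (assumed).
    private static final String PREFIX = "UNKNOWN_ENUM_";

    private UnknownEnumSketch() {}

    /** Write side: map a number to its label, or to a generated "unknown" label. */
    static String labelFor(String enumName, Map<Integer, String> schema, int number) {
        String known = schema.get(number);
        return known != null ? known : PREFIX + enumName + "_" + number;
    }

    /** Read side: recover the number from a label instead of throwing. */
    static OptionalInt numberFor(String enumName, Map<Integer, String> schema, String label) {
        // Labels defined in the reader's schema resolve directly.
        for (Map.Entry<Integer, String> e : schema.entrySet()) {
            if (e.getValue().equals(label)) {
                return OptionalInt.of(e.getKey());
            }
        }
        // Case 1: a generated "unknown" label still carries the number.
        Pattern p = Pattern.compile(Pattern.quote(PREFIX + enumName + "_") + "(-?\\d+)");
        Matcher m = p.matcher(label);
        if (m.matches()) {
            try {
                return OptionalInt.of(Integer.parseInt(m.group(1)));
            } catch (NumberFormatException tooLarge) {
                return OptionalInt.empty();
            }
        }
        // Case 2: a valid label missing from the reader's schema carries no number.
        return OptionalInt.empty();
    }
}
```

With a schema of {0: "RED", 1: "GREEN"} for an enum named Color, labelFor
returns "UNKNOWN_ENUM_Color_7" for the number 7, and numberFor recovers 7 from
that label (case 1). A label such as "BLUE" that the reader's schema lacks
(case 2) yields an empty result in this sketch, because the string alone does
not carry the number.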



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
