[ 
https://issues.apache.org/jira/browse/PARQUET-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1455.
---------------------------------------
    Resolution: Fixed

> [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf
> --------------------------------------------------------------------
>
>                 Key: PARQUET-1455
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1455
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Qinghui Xu
>            Assignee: Qinghui Xu
>            Priority: Major
>              Labels: pull-request-available
>
> Background -
> In protobuf, an enum behaves more like an integer than a string, and it is
> encoded as an integer on the wire.
> Each enum value is associated with a number (integer), and people can set
> an enum field using a number directly, regardless of whether that number
> is associated with an enum value or not. When an enum field is set with a
> number that does not match any enum value defined in the schema, reading
> the field through the protobuf reflection API (as parquet-protobuf does)
> returns a label "UNKNOWN_ENUM_<enumName>_<number>" generated by protobuf
> reflection. Thus parquet-protobuf writes the string
> "UNKNOWN_ENUM_<enumName>_<number>" into the enum column whenever its
> protobuf schema does not recognize the number.
>  
> Problem -
> There are two cases of unknown enum values when using parquet-protobuf:
>  1. The protobuf message already contains an unknown enum value when we
> write it to parquet (sometimes people manipulate enums using numbers), so
> a label "UNKNOWN_ENUM_*" is written as a string in parquet. When we read
> it back into protobuf, we find this "true" unknown value.
>  2. The protobuf message contains a valid value when written to parquet,
> but the reader uses an outdated proto schema that is missing some enum
> values. So the not-in-old-schema enum values are "unknown" to the reader.
> The current behavior of the parquet-proto reader is to reject both cases
> with a runtime exception. This does not make sense in case 1: the write
> path respects protobuf enum behavior while the read path does not. And
> case 2 should be handled when the protobuf user is interested in the
> number rather than the label.
>  
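
To make the two cases in the description concrete, here is a minimal sketch
of how a reader could map a label stored in Parquet back to an enum number.
It is not the fix applied for PARQUET-1455 and does not use parquet-protobuf
or protobuf classes. The class EnumLabelResolver, the method resolve, and the
plain Map standing in for the reader's enum schema are all hypothetical, and
the label shape is assumed from the issue text; the label produced by a given
protobuf version may differ.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of parquet-protobuf.
final class EnumLabelResolver {
    // Assumed label shape, taken from the issue text:
    // UNKNOWN_ENUM_<enumName>_<number>. Only the trailing number is captured,
    // so the enum name itself may contain underscores.
    private static final Pattern UNKNOWN_LABEL =
            Pattern.compile("^UNKNOWN_ENUM_.*_(-?\\d+)$");

    // readerSchema maps the labels known to the reader to their numbers.
    static OptionalInt resolve(String label, Map<String, Integer> readerSchema) {
        Integer known = readerSchema.get(label);
        if (known != null) {
            return OptionalInt.of(known); // label exists in the reader schema
        }
        Matcher m = UNKNOWN_LABEL.matcher(label);
        if (m.matches()) {
            try {
                // Case 1: the writer stored a generated "unknown" label,
                // so the number can be recovered from the label itself.
                return OptionalInt.of(Integer.parseInt(m.group(1)));
            } catch (NumberFormatException e) {
                return OptionalInt.empty(); // digits do not fit in an int
            }
        }
        // Case 2: a valid label that the reader's outdated schema lacks.
        // The label alone does not carry the number.
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        Map<String, Integer> schema = Map.of("RED", 0, "GREEN", 1);
        System.out.println(resolve("GREEN", schema));
        System.out.println(resolve("UNKNOWN_ENUM_Color_7", schema));
        System.out.println(resolve("BLUE", schema));
    }
}
```

In this sketch case 1 is recoverable from the label alone, while case 2
returns an empty result: the reader would need extra information, such as the
writer's label-to-number mapping, to recover the number.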



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
