[ https://issues.apache.org/jira/browse/PARQUET-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Szadovszky resolved PARQUET-1455.
---------------------------------------
    Resolution: Fixed

> [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf
> --------------------------------------------------------------------
>
>                 Key: PARQUET-1455
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1455
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Qinghui Xu
>            Assignee: Qinghui Xu
>            Priority: Major
>              Labels: pull-request-available
>
> Background -
> In protobuf, an enum behaves more like an integer than a string, and is
> encoded as an integer on the wire.
> Each enum value is associated with a number (integer), and people can set
> an enum field using a number directly, regardless of whether that number
> is associated with an enum value or not. When an enum field is set with a
> number that does not match any enum value defined in the schema, reading
> the field through the protobuf reflection API (as parquet-protobuf does)
> returns a label "UNKNOWN_ENUM_<enumName>_<number>" generated by protobuf
> reflection.
> Thus parquet-protobuf writes the string "UNKNOWN_ENUM_<enumName>_<number>"
> into the enum column whenever its protobuf schema does not recognize the
> number.
>
> Problem -
> There are two cases of unknown enum values when using parquet-protobuf:
> 1. The protobuf message already contains an unknown enum value when we
> write it to parquet (sometimes people manipulate enums using numbers), so
> the writer stores a label "UNKNOWN_ENUM_*" as a string in parquet. When we
> read it back into protobuf, we encounter this "true" unknown value.
> 2. The protobuf message contains a valid value when written to parquet,
> but the reader uses an outdated proto schema that is missing some enum
> values. The enum values absent from the old schema are therefore
> "unknown" to the reader.
> The current behavior of the parquet-protobuf reader is to reject both
> cases with a runtime exception. This does not make sense in case 1: the
> write path respects protobuf enum behavior while the read path does not.
> And case 2 should be handled if the protobuf user is interested in the
> number rather than the label.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
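
The two cases in the issue description can be illustrated with a small, self-contained Java sketch. It does not use the protobuf or parquet-protobuf APIs: the class `UnknownEnumSketch`, the enum name `Color`, both schema maps, and the helper methods are hypothetical and exist only for illustration. The label format is the one quoted in the description; the sketch is not the fix that resolved this issue.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnknownEnumSketch {
    // Hypothetical enum name and schemas (number -> label), for illustration only.
    static final String ENUM_NAME = "Color";
    static final Map<Integer, String> WRITER_SCHEMA = Map.of(0, "RED", 1, "GREEN", 2, "BLUE");
    // An outdated reader schema that is missing BLUE (case 2).
    static final Map<Integer, String> OLD_READER_SCHEMA = Map.of(0, "RED", 1, "GREEN");

    // Label format as quoted in the issue description.
    static final Pattern UNKNOWN_LABEL =
            Pattern.compile("UNKNOWN_ENUM_" + ENUM_NAME + "_(-?\\d+)");

    // Write side: use the schema label if the number is known,
    // otherwise fall back to the generated "unknown" label.
    static String labelFor(Map<Integer, String> schema, int number) {
        String label = schema.get(number);
        return label != null ? label : "UNKNOWN_ENUM_" + ENUM_NAME + "_" + number;
    }

    // Read side: resolve a stored label back to a number instead of throwing.
    static OptionalInt numberFor(Map<Integer, String> schema, String label) {
        for (Map.Entry<Integer, String> entry : schema.entrySet()) {
            if (entry.getValue().equals(label)) {
                return OptionalInt.of(entry.getKey());
            }
        }
        // Case 1: the label itself carries the number, so it can be recovered.
        Matcher matcher = UNKNOWN_LABEL.matcher(label);
        if (matcher.matches()) {
            return OptionalInt.of(Integer.parseInt(matcher.group(1)));
        }
        // Case 2: a valid label the reader's schema does not know. The number
        // cannot be recovered from the label alone in this sketch.
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Case 1: the writer sees a number (7) that its schema does not define.
        String stored = labelFor(WRITER_SCHEMA, 7);
        System.out.println(stored);                               // UNKNOWN_ENUM_Color_7
        System.out.println(numberFor(WRITER_SCHEMA, stored));     // OptionalInt[7]

        // Case 2: the writer stored "BLUE", but the reader's schema is outdated.
        System.out.println(numberFor(OLD_READER_SCHEMA, "BLUE")); // OptionalInt.empty
    }
}
```

In this sketch, case 1 round-trips because the generated label embeds the number. Case 2 returns an empty result because the stored label alone does not tell an outdated reader which number it maps to, so a reader interested in the number would need that mapping from somewhere else.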