[
https://issues.apache.org/jira/browse/PARQUET-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17174296#comment-17174296
]
ASF GitHub Bot commented on PARQUET-1455:
-----------------------------------------
gszadovszky commented on a change in pull request #561:
URL: https://github.com/apache/parquet-mr/pull/561#discussion_r467864366
##########
File path:
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##########
@@ -222,11 +263,37 @@ public ProtoEnumConverter(ParentValueContainer parent,
Descriptors.FieldDescript
Descriptors.EnumValueDescriptor protoValue = enumLookup.get(binaryValue);
if (protoValue == null) {
- Set<Binary> knownValues = enumLookup.keySet();
- String msg = "Illegal enum value \"" + binaryValue + "\""
- + " in protocol buffer \"" + fieldType.getFullName() + "\""
- + " legal values are: \"" + knownValues + "\"";
- throw new InvalidRecordException(msg);
+ // in case of unknown enum value, protobuf is creating new
EnumValueDescriptor with the unknown number
+ // and name as following "UNKNOWN_ENUM_VALUE_" + parent.getName() +
"_" + number
+ // so the idea is to parse the name for data created by parquet-proto
before this patch
+ String unknownLabel = new String(binaryValue.getBytes());
Review comment:
I would suggest using `Binary.toStringUsingUTF8()` instead.
##########
File path:
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##########
@@ -190,25 +211,45 @@ private Converter newScalarConverter(ParentValueContainer
pvc, Message.Builder p
private final Map<Binary, Descriptors.EnumValueDescriptor> enumLookup;
private Descriptors.EnumValueDescriptor[] dict;
private final ParentValueContainer parent;
+ private final Descriptors.EnumDescriptor enumType;
+ private final String unknownEnumPrefix;
+ private final boolean acceptUnknownEnum;
public ProtoEnumConverter(ParentValueContainer parent,
Descriptors.FieldDescriptor fieldType) {
this.parent = parent;
this.fieldType = fieldType;
- this.enumLookup = makeLookupStructure(fieldType);
+ this.enumType = fieldType.getEnumType();
+ this.enumLookup = makeLookupStructure(enumType);
+ unknownEnumPrefix = "UNKNOWN_ENUM_VALUE_" + enumType.getName() + "_";
+ acceptUnknownEnum = conf.getBoolean(CONFIG_ACCEPT_UNKNOWN_ENUM, false);
}
/**
* Fills lookup structure for translating between parquet enum values and
Protocol buffer enum values.
* */
- private Map<Binary, Descriptors.EnumValueDescriptor>
makeLookupStructure(Descriptors.FieldDescriptor enumFieldType) {
- Descriptors.EnumDescriptor enumType = enumFieldType.getEnumType();
+ private Map<Binary, Descriptors.EnumValueDescriptor>
makeLookupStructure(Descriptors.EnumDescriptor enumType) {
Map<Binary, Descriptors.EnumValueDescriptor> lookupStructure = new
HashMap<Binary, Descriptors.EnumValueDescriptor>();
- List<Descriptors.EnumValueDescriptor> enumValues = enumType.getValues();
+ if (extraMetadata.containsKey(METADATA_ENUM_PREFIX +
enumType.getFullName())) {
+ String enumNameNumberPairs = extraMetadata.get(METADATA_ENUM_PREFIX +
enumType.getFullName());
+ if (StringUtils.isBlank(enumNameNumberPairs)) {
+ LOG.info("No enum is written for " + enumType.getFullName());
Review comment:
As far as I understand this one will be logged quite often. We've
already had some about logging too much. I would suggest having this on debug
level instead.
##########
File path:
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
##########
@@ -22,7 +22,10 @@
import com.google.protobuf.Descriptors;
import com.google.protobuf.Message;
import com.twitter.elephantbird.util.Protobufs;
+import org.apache.commons.lang.StringUtils;
Review comment:
There is no direct dependency to `commons-lang` in `parquet-protobuf` so
I would not start using it for such a simple functionality.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> [parquet-protobuf] Handle "unknown" enum values for parquet-protobuf
> --------------------------------------------------------------------
>
> Key: PARQUET-1455
> URL: https://issues.apache.org/jira/browse/PARQUET-1455
> Project: Parquet
> Issue Type: Bug
> Reporter: Qinghui Xu
> Assignee: Qinghui Xu
> Priority: Major
> Labels: pull-request-available
>
> Background -
> In protobuf enum is more like integers other than string, and is encoded as
> integer on the wire.
> In Protobuf, each enum value is associated with a number (integer), and
> people can set enum field using number directly regardless whether the number
> is associated to an enum value or not. While enum filed is set with a number
> that does not match any enum value defined in the schema, by using protobuf
> reflection API (as parquet-protobuf does) to read the enum field we will get
> a label "UNKNOWN_ENUM_<enumName>_<number>" generated by protobuf reflection.
> Thus parquet-protobuf will write string "UNKNOWN_ENUM_<enumName>_<number>"
> into the enum column whenever its protobuf schema does not recognize the
> number.
>
> Problematics -
> There are two cases of unknown enum while using parquet-protobuf:
> 1. Protobuf already contains unknown enum when we write it to parquet
> (sometimes people manipulate enum using numbers), so it will write a label
> "UNKNOWN_ENUM_*" as string in parquet. And when we read it back to protobuf,
> we found this "true" unknown value
> 2. Protobuf contains valid value when write to parquet, but the reader uses
> an outdated proto schema which misses some enum values. So the
> not-in-old-schema enum values are "unknown" to the reader.
> Current behavior of parquet-proto reader is to reject in both cases with some
> runtime exception. This does not make sense in case 1, the write part does
> respect protobuf enum behavior while the read part does not. And case 2
> should be handled if protobuf user is interested in the number instead of
> label.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)