Kristoffer Sjögren created PARQUET-697:
------------------------------------------
Summary: ProtoMessageConverter fails for unknown proto fields
Key: PARQUET-697
URL: https://issues.apache.org/jira/browse/PARQUET-697
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Affects Versions: 1.8.1
Reporter: Kristoffer Sjögren
Hi
We have a Spark application that reads Parquet files and turns them into a
Protobuf RDD, as in the code below [1]. However, if the Parquet schema contains
fields that don't exist in the Protobuf class, an
IncompatibleSchemaModificationException [2] is thrown.
For compatibility reasons it would be nice to be able to ignore such fields
instead of throwing an exception, perhaps behind a configuration option. The fix
for ignoring fields is quite easy: just instantiate an empty PrimitiveConverter
instead [3] (a rough sketch follows the references below).
Cheers,
-Kristoffer
[1]
JobConf conf = new JobConf(ctx.hadoopConfiguration());
FileInputFormat.setInputPaths(conf, rawPath);
ProtoReadSupport.setProtobufClass(conf, Msg.class.getName());

NewHadoopRDD<Void, Msg.Builder> rdd =
    new NewHadoopRDD(ctx.sc(), ProtoParquetInputFormat.class, Void.class,
        Msg.class, conf);

rdd.toJavaRDD().foreach(log -> {
  System.out.println(log._2);
});
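If this were configurable, the setup in [1] would ideally only need one extra
line along these lines (the property name below is purely hypothetical, not an
existing parquet-mr option):

// Hypothetical switch to ignore Parquet columns missing from the proto class.
conf.setBoolean("parquet.proto.ignoreUnknownFields", true);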
[2]
https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java#L84
[3] converters[parquetFieldIndex - 1] = new PrimitiveConverter() {};
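To make [3] concrete, here is a minimal sketch of what the converter setup
could look like with that behaviour behind a flag. This is illustrative only,
not the actual ProtoMessageConverter source; the ignoreUnknownFields parameter
and the defaultConverterFor helper are stand-ins for whatever configuration
mechanism and converter lookup parquet-protobuf would really use.

import com.google.protobuf.Descriptors;
import org.apache.parquet.io.api.Converter;
import org.apache.parquet.io.api.PrimitiveConverter;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.Type;

// Illustrative sketch only, not the actual parquet-mr source.
class LenientProtoConverterSketch {

  static Converter[] buildConverters(GroupType parquetSchema,
                                     Descriptors.Descriptor protoDescriptor,
                                     boolean ignoreUnknownFields) {
    Converter[] converters = new Converter[parquetSchema.getFieldCount()];
    int parquetFieldIndex = 1;

    for (Type parquetField : parquetSchema.getFields()) {
      Descriptors.FieldDescriptor protoField =
          protoDescriptor.findFieldByName(parquetField.getName());

      if (protoField == null) {
        if (ignoreUnknownFields) {
          // The one-liner from [3]: a no-op converter reads the unknown
          // Parquet column and silently discards its values.
          converters[parquetFieldIndex - 1] = new PrimitiveConverter() {};
          parquetFieldIndex++;
          continue;
        }
        // Current behaviour: the real code throws
        // IncompatibleSchemaModificationException at this point [2].
        throw new RuntimeException(
            "Cant find \"" + parquetField.getName() + "\" in proto descriptor");
      }

      // Placeholder for the per-field converter that ProtoMessageConverter
      // builds for columns that do exist in the proto class.
      converters[parquetFieldIndex - 1] = defaultConverterFor(protoField, parquetField);
      parquetFieldIndex++;
    }
    return converters;
  }

  // Hypothetical stand-in for the real per-field converter construction.
  private static Converter defaultConverterFor(Descriptors.FieldDescriptor protoField,
                                               Type parquetField) {
    return new PrimitiveConverter() {};
  }
}

The key point is just the new PrimitiveConverter() {} branch: the unknown
Parquet column is still read, but its values are dropped instead of failing
the whole job.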
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)