Kristoffer Sjögren created PARQUET-697:
------------------------------------------

             Summary: ProtoMessageConverter fails for unknown proto fields
                 Key: PARQUET-697
                 URL: https://issues.apache.org/jira/browse/PARQUET-697
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
    Affects Versions: 1.8.1
            Reporter: Kristoffer Sjögren


Hi

We have a Spark application that reads Parquet files and turns them into a 
Protobuf RDD like the code below [1]. However, if the Parquet schema contains 
fields that don't exist in the Protobuf class, an 
IncompatibleSchemaModificationException [2] is thrown. 

For compatibility reasons it would be nice to make it possible to ignore such 
fields instead of throwing an exception. Maybe as a configuration option? The 
fix for ignoring fields is quite easy: just instantiate an empty 
PrimitiveConverter instead, see [3] and the sketch below it.
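
For illustration, the opt-in could be exposed through the Hadoop Configuration, 
roughly as sketched below. The property name is hypothetical and does not exist 
in parquet-mr today:

// Hypothetical opt-in; the property name "parquet.proto.ignoreUnknownFields"
// is illustrative only, not an existing parquet-mr option.
conf.setBoolean("parquet.proto.ignoreUnknownFields", true);
ProtoReadSupport.setProtobufClass(conf, Msg.class.getName());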

Cheers,
-Kristoffer


[1]
JobConf conf = new JobConf(ctx.hadoopConfiguration());
FileInputFormat.setInputPaths(conf, rawPath);
ProtoReadSupport.setProtobufClass(conf, Msg.class.getName());
// Read the Parquet files as (Void, Msg.Builder) pairs via ProtoParquetInputFormat
NewHadoopRDD<Void, Msg.Builder> rdd =
    new NewHadoopRDD(ctx.sc(), ProtoParquetInputFormat.class, Void.class,
        Msg.class, conf);
rdd.toJavaRDD().foreach(log -> {
  System.out.println(log._2);
});

[2] 
https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java#L84

[3] converters[parquetFieldIndex - 1] = new PrimitiveConverter() {};
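
To make [3] concrete, here is a rough sketch of how the field-lookup loop in 
ProtoMessageConverter's constructor could fall back to an empty 
PrimitiveConverter instead of throwing. The ignoreUnknownFields flag is 
hypothetical, and the surrounding names (parquetSchema, protoDescriptor, 
converters, parquetFieldIndex, newMessageConverter) only approximate the 
existing code; this is not a verbatim patch:

// Sketch only: walk the Parquet fields and look up the matching proto field.
for (Type parquetField : parquetSchema.getFields()) {
  Descriptors.FieldDescriptor protoField =
      protoDescriptor.findFieldByName(parquetField.getName());
  if (protoField == null) {
    if (ignoreUnknownFields) {
      // Unknown column: attach a no-op converter and skip it, as in [3].
      converters[parquetFieldIndex - 1] = new PrimitiveConverter() {};
      parquetFieldIndex++;
      continue;
    }
    // Current behaviour: fail on schema mismatch, see [2].
    throw new IncompatibleSchemaModificationException(
        "Cannot find field " + parquetField.getName() + " in proto descriptor");
  }
  converters[parquetFieldIndex - 1] =
      newMessageConverter(parentBuilder, protoField, parquetField);
  parquetFieldIndex++;
}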



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
