[ https://issues.apache.org/jira/browse/SPARK-19109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131877#comment-16131877 ]
Dongjoon Hyun commented on SPARK-19109: --------------------------------------- Hi, [~nseggert] and [~wangchao2017]. Could you give us a way to reproduce this? > ORC metadata section can sometimes exceed protobuf message size limit > --------------------------------------------------------------------- > > Key: SPARK-19109 > URL: https://issues.apache.org/jira/browse/SPARK-19109 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 > Reporter: Nic Eggert > > Basically, Spark inherits HIVE-11592 from its Hive dependency. From that > issue: > If there are too many small stripes and with many columns, the overhead for > storing metadata (column stats) can exceed the default protobuf message size > of 64MB. Reading such files will throw the following exception > {code} > Exception in thread "main" > com.google.protobuf.InvalidProtocolBufferException: Protocol message was too > large. May be malicious. Use CodedInputStream.setSizeLimit() to increase > the size limit. > at > com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110) > at > com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755) > at > com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811) > at > com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369) > at > com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4887) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4803) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985) > at > com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12925) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12872) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956) > at > com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13599) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13546) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630) > at > com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200) > at > com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217) > at > com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223) > at > com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.<init>(ReaderImpl.java:468) > at > org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:314) > at > org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228) > at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.hadoop.util.RunJar.run(RunJar.java:221) > at org.apache.hadoop.util.RunJar.main(RunJar.java:136) > {code} > This is fixed in Hive 1.3, so it should be fairly straightforward to pick up > the patch. > As a side note: Spark's management of its Hive fork/dependency seems > incredibly arcane to me. Surely there's a better way than publishing to > central from developers' personal repos. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org