[ https://issues.apache.org/jira/browse/SPARK-19109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328330#comment-16328330 ]

Dongjoon Hyun edited comment on SPARK-19109 at 1/17/18 6:27 AM:
----------------------------------------------------------------

HIVE-11592 is fixed in Hive 1.3.0 and ORC 1.4.1 library has the patch.

Since SPARK-20682 / SPARK-20728 / SPARK-22279, we are using native ORC implementation based on ORC 1.4.1. This issue is fixed in Apache Spark default configuration.

{code}
public static final int PROTOBUF_MESSAGE_MAX_LIMIT = 1024 << 20; // 1GB
{code}


was (Author: dongjoon):
HIVE-11592 is fixed in Hive 1.3.0 and ORC 1.4.1 library has the patch.

Since SPARK-20682 / SPARK-20728 / SPARK-22279, we are using native ORC implementation based on ORC 1.4.1. This issue is fixed by default configuration.

{code}
public static final int PROTOBUF_MESSAGE_MAX_LIMIT = 1024 << 20; // 1GB
{code}

> ORC metadata section can sometimes exceed protobuf message size limit
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19109
>                 URL: https://issues.apache.org/jira/browse/SPARK-19109
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>            Reporter: Nic Eggert
>            Priority: Major
>         Attachments: InsertPic_.png
>
>
> Basically, Spark inherits HIVE-11592 from its Hive dependency. From that issue:
> If there are too many small stripes and with many columns, the overhead for storing metadata (column stats) can exceed the default protobuf message size of 64MB. Reading such files will throw the following exception
> {code}
> Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
>         at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
>         at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
>         at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
>         at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
>         at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4887)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4803)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985)
>         at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12925)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12872)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956)
>         at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13599)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13546)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630)
>         at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>         at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>         at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>         at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>         at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746)
>         at org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.<init>(ReaderImpl.java:468)
>         at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:314)
>         at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
>         at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {code}
> This is fixed in Hive 1.3, so it should be fairly straightforward to pick up the patch.
> As a side note: Spark's management of its Hive fork/dependency seems incredibly arcane to me. Surely there's a better way than publishing to central from developers' personal repos.
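
The two sizes mentioned in this thread (the 64MB protobuf default from the issue description and the `1024 << 20` constant quoted in the comment) can be checked with a little integer arithmetic. The following is only a minimal sketch, not code from ORC, Hive, or Spark: the class name, the `exceedsLimit` helper, the `PROTOBUF_DEFAULT_LIMIT` constant, and the 100MB example size are all hypothetical and exist purely for illustration.

```java
// Illustrative sketch only; names other than the quoted constant are hypothetical.
final class ProtobufLimitSketch {
    // Value quoted in the comment on this issue.
    static final int PROTOBUF_MESSAGE_MAX_LIMIT = 1024 << 20; // 1GB

    // Assumption: the 64MB default cited in the issue description,
    // written in the same shift style so the two can be compared.
    static final int PROTOBUF_DEFAULT_LIMIT = 64 << 20; // 64MB

    // Hypothetical helper: true when a serialized metadata section of the
    // given size is larger than the given limit.
    static boolean exceedsLimit(long metadataBytes, int limitBytes) {
        return metadataBytes > limitBytes;
    }

    public static void main(String[] args) {
        // A made-up 100MB metadata section, e.g. many small stripes with many columns.
        long hypotheticalMetadataBytes = 100L << 20;

        System.out.println("default limit (bytes): " + PROTOBUF_DEFAULT_LIMIT);
        System.out.println("raised limit (bytes):  " + PROTOBUF_MESSAGE_MAX_LIMIT);
        System.out.println("over default limit: "
                + exceedsLimit(hypotheticalMetadataBytes, PROTOBUF_DEFAULT_LIMIT));
        System.out.println("over raised limit:  "
                + exceedsLimit(hypotheticalMetadataBytes, PROTOBUF_MESSAGE_MAX_LIMIT));
    }
}
```

Under these assumptions the raised limit is 16 times the default, and `1024 << 20` still fits in a Java `int`, whereas doubling it once more (`2048 << 20`) would overflow.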