[
https://issues.apache.org/jira/browse/HIVE-28026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhihua Deng updated HIVE-28026:
-------------------------------
Fix Version/s: Not Applicable
(was: 4.0.1)
> Reading proto data more than 2GB from multiple splits fails
> -----------------------------------------------------------
>
> Key: HIVE-28026
> URL: https://issues.apache.org/jira/browse/HIVE-28026
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.0.0-beta-1
> Environment:
> Reporter: Raghav Aggarwal
> Assignee: Raghav Aggarwal
> Priority: Major
> Labels: pull-request-available
> Fix For: Not Applicable
>
>
> {*}Query{*}: select * from _<table_name>_
> {*}Explanation{*}:
> On running the above query on a Hive proto table, multiple Tez containers
> are spawned to process the data. Within a container, if there are multiple
> HDFS splits and the combined size of the decompressed data exceeds 2GB, the
> query fails with the following error:
> {code:java}
> "While parsing a protocol message, the input ended unexpectedly in the middle
> of a field. This could mean either that the input has been truncated or that
> an embedded message misreported its own length." {code}
>
> This happens because of
> _[CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]_,
> specifically the line _byteLimit += totalBytesRetired + pos;_
> _byteLimit_ hits an integer overflow because _totalBytesRetired_ keeps a
> running count of every byte read so far, since the CodedInputStream is
> initialized only once per container:
> [https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96]
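>
> For illustration, here is a minimal, standalone sketch of that overflow (not
> Hive code); the concrete numbers are assumptions chosen only to make the
> wrap-around visible:
> {code:java}
> // Standalone illustration of the int overflow described above (not Hive code).
> // Assumes roughly 2GB has already been read through the same, reused
> // CodedInputStream, so totalBytesRetired is already close to Integer.MAX_VALUE.
> public class PushLimitOverflowDemo {
>   public static void main(String[] args) {
>     int totalBytesRetired = 2_100_000_000; // bytes already consumed by the reused stream
>     int pos = 4_096;                       // position in the current buffer
>     int byteLimit = 50_000_000;            // size of the next message to read
>
>     // Same arithmetic as CodedInputStream.pushLimit(int):
>     byteLimit += totalBytesRetired + pos;  // wraps past Integer.MAX_VALUE
>     System.out.println(byteLimit);         // prints -2144963200
>
>     // A negative limit makes the decoder think the message ends early, which
>     // surfaces as the "input ended unexpectedly" InvalidProtocolBufferException.
>   }
> }
> {code}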
>
> This is different from the issue reproduced in
> [https://github.com/zabetak/protobuf-large-message]: there a single proto
> data file is larger than 2GB, whereas in this case multiple files together
> add up to more than 2GB.
> CC [~zabetak]
> *Limitation:*
> This fix still does not resolve the issue described in
> [https://github.com/protocolbuffers/protobuf/issues/11729]
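>
> Since the overflow comes from _totalBytesRetired_ accumulating across
> messages, one possible mitigation is to reset the stream's size counter
> before each message. The sketch below only illustrates that idea and is not
> necessarily the approach taken in the actual fix; apart from the public
> CodedInputStream API, the class and method names are made up, and it still
> would not help with a single message larger than 2GB, which is the protobuf
> limitation above.
> {code:java}
> import com.google.protobuf.CodedInputStream;
>
> import java.io.IOException;
> import java.io.InputStream;
>
> // Illustrative reader that consumes length-prefixed proto messages from one
> // long-lived stream without letting totalBytesRetired grow without bound.
> class ProtoMessageReaderSketch {
>   private final CodedInputStream cin;
>
>   ProtoMessageReaderSketch(InputStream in) {
>     cin = CodedInputStream.newInstance(in);
>     cin.setSizeLimit(Integer.MAX_VALUE);
>   }
>
>   byte[] readNextMessage() throws IOException {
>     // Forget the bytes already retired so the pushLimit() arithmetic below
>     // starts from zero instead of the running per-container total.
>     cin.resetSizeCounter();
>     int size = cin.readRawVarint32();   // length prefix of the next message
>     int oldLimit = cin.pushLimit(size); // would overflow without the reset
>     byte[] data = cin.readRawBytes(size);
>     cin.popLimit(oldLimit);
>     return data;
>   }
> }
> {code}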
> Here is the beeline command used to reproduce the issue:
>
> {code:java}
> beeline -u
> 'jdbc:hive2://hostnames/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;thrift.client.max.message.size=2147483647'
> --showHeader=false --outputformat=tsv2 -e "select * from
> raaggarw.proto_hive_query_data where executionmode='MR' and otherinfo['CONF']
> != 'NULL'" >> ./output {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)