[ https://issues.apache.org/jira/browse/HIVE-28026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raghav Aggarwal updated HIVE-28026:
-----------------------------------
Description:
*Query:* select * from _<table_name>_
*Explanation:*
When the above query runs on a Hive proto table, multiple Tez containers are spawned to process the data. Within a single container, if there are multiple HDFS splits and the combined size of the decompressed data exceeds 2 GB, the query fails with the following error:
{code:java}
"While parsing a protocol message, the input ended unexpectedly in the middle
of a field. This could mean either that the input has been truncated or that
an embedded message misreported its own length." {code}
This happens because of _[CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]_, i.e. _byteLimit += totalBytesRetired + pos;_
_byteLimit_ hits an integer overflow because _totalBytesRetired_ retains the count of all bytes read so far: the CodedInputStream is initialized only once per container ([https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96]), so the counter keeps accumulating across every split that the container processes.
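For illustration, here is a minimal, standalone Java sketch (not Hive or protobuf source; the byte counts are made-up assumptions) of how that 32-bit addition wraps around once a single, long-lived CodedInputStream has retired close to 2 GB:
{code:java}
// Standalone sketch with hypothetical numbers: mimics the effect of
// "byteLimit += totalBytesRetired + pos;" once one CodedInputStream
// instance has already consumed nearly 2 GB inside the same container.
public class PushLimitOverflowSketch {
  public static void main(String[] args) {
    int totalBytesRetired = 2_147_000_000; // bytes already read by this stream (assumed)
    int pos               = 400_000;       // current position in the internal buffer (assumed)
    int byteLimit         = 100_000;       // size of the next serialized message (assumed)

    byteLimit += totalBytesRetired + pos;  // int addition wraps past Integer.MAX_VALUE

    // The limit becomes negative, so the parser believes the stream ends before the
    // message does and fails with the "input ended unexpectedly" error quoted above.
    System.out.println("computed byteLimit = " + byteLimit); // prints a negative value
  }
}
{code}
Keeping the accumulated count below 2 GB, e.g. by resetting the stream's size counter (CodedInputStream#resetSizeCounter) or re-creating the stream per split, avoids the wrap-around.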
This is different from the issue reproduced in [https://github.com/zabetak/protobuf-large-message], where a single proto data file is larger than 2 GB; in my case there are multiple smaller files whose combined size exceeds 2 GB.
CC [~zabetak]
*Limitation:*
This fix still does not resolve the issue described in [https://github.com/protocolbuffers/protobuf/issues/11729].
Here is the table description:
{code:java}
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| eventtype | string | from deserializer |
| hivequeryid | string | from deserializer |
| timestamp | bigint | from deserializer |
| executionmode | string | from deserializer |
| requestuser | string | from deserializer |
| queue | string | from deserializer |
| user | string | from deserializer |
| operationid | string | from deserializer |
| tableswritten | array<string> | from deserializer |
| tablesread | array<string> | from deserializer |
| otherinfo | array<struct<key:string,value:string>> | from deserializer |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| date | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | raghav | NULL |
| OwnerType: | USER | NULL |
| Owner: | hive | NULL |
| CreateTime: | Tue Dec 19 11:05:48 PST 2023 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | hdfs://hostname:8020/user/raghav/query_data | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | EXTERNAL | TRUE |
| | bucketing_version | 2 |
| | discover.partitions | true |
| | numFiles | 14 |
| | numPartitions | 7 |
| | numRows | 0 |
| | proto.class | org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto |
| | proto.maptypes | org.apache.hadoop.hive.ql.hooks.proto.MapFieldEntry |
| | rawDataSize | 0 |
| | totalSize | 62336619013 |
| | transient_lastDdlTime | 1703012748 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
{code}
> Reading proto data more than 2GB from multiple splits fails
> -----------------------------------------------------------
>
> Key: HIVE-28026
> URL: https://issues.apache.org/jira/browse/HIVE-28026
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.0.0-beta-1
> Environment:
> Reporter: Raghav Aggarwal
> Assignee: Raghav Aggarwal
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)