Raghav Aggarwal created HIVE-28026:
--------------------------------------
Summary: Reading proto data more than 2GB from multiple splits
fails
Key: HIVE-28026
URL: https://issues.apache.org/jira/browse/HIVE-28026
Project: Hive
Issue Type: Bug
Affects Versions: 4.0.0-beta-1
Environment:
Reporter: Raghav Aggarwal
Assignee: Raghav Aggarwal
{*}Query{*}: select * from _<table_name>_
{*}Explanation{*}:
On running the above query on a Hive proto table, multiple Tez containers
are spawned to process the data. Within a container, if there are multiple
HDFS splits and the combined size of the decompressed data exceeds 2GB, the
query fails with the following error:
{code:java}
"While parsing a protocol message, the input ended unexpectedly in the middle
of a field. This could mean either that the input has been truncated or that
an embedded message misreported its own length." {code}
This happens because of
_[CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]_,
i.e. _byteLimit += totalBytesRetired + pos;_
_byteLimit_ hits an integer overflow because _totalBytesRetired_ keeps
accumulating all the bytes read so far: the CodedInputStream is initialized
only once per container
([https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96]).
This is different from the issue reproduced in
[https://github.com/zabetak/protobuf-large-message]: there, a single proto
data file is larger than 2GB, whereas in this case there are multiple files
whose combined size exceeds 2GB.
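For illustration only (this is not necessarily the patch applied for HIVE-28026), doing the same arithmetic in long detects the wrap that plain int addition silently accepts:

{code:java}
// Illustrative defensive variant: widen to long before adding, then
// fail fast if the result no longer fits in an int.
public class SafeLimitSketch {
    static int pushLimitChecked(int byteLimit, int totalBytesRetired, int pos) {
        long result = (long) byteLimit + totalBytesRetired + pos;
        if (result > Integer.MAX_VALUE) {
            throw new ArithmeticException("byte limit overflows int: " + result);
        }
        return (int) result;
    }

    public static void main(String[] args) {
        System.out.println(pushLimitChecked(1024, 0, 0));
        try {
            pushLimitChecked(1024, Integer.MAX_VALUE - 100, 200);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
{code}

An alternative in ProtoMessageWritable itself would be to stop reusing one CodedInputStream for the whole container, so _totalBytesRetired_ never accumulates across splits.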
*Limitation:*
This fix will still not resolve the issue mentioned in
[https://github.com/protocolbuffers/protobuf/issues/11729].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)