[ 
https://issues.apache.org/jira/browse/HIVE-28026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810395#comment-17810395
 ] 

Stamatis Zampetakis commented on HIVE-28026:
--------------------------------------------

Thanks for logging this [~Aggarwal_Raghav].

Please add the DDL for the table that is causing the failure in the description 
of this ticket and any other details needed so that people reading this ticket 
can fully understand the use-case and repro setup. 

Moreover, the problem seems like something that could be reproduced via a unit 
test, since I guess it suffices to generate 2GB of data and then read it back. 
I don't know if the test would be committable, but it would definitely aid in 
understanding and reviewing the PR.
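The arithmetic behind the reported failure can be shown in isolation. Below is a minimal sketch in plain Java (not Hive or protobuf code; the variable names merely mirror the ones cited in the description) of how an int-based byte counter that is kept across splits wraps to a negative value once the running total crosses Integer.MAX_VALUE:

{code:java}
// Minimal sketch (assumption: not actual Hive/protobuf code): an int counter
// retained across splits overflows once more than 2GB has been read.
public class ByteLimitOverflowSketch {
    public static void main(String[] args) {
        int totalBytesRetired = Integer.MAX_VALUE - 10; // ~2GB already consumed
        int pos = 0;                                    // position in current buffer
        int byteLimit = 100;                            // size of the next chunk

        // Mirrors the pattern: byteLimit += totalBytesRetired + pos;
        byteLimit += totalBytesRetired + pos;

        // The sum exceeds Integer.MAX_VALUE and wraps to a negative value,
        // which a parser would then treat as a truncated/invalid limit.
        System.out.println("byteLimit = " + byteLimit);
    }
}
{code}

A negative limit here is what would make a stream-based parser conclude the input ended unexpectedly.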

> Reading proto data more than 2GB from multiple splits fails
> -----------------------------------------------------------
>
>                 Key: HIVE-28026
>                 URL: https://issues.apache.org/jira/browse/HIVE-28026
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0-beta-1
>         Environment:   
>            Reporter: Raghav Aggarwal
>            Assignee: Raghav Aggarwal
>            Priority: Major
>              Labels: pull-request-available
>
> {*}Query{*}: select * from _<table_name>_
> {*}Explanation{*}:
> On running the above-mentioned query on a Hive proto table, multiple Tez 
> containers are spawned to process the data. Within a container, if there are 
> multiple HDFS splits and the combined size of the decompressed data exceeds 
> 2GB, then the query fails with the following error:
> {code:java}
> "While parsing a protocol message, the input ended unexpectedly in the middle 
> of a field.  This could mean either that the input has been truncated or that 
> an embedded message misreported its own length." {code}
>  
> This is happening because of 
> _[CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]
>  i.e. byteLimit += totalBytesRetired + pos;_
> _byteLimit_ hits an integer overflow because _totalBytesRetired_ retains the 
> count of all the bytes read so far, as CodedInputStream is initialized only 
> once per container 
> [https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96].
>  
> This is different from the issue reproduced in 
> [https://github.com/zabetak/protobuf-large-message], as there a single proto 
> data file is larger than 2GB, whereas in my case there are multiple files 
> that together total more than 2GB.
> CC [~zabetak] 
> *Limitation:*
> This fix will still not resolve the issue mentioned in 
> [https://github.com/protocolbuffers/protobuf/issues/11729] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
