Aggarwal-Raghav opened a new pull request, #5033:
URL: https://github.com/apache/hive/pull/5033

   ### What changes were proposed in this pull request?
   [HIVE-28026](https://issues.apache.org/jira/browse/HIVE-28026)
   
   
   ### Why are the changes needed?
   **Query**: select * from <table_name>
   
   **Explanation**:
   When the above query runs on a Hive proto table, multiple Tez containers 
are spawned to process the data. If a container handles multiple HDFS splits 
and the combined size of the decompressed data exceeds 2GB, the query fails 
with the following error:
   `"While parsing a protocol message, the input ended unexpectedly in the 
middle of a field.  This could mean either that the input has been truncated or 
that an embedded message misreported its own length."`
   
   This happens because of the following statement in 
[CodedInputStream](https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16)
 i.e. _byteLimit += totalBytesRetired + pos;_
   _byteLimit_ hits an integer overflow because _totalBytesRetired_ retains 
the count of all the bytes read so far, as the CodedInputStream is initialized 
only once per container. 
https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96
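   The arithmetic behind the overflow can be sketched in isolation. The 
snippet below is a simplified model, not the protobuf implementation: the 
class `ByteLimitOverflowSketch` and its methods are hypothetical names, and 
only the expression `byteLimit += totalBytesRetired + pos` is taken from the 
description above. It assumes all three values are Java `int`s, as the 
overflow implies.

```java
final class ByteLimitOverflowSketch {

    // Hypothetical stand-in for the statement quoted above:
    //   byteLimit += totalBytesRetired + pos;
    // Plain int arithmetic wraps around silently instead of failing.
    static int absoluteLimit(int byteLimit, int totalBytesRetired, int pos) {
        byteLimit += totalBytesRetired + pos;
        return byteLimit;
    }

    // Same sum with overflow detection, to show when the wrap-around occurs.
    static boolean overflows(int byteLimit, int totalBytesRetired, int pos) {
        try {
            Math.addExact(byteLimit, Math.addExact(totalBytesRetired, pos));
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int messageSize = 64 * 1024 * 1024; // assumed size of one message (64 MiB)

        // Freshly created stream: nothing retired yet, the limit is sane.
        System.out.println("fresh stream limit:  "
                + absoluteLimit(messageSize, 0, 0));

        // Stream reused across many splits: the retired-byte counter is
        // already close to Integer.MAX_VALUE (about 2GB), so the sum wraps
        // around to a negative value.
        int retired = Integer.MAX_VALUE - 1024;
        System.out.println("reused stream limit: "
                + absoluteLimit(messageSize, retired, 0));
        System.out.println("overflow detected:   "
                + overflows(messageSize, retired, 0));
    }
}
```

   In this model the limit computed for the reused stream is negative even 
though each individual message is small, which matches the observation that 
the failure depends on the total bytes read by the container rather than on 
the size of any single file.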
   
   This differs from the issue reproduced in 
https://github.com/zabetak/protobuf-large-message : there, a single proto 
data file is larger than 2GB, whereas in this case multiple files together 
add up to more than 2GB.
   
   **Limitation**:
   This fix does not resolve the issue described in 
https://github.com/protocolbuffers/protobuf/issues/11729 
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   ### Is the change a dependency upgrade?
   NO
   
   
   ### How was this patch tested?
   On a cluster
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

