[jira] [Updated] (HIVE-28026) Reading proto data more than 2GB from multiple splits fails

Raghav Aggarwal (Jira) Wed, 24 Jan 2024 08:02:19 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-28026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Raghav Aggarwal updated HIVE-28026:
-----------------------------------
    Description: 
{*}Query{*}: select * from _<table_name>_

{*}Explanation{*}:

Running the above query on a Hive proto table spawns multiple Tez 
containers to process the data. If a container is assigned multiple HDFS 
splits and the combined size of the decompressed data exceeds 2GB, the 
query fails with the following error:
{code:java}
"While parsing a protocol message, the input ended unexpectedly in the middle 
of a field.  This could mean either that the input has been truncated or that 
an embedded message misreported its own length." {code}
 

This happens because of the following statement in 
_[CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]_:
_byteLimit += totalBytesRetired + pos;_

_byteLimit_ overflows the int range because _totalBytesRetired_ keeps a 
running count of all the bytes read so far, and the CodedInputStream is 
initialized only once per container: 
[https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96]
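
The arithmetic behind the overflow can be sketched in isolation. The class below is a simplified illustration, not protobuf's actual code: the class name, method names and sizes are hypothetical, and it only mirrors the quoted statement using plain int fields. It shows that once the running byte count approaches Integer.MAX_VALUE, adding the size of the next message wraps the limit to a negative value.

```java
// Simplified illustration of the quoted arithmetic; NOT protobuf's CodedInputStream.
// All names and sizes here are hypothetical.
final class ByteLimitOverflowSketch {

    // Same shape as the quoted statement: byteLimit += totalBytesRetired + pos;
    // With int arithmetic the sum silently wraps around on overflow.
    static int absoluteLimit(int byteLimit, int totalBytesRetired, int pos) {
        byteLimit += totalBytesRetired + pos;
        return byteLimit;
    }

    // The same sum computed with long arithmetic, for comparison only.
    static long absoluteLimitWide(int byteLimit, int totalBytesRetired, int pos) {
        return (long) byteLimit + totalBytesRetired + pos;
    }

    public static void main(String[] args) {
        int messageSize = 64 * 1024 * 1024;          // hypothetical 64 MiB message
        int retiredEarly = 100 * 1024 * 1024;        // early in the container's life
        int retiredLate = Integer.MAX_VALUE - 1024;  // after reading almost 2GB in total

        // Early on, the computed limit is a sane positive offset.
        System.out.println("early limit: " + absoluteLimit(messageSize, retiredEarly, 0));

        // Near 2GB of cumulative reads, the int sum wraps to a negative value.
        System.out.println("late limit:  " + absoluteLimit(messageSize, retiredLate, 0));

        // The mathematically correct value no longer fits in an int.
        System.out.println("late (long): " + absoluteLimitWide(messageSize, retiredLate, 0));
    }
}
```

No single message needs to be large for this to happen; only the cumulative count carried across splits matters, which matches the multi-file scenario described in this issue.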

 

This is different from the issue reproduced in 
[https://github.com/zabetak/protobuf-large-message], where a single proto 
data file is larger than 2GB. In my case, multiple files together add up 
to more than 2GB.

CC [~zabetak] 

*Limitation:*

This fix does not resolve the issue described in 
[https://github.com/protocolbuffers/protobuf/issues/11729].

Here is the beeline command that was run:

 
{code:java}
beeline -u 'jdbc:hive2://hostnames/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;thrift.client.max.message.size=2147483647' \
  --showHeader=false --outputformat=tsv2 \
  -e "select * from raaggarw.proto_hive_query_data where executionmode='MR' and otherinfo['CONF'] != 'NULL'" >> ./output
{code}
 

  was:
{*}Query{*}: select * from _<table_name>_

{*}Explanation{*}:

Running the above query on a Hive proto table spawns multiple Tez 
containers to process the data. If a container is assigned multiple HDFS 
splits and the combined size of the decompressed data exceeds 2GB, the 
query fails with the following error:
{code:java}
"While parsing a protocol message, the input ended unexpectedly in the middle 
of a field.  This could mean either that the input has been truncated or that 
an embedded message misreported its own length." {code}
 

This happens because of the following statement in 
_[CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]_:
_byteLimit += totalBytesRetired + pos;_

_byteLimit_ overflows the int range because _totalBytesRetired_ keeps a 
running count of all the bytes read so far, and the CodedInputStream is 
initialized only once per container: 
[https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96]

 

This is different from the issue reproduced in 
[https://github.com/zabetak/protobuf-large-message], where a single proto 
data file is larger than 2GB. In my case, multiple files together add up 
to more than 2GB.

CC [~zabetak] 

*Limitation:*

This fix does not resolve the issue described in 
[https://github.com/protocolbuffers/protobuf/issues/11729].

Here is the table description:

 
{code:java}
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
|           col_name            |                     data_type                      |                      comment                       |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name                    | data_type                                          | comment                                            |
| eventtype                     | string                                             | from deserializer                                  |
| hivequeryid                   | string                                             | from deserializer                                  |
| timestamp                     | bigint                                             | from deserializer                                  |
| executionmode                 | string                                             | from deserializer                                  |
| requestuser                   | string                                             | from deserializer                                  |
| queue                         | string                                             | from deserializer                                  |
| user                          | string                                             | from deserializer                                  |
| operationid                   | string                                             | from deserializer                                  |
| tableswritten                 | array<string>                                      | from deserializer                                  |
| tablesread                    | array<string>                                      | from deserializer                                  |
| otherinfo                     | array<struct<key:string,value:string>>             | from deserializer                                  |
|                               | NULL                                               | NULL                                               |
| # Partition Information       | NULL                                               | NULL                                               |
| # col_name                    | data_type                                          | comment                                            |
| date                          | string                                             |                                                    |
|                               | NULL                                               | NULL                                               |
| # Detailed Table Information  | NULL                                               | NULL                                               |
| Database:                     | raghav                                             | NULL                                               |
| OwnerType:                    | USER                                               | NULL                                               |
| Owner:                        | hive                                               | NULL                                               |
| CreateTime:                   | Tue Dec 19 11:05:48 PST 2023                       | NULL                                               |
| LastAccessTime:               | UNKNOWN                                            | NULL                                               |
| Retention:                    | 0                                                  | NULL                                               |
| Location:                     | hdfs://hostname:8020/user/raghav/query_data        | NULL                                               |
| Table Type:                   | EXTERNAL_TABLE                                     | NULL                                               |
| Table Parameters:             | NULL                                               | NULL                                               |
|                               | EXTERNAL                                           | TRUE                                               |
|                               | bucketing_version                                  | 2                                                  |
|                               | discover.partitions                                | true                                               |
|                               | numFiles                                           | 14                                                 |
|                               | numPartitions                                      | 7                                                  |
|                               | numRows                                            | 0                                                  |
|                               | proto.class                                        | org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto |
|                               | proto.maptypes                                     | org.apache.hadoop.hive.ql.hooks.proto.MapFieldEntry |
|                               | rawDataSize                                        | 0                                                  |
|                               | totalSize                                          | 62336619013                                        |
|                               | transient_lastDdlTime                              | 1703012748                                         |
|                               | NULL                                               | NULL                                               |
| # Storage Information         | NULL                                               | NULL                                               |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageSerDe | NULL                                               |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat | NULL                                               |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                                               |
| Compressed:                   | No                                                 | NULL                                               |
| Num Buckets:                  | -1                                                 | NULL                                               |
| Bucket Columns:               | []                                                 | NULL                                               |
| Sort Columns:                 | []                                                 | NULL                                               |
| Storage Desc Params:          | NULL                                               | NULL                                               |
|                               | serialization.format                               | 1                                                  |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
{code}
 


> Reading proto data more than 2GB from multiple splits fails
> -----------------------------------------------------------
>
>                 Key: HIVE-28026
>                 URL: https://issues.apache.org/jira/browse/HIVE-28026
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0-beta-1
>         Environment:   
>            Reporter: Raghav Aggarwal
>            Assignee: Raghav Aggarwal
>            Priority: Major
>              Labels: pull-request-available
>
> {*}Query{*}: select * from _<table_name>_
> {*}Explanation{*}:
> Running the above query on a Hive proto table spawns multiple Tez 
> containers to process the data. If a container is assigned multiple HDFS 
> splits and the combined size of the decompressed data exceeds 2GB, the 
> query fails with the following error:
> {code:java}
> "While parsing a protocol message, the input ended unexpectedly in the middle 
> of a field.  This could mean either that the input has been truncated or that 
> an embedded message misreported its own length." {code}
>  
> This happens because of the following statement in 
> _[CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]_:
> _byteLimit += totalBytesRetired + pos;_
> _byteLimit_ overflows the int range because _totalBytesRetired_ keeps a 
> running count of all the bytes read so far, and the CodedInputStream is 
> initialized only once per container: 
> [https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96]
>  
> This is different from the issue reproduced in 
> [https://github.com/zabetak/protobuf-large-message], where a single proto 
> data file is larger than 2GB. In my case, multiple files together add up 
> to more than 2GB.
> CC [~zabetak] 
> *Limitation:*
> This fix does not resolve the issue described in 
> [https://github.com/protocolbuffers/protobuf/issues/11729].
> Here is the beeline command that was run:
>  
> {code:java}
> beeline -u 'jdbc:hive2://hostnames/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;thrift.client.max.message.size=2147483647' \
>   --showHeader=false --outputformat=tsv2 \
>   -e "select * from raaggarw.proto_hive_query_data where executionmode='MR' and otherinfo['CONF'] != 'NULL'" >> ./output
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
