[email protected] has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21143 )

Change subject: IMPALA-12856: Event processor should ignore processing 
partition with empty partition values
......................................................................


Patch Set 3:

(2 comments)

An input from my side.

http://gerrit.cloudera.org:8080/#/c/21143/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21143/2//COMMIT_MSG@7
PS2, Line 7: IMPALA-12856: Event processor should ignore processing partition
Not related to this patch. Currently, ReloadEvent for a partition doesn't use
the field reloadPartition_ obtained from the reload message. Instead, we reload
the partitions by partition name (using the getPartitionsByNames HMS API).
The getPartitionsByNames API (direct SQL) internally makes multiple queries to
the db to build partition objects: it queries part ids from the partition
names, joins the partitions, sds and serdes tables for those part ids to form
partition objects, and then queries the partition_key_vals table to get the
partition values for those part ids and populates them in the already formed
partition objects. Maybe the partition is deleted just before the
partition_key_vals table query, and that resulted in the values being empty in
the partition? I am not certain yet. I am trying to simulate it.

If we had made use of the reloadPartition_ received in ReloadEvent, probably we
might have hit this issue?

https://github.com/apache/impala/blob/691604b1d1f0e5f0dc95fdb4976cf826135e08fb/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L3004C13-L3004C29


http://gerrit.cloudera.org:8080/#/c/21143/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21143/3//COMMIT_MSG@7
PS3, Line 7: IMPALA-12856: Event processor should ignore processing partition
I have checked the
org.apache.hadoop.hive.metastore.MetaStoreDirectSql#getPartitionsViaPartNames()
HMS method. It is invoked when the getPartitionsByNames API is called.
It internally executes multiple queries against the backend db to build the
partition objects. First it queries part ids from the partition names, then it
runs a query joining the partitions, sds and serdes tables for those part ids
to create partition objects, and then it queries the partition_key_vals table
to get the partition values for those part ids and populates them in the
already created partition objects.
In fact, many other queries are executed to populate the other fields in the
partition objects (like partition params, storage descriptor, storage
descriptor params, serde, serde params, sort cols, bucket cols, skewed cols,
etc.).

If the partition is deleted in another transaction just before the
partition_key_vals table query step of the getPartitionsByNames call above, the
call can return a partition object with empty values. The same issue can
happen for other fields too: partition params, storage descriptor params,
serde params, sort cols, bucket cols and skewed cols can also be empty.
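
A minimal sketch of how a caller could skip such half-built partitions,
assuming it knows how many partition keys the table has. PartitionValuesGuard
and keepComplete are hypothetical names for illustration only; this is not the
check in the patch and not an Impala or HMS API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class EmptyValuesGuardDemo {
  public static void main(String[] args) {
    // One complete value list and one that came back empty.
    List<List<String>> fetched = Arrays.asList(
        Arrays.asList("2024", "03"),
        Collections.<String>emptyList());
    System.out.println(PartitionValuesGuard.keepComplete(fetched, 2));
  }
}

// Hypothetical helper, not part of Impala or Hive.
class PartitionValuesGuard {
  /** Returns only the value lists that hold exactly one value per partition key. */
  static List<List<String>> keepComplete(
      List<List<String>> fetchedValues, int numPartitionKeys) {
    List<List<String>> complete = new ArrayList<>();
    for (List<String> values : fetchedValues) {
      // An empty (or short) list means the partition object was only partly populated.
      if (values != null && values.size() == numPartitionKeys) {
        complete.add(values);
      }
    }
    return complete;
  }
}
```

Comparing against the key count rather than only testing for emptiness is a
choice made for this sketch. Such a check would not catch the other fields
mentioned above (params, sort cols, etc.) being empty.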

I could reproduce it with a Hive unit test having 2 threads (2 clients): one
thread doing getPartitionsByNames and the other thread doing dropPartition.
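
The interleaving can be illustrated with a toy model. This is a sketch only and
not the Hive unit test mentioned above: two in-memory maps stand in for the
partitions and partition_key_vals tables, TwoClientRace is a hypothetical
class, and latches force the drop to land between the two read steps, which a
real test cannot guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

class TwoClientRaceDemo {
  public static void main(String[] args) {
    System.out.println("values seen by reader: " + TwoClientRace.run());
  }
}

// Toy model; none of this is Hive or Impala code.
class TwoClientRace {
  /** Runs one reader and one dropper thread; returns the values the reader saw. */
  static List<String> run() {
    // Stand-ins for the two tables, keyed by part id.
    Map<Long, String> partitions = new ConcurrentHashMap<>();
    Map<Long, List<String>> partitionKeyVals = new ConcurrentHashMap<>();
    partitions.put(1L, "p=1");
    partitionKeyVals.put(1L, Arrays.asList("1"));

    CountDownLatch objectBuilt = new CountDownLatch(1);
    CountDownLatch dropped = new CountDownLatch(1);
    List<String> seenValues = new ArrayList<>();

    Thread reader = new Thread(() -> {
      try {
        // Step 1: the partition row exists, so a partition object would be created.
        boolean found = partitions.containsValue("p=1");
        objectBuilt.countDown();
        dropped.await(); // let the drop happen between the two steps
        // Step 2: the values lookup now finds no rows for that part id.
        List<String> vals = partitionKeyVals.get(1L);
        if (found && vals != null) seenValues.addAll(vals);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    Thread dropper = new Thread(() -> {
      try {
        objectBuilt.await();
        partitions.remove(1L);
        partitionKeyVals.remove(1L);
        dropped.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    reader.start();
    dropper.start();
    try {
      reader.join();
      dropper.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for threads", e);
    }
    return seenValues;
  }
}
```

In this model the reader finds the partition in step 1 but ends up with an
empty values list, which matches the symptom described above.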



--
To view, visit http://gerrit.cloudera.org:8080/21143

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id2469930ccd74948325f1723bd8b2bd6aad02d09
Gerrit-Change-Number: 21143
Gerrit-PatchSet: 3
Gerrit-Owner: Sai Hemanth Gantasala <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Sai Hemanth Gantasala <[email protected]>
Gerrit-Comment-Date: Fri, 15 Mar 2024 18:13:32 +0000
Gerrit-HasComments: Yes
