[email protected] has posted comments on this change. ( http://gerrit.cloudera.org:8080/21143 )
Change subject: IMPALA-12856: Event processor should ignore processing partition with empty partition values
......................................................................

Patch Set 3:

(2 comments)

An input from my side.

http://gerrit.cloudera.org:8080/#/c/21143/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21143/2//COMMIT_MSG@7
PS2, Line 7: IMPALA-12856: Event processor should ignore processing partition
Not related to this patch.

Currently, ReloadEvent for a partition doesn't use the field reloadPartition_ obtained from the reload message. Instead, we reload the partitions by part names (using the getPartitionsByNames HMS API). getPartitionsByNames (direct SQL) internally makes multiple queries to the db to build the partition objects: it queries part ids from the partition names, joins the partitions, sds, and serdes tables for those part ids to form partition objects, and then queries the partition_key_vals table to get the partition values for those part ids and populates them in the already formed partition objects.

Maybe the partition was deleted just before the partition_key_vals table query, and that resulted in the values being empty in the partition? I am not certain yet. I am trying to simulate it.

If we had used the reloadPartition_ received in ReloadEvent, would we still have hit this issue?
https://github.com/apache/impala/blob/691604b1d1f0e5f0dc95fdb4976cf826135e08fb/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L3004C13-L3004C29

http://gerrit.cloudera.org:8080/#/c/21143/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21143/3//COMMIT_MSG@7
PS3, Line 7: IMPALA-12856: Event processor should ignore processing partition
I have checked the org.apache.hadoop.hive.metastore.MetaStoreDirectSql#getPartitionsViaPartNames() HMS method. It is invoked when the getPartitionsByNames API is called. It internally executes multiple queries to the backend db to build the partition objects.
First, it queries part ids from the partition names, then runs a query joining the partitions, sds, and serdes tables for those part ids to create partition objects, and then queries the partition_key_vals table to get the partition values for those part ids and populates them in the already created partition objects. In fact, many other queries are executed to populate the other fields in the partition objects (partition params, storage descriptor, storage descriptor params, serde, serde params, sort cols, bucket cols, skewed cols, etc.).

If the partition is deleted in another transaction just before the partition_key_vals query step of the getPartitionsByNames call above, the call can return a partition object with empty values. The same issue can happen for other fields too: partition params, storage descriptor params, serde params, sort cols, bucket cols, and skewed cols can also be empty.

I could reproduce it with a Hive unit test having 2 threads (2 clients): one thread doing getPartitionsByNames and the other doing dropPartition.

-- 
To view, visit http://gerrit.cloudera.org:8080/21143
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id2469930ccd74948325f1723bd8b2bd6aad02d09
Gerrit-Change-Number: 21143
Gerrit-PatchSet: 3
Gerrit-Owner: Sai Hemanth Gantasala <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Sai Hemanth Gantasala <[email protected]>
Gerrit-Comment-Date: Fri, 15 Mar 2024 18:13:32 +0000
Gerrit-HasComments: Yes
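
For illustration only, the race described in the PS3 comment (a multi-step partition read interleaved with a drop) can be sketched with a toy in-memory model. This is not HMS or Impala code: the class PartitionReadRaceSketch, its two maps standing in for the partitions and partition_key_vals tables, the betweenSteps hook that simulates the other client, and the hasEmptyValues guard are all hypothetical names made up for this sketch. The drop is injected deterministically between the steps instead of using real threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a non-atomic, multi-query partition read. Hypothetical; not HMS code. */
class PartitionReadRaceSketch {
  // Stands in for the partitions table: partition name -> part id.
  final Map<String, Long> partIdsByName = new HashMap<>();
  // Stands in for the partition_key_vals table: part id -> partition values.
  final Map<Long, List<String>> keyValsByPartId = new HashMap<>();

  /** Simplified partition object: only the name and the values. */
  static final class Part {
    final String name;
    final List<String> values = new ArrayList<>();
    Part(String name) { this.name = name; }
  }

  void addPartition(long id, String name, List<String> values) {
    partIdsByName.put(name, id);
    keyValsByPartId.put(id, new ArrayList<>(values));
  }

  /** Removes the partition from both "tables", like a drop in another transaction. */
  void dropPartition(String name) {
    Long id = partIdsByName.remove(name);
    if (id != null) {
      keyValsByPartId.remove(id);
    }
  }

  /**
   * Reads a partition in separate steps. betweenSteps runs after the partition
   * object is formed but before its values are loaded, simulating another client.
   */
  Part getPartitionByName(String name, Runnable betweenSteps) {
    Long id = partIdsByName.get(name);            // step 1: part id from name
    if (id == null) {
      return null;
    }
    Part part = new Part(name);                   // step 2: form the partition object
    betweenSteps.run();                           // a concurrent drop may land here
    List<String> vals = keyValsByPartId.get(id);  // step 3: load partition values
    if (vals != null) {
      part.values.addAll(vals);
    }
    return part;
  }

  /** Hypothetical guard: true if the partition should be skipped by a caller. */
  static boolean hasEmptyValues(Part p) {
    return p == null || p.values.isEmpty();
  }

  public static void main(String[] args) {
    PartitionReadRaceSketch db = new PartitionReadRaceSketch();
    db.addPartition(1L, "p=1", Arrays.asList("1"));

    // The drop happens between step 2 and step 3 of the read.
    Part racy = db.getPartitionByName("p=1", () -> db.dropPartition("p=1"));
    System.out.println("values after racy read: " + racy.values);
    System.out.println("skip partition: " + hasEmptyValues(racy));
  }
}
```

In this model the racy read returns a non-null partition object whose values list is empty, which is the shape of object a caller would need to detect and skip. Whether the real direct-SQL path interleaves in exactly this way depends on the transaction isolation of the backend db, which the sketch does not model.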
