Yu-Wen Lai has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18043 )

Change subject: IMPALA-11032: Automatic Refresh of Metadata for Local Catalog 
after Compaction
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18043/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18043/2//COMMIT_MSG@10
PS2, Line 10: After compaction happened in Hive(HIVE ACID table), queries made in
            : Impala possibly fail with a FileNotFoundException if files already
            : removed by the Hive cleaner.
> IIRC, Impala only open transactions for DDL/DML operations. Do you know how
Thanks to Vihang and Quanlong for pointing out the problem. Impala does NOT
open transactions for SELECT queries, so this approach doesn't work all the
time...

Hive has a config that can delay the cleaner for some period of time, but we
don't know exactly how long the delay should be.
Given that this is time sensitive, I'm thinking we could make this feature
optional for now. If a flag is set, say auto_check_compaction, Impala would
open transactions for all queries on ACID tables and do the compaction
check. Any thoughts?
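
A minimal sketch of the gating proposed above, under stated assumptions. The
class, enum and method names are hypothetical and are not Impala code;
auto_check_compaction is only the tentative flag name from this comment, and
non-ACID tables are treated as out of scope and simply return false.

```java
/** Hypothetical sketch; not taken from the Impala code base. */
public class TxnGateSketch {
  /** Coarse statement kinds, enough to illustrate the decision. */
  public enum StmtType { SELECT, DML, DDL }

  /**
   * Decides whether a transaction would be opened for a statement on a table.
   * autoCheckCompaction models the proposed (not yet existing) flag.
   */
  public static boolean shouldOpenTransaction(
      StmtType type, boolean isAcidTable, boolean autoCheckCompaction) {
    // Simplification for this sketch: only ACID tables are considered.
    if (!isAcidTable) return false;
    // Behaviour described in this thread: DDL/DML already open transactions.
    if (type != StmtType.SELECT) return true;
    // Proposed change: SELECT opens a transaction only when the flag is set.
    return autoCheckCompaction;
  }
}
```

With the flag off, a SELECT on an ACID table keeps today's behaviour (no
transaction, no compaction check); with the flag on, it would open one.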


http://gerrit.cloudera.org:8080/#/c/18043/2/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java
File fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java:

http://gerrit.cloudera.org:8080/#/c/18043/2/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java@898
PS2, Line 898: List<PartitionRef> stalePartitions = directProvider_.checkLatestCompaction(
             :         refImpl.dbName_, refImpl.tableName_, refImpl, refToMeta);
> I think this introduces several HMS RPCs per query (some queries may call t
Taking the performance numbers on DWX as an example, this API call currently
takes 10 to 40 ms per table, depending on the number of partitions. I will
have a fix on the HMS side for an issue with this API, namely that we
currently need to pass all the partition names. That should bring the API
execution time close to 10 ms in all cases.

Even though we can make some improvements around this API, I understand that
it still introduces overhead that might not be negligible. It might be
better to introduce this feature behind a flag, together with a table
property to skip this check, as Quanlong suggested.
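
A rough sketch of how the flag and a per-table opt-out could combine, so that
the extra HMS call is only made when both allow it. Everything here is an
assumption for illustration: the class and method names are hypothetical, and
the table property key is made up since no property name has been proposed
in this thread.

```java
import java.util.Map;

/** Hypothetical sketch; not taken from the Impala code base. */
public class CompactionCheckPolicy {
  // Made-up table property key, used only for illustration.
  public static final String SKIP_CHECK_PROP = "impala.skip.compaction.check";

  /**
   * Returns true if the compaction check (and its HMS RPC) should run for a
   * table. autoCheckCompaction models the proposed flag.
   */
  public static boolean shouldCheckCompaction(
      boolean autoCheckCompaction, boolean isAcidTable,
      Map<String, String> tblProperties) {
    // Feature disabled or table not ACID: no extra RPC at all.
    if (!autoCheckCompaction || !isAcidTable) return false;
    // Per-table opt-out: any value other than "true" (case-insensitive)
    // leaves the check enabled.
    String skip = tblProperties.getOrDefault(SKIP_CHECK_PROP, "false");
    return !Boolean.parseBoolean(skip);
  }
}
```

Defaulting the property to "false" means tables are checked once the flag is
on, and only tables that explicitly opt out avoid the per-table cost.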



--
To view, visit http://gerrit.cloudera.org:8080/18043
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I173ea848917b6a41139b25b80677111463bfdc4b
Gerrit-Change-Number: 18043
Gerrit-PatchSet: 2
Gerrit-Owner: Yu-Wen Lai <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Sourabh Goyal <[email protected]>
Gerrit-Reviewer: Vihang Karajgaonkar <[email protected]>
Gerrit-Reviewer: Yu-Wen Lai <[email protected]>
Gerrit-Comment-Date: Mon, 29 Nov 2021 02:56:40 +0000
Gerrit-HasComments: Yes
