Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19391 )
Change subject: IMPALA-11812: Deduplicate column schema in hmsPartitions ...................................................................... Patch Set 7: (10 comments) http://gerrit.cloudera.org:8080/#/c/19391/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19391/6//COMMIT_MSG@15 PS6, Line 15: interned > nit. internalized "interned" seems being used more widely, e.g. https://stackoverflow.com/questions/10578984/what-is-java-string-interning "git grep interned" also show me some usages: be/src/kudu/util/trace_metrics.cc: "Too many interned strings: " << *g_intern_map; be/src/kudu/util/trace_metrics.h: // pool. If this string has already been interned, returns a pointer shell/ext-py/thrift-0.14.2/src/ext/module.cpp:/** Pointer to interned string to speed up attribute lookup. */ http://gerrit.cloudera.org:8080/#/c/19391/6//COMMIT_MSG@18 PS6, Line 18: In real > nit. In reality. Done http://gerrit.cloudera.org:8080/#/c/19391/6//COMMIT_MSG@21 PS6, Line 21: > nit. replaces Done http://gerrit.cloudera.org:8080/#/c/19391/6//COMMIT_MSG@22 PS6, Line 22: This patch replaces column list in StorageDescriptor of hmsPartitions > nit. to remove the duplications. Done http://gerrit.cloudera.org:8080/#/c/19391/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/19391/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1862 PS6, Line 1862: fetchPartitionsByName > I think the name is ok since the 'ByName' refers to the names of the partit Yeah, the key argument is the list of partition names. http://gerrit.cloudera.org:8080/#/c/19391/6/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/19391/6/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@5724 PS6, Line 5724: LOG.info("Updated cache directive id for {}/{} partitions for table {}", > nit. may append "for table <t>" for clarity. Done http://gerrit.cloudera.org:8080/#/c/19391/6/fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java File fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java: http://gerrit.cloudera.org:8080/#/c/19391/6/fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java@208 PS6, Line 208: LOG.info("Fetched {}/{} partitions for table {}.{}", numDone, partNames.size(), : msTbl.getDbName(), msTbl.getTableName()); > Actually, the number of logged messages would be partNames.size()/maxPartit This will generate #batches of lines. So if only one batch is fetched, one line is logged. I think it's ok and it doesn't worth complicate the logging code for the single batch case. Note that each batch fetch takes 10s, and it takes more time if there are gc pressures. So the progress log is helpful for showing what's happening now. http://gerrit.cloudera.org:8080/#/c/19391/6/fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java@230 PS6, Line 230: e the p > May need to clarify. Do you mean use? Yeah, Impala doesn't use it and will ignore it even if it's inconsistent with the table level schema. http://gerrit.cloudera.org:8080/#/c/19391/6/tests/custom_cluster/test_wide_table_operations.py File tests/custom_cluster/test_wide_table_operations.py: http://gerrit.cloudera.org:8080/#/c/19391/6/tests/custom_cluster/test_wide_table_operations.py@36 PS6, Line 36: if cls.exploration_strategy() != 'exhaustive': Yeah, the progress is helpful so I added some progress logs for catalogd. execute_query_expect_success() and hdfs CLIs will log the command. However, they are only shown at the end if the test fails. For getting the progress, we can only check logs of impalad and catalogd currently. Each FieldSchema takes 24 bytes in a small heap (<32GB). Without the fix, catalogd will hold at least 50000 (parts) * 2000 (cols) = 100,000,000 FieldSchema instances, which already takes more than 2GB and will cause OOM finally. Note that we set -Xmx2g in the jvm args. I'll add these to the comments. > how long does table creation itself take ? Table creation only takes a few seconds. Uploading files to HDFS takes a few minutes. However, the majority of the time is spent in query execution. I check subprocesses of pytest to see whether HDFS upload finishes. http://gerrit.cloudera.org:8080/#/c/19391/6/tests/custom_cluster/test_wide_table_operations.py@71 PS6, Line 71: > nit. Is this the time after the patch or before? After. Before the fix, catalogd will OOM and fail the statement. -- To view, visit http://gerrit.cloudera.org:8080/19391 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I511ecca0ace8bea4c24a19a54fb0a75390e50c4d Gerrit-Change-Number: 19391 Gerrit-PatchSet: 7 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Fri, 30 Dec 2022 03:39:34 +0000 Gerrit-HasComments: Yes
