Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/14037 )
Change subject: IMPALA-8839: Remove COLUMN_STATS_ACCURATE from properties ...................................................................... Patch Set 3: (15 comments) http://gerrit.cloudera.org:8080/#/c/14037/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14037/3//COMMIT_MSG@10 PS3, Line 10: Remove COLUMN_STATS_ACCURATE Can you add some comments here (and probably in the code too) about why this is needed? E.g. "COLUMN_STATS_ACCURATE is used by Hive to decide if a stats can be trusted enough to answer queries like count(*). Impala currently does not set property "numRows", so it should be only treated as an estimate". http://gerrit.cloudera.org:8080/#/c/14037/3//COMMIT_MSG@14 PS3, Line 14: Implemented the stats change for both acid/non-acid tables. I needed a bit of time to think about what "stats change" means here - I think that "stats changes above" would make it clearer. http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java File fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java: http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@110 PS3, Line 110: dbName, : String tblName These arguments could be taken from tbl http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@144 PS3, Line 144: getValidWriteIdInTxn Needing a writeIdList here seems weird, as I cannot express what those writeIds actually mean. I understand that it is needed to call the API for transactional tables, but a comment about this being a hack would make the intention clearer. http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@460 PS3, Line 460: private static String getValidWriteIdInTxn(IMetaStoreClient client, String dbName, name: writeID suggests a single number to me. I would prefer getValidWriteIdListInTxn http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@462 PS3, Line 462: InvalidOperationException, MetaException, TException The pattern in this file seems to be to catch the Hive/thrift related exception and wrap them into a TransactionException. For functions that are not transaction related we could use ImpalaRuntimeException. The good things about Impala exceptions is that they often do not have to be wrapped in try/catch in the caller function, as many functions throw these already. http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3711 PS3, Line 3711: nit: extra line http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3762 PS3, Line 3762: tableWriteId = MetastoreShim.allocateTableWriteId( : msClient.getHiveClient(), transactionId, : table.getDb().getName(), table.getName()); Do we need to allocate a new writeId here, or the writeId used for the insert could be reused? (provided that somehow it becomes available here) Allocating 2 writeIds for the same transaction + table seems a bit weird to me. I am ok with the current solution, but some comments about this decision would be useful here. http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3770 PS3, Line 3770: errorMessages I think that errorMessages was created specifically for HDFS caching related errors - these are non-fatal, and error messages contain instructions for the user on how to solve the issue. So for COLUMN_STATS_ACCURATE related errors I think we should not catch the exceptions and let them fail the query. http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3890 PS3, Line 3890: optional, refactor: Moving the next block to a function like UnsetPartitionColStatAccurate() could make updateCatalog() shorter. http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3942 PS3, Line 3942: nit: extra line http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4314 PS3, Line 4314: * Update table properties to remove the COLUMN_STATS_ACCURATE entry if it exist. nit: exists http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4316 PS3, Line 4316: UnsetTableColStats consistency: lower U naming: I think that unsetTableColStatsAccurate would be a better name, as unsetting colStats suggests me something like DROP STATS. http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4321 PS3, Line 4321: DO_NOT_UPDATE_STATS I think that this needs some explanation, see https://issues.apache.org/jira/browse/HIVE-15653 My understanding is that in the Hive 2 world, we did not want Hive to discard stats, even if they became inaccurate, as they could still be useful estimates. I am not sure that the same is true in Hive 3, as the Table object passed between Impala and HMS now contains colStats (added in HIVE-21078), so alter_table would not reset it even if DO_NOT_UPDATE_STATS is not there. Adding a comment like // See HIVE-15653 for details. // TODO: is this still needed in Hive 3? (see HIVE-21078) could give some directions for the future reader. http://gerrit.cloudera.org:8080/#/c/14037/3/tests/query_test/test_acid.py File tests/query_test/test_acid.py: http://gerrit.cloudera.org:8080/#/c/14037/3/tests/query_test/test_acid.py@104 PS3, Line 104: test_ext_statschg optional, code size reduction: the two .test files seem very similar, the only difference I see is in table creation. I see two ways to merge the 2 .test files (besides using common table names): - move table creation to python code - create the tables with Impala instead of Hive, and use query option DEFAULT_TRANSACTIONAL_TYPE to switch between non-ACID and insert only ACID tables. This can be done with vector.get_value('exec_option')['default_transactional_type'] = 'insert_only'. -- To view, visit http://gerrit.cloudera.org:8080/14037 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I13f4a77022a7112e10a07314359f927eae083deb Gerrit-Change-Number: 14037 Gerrit-PatchSet: 3 Gerrit-Owner: Yongzhi Chen <yc...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Comment-Date: Fri, 09 Aug 2019 18:59:50 +0000 Gerrit-HasComments: Yes