[Impala-ASF-CR] IMPALA-8839: Remove COLUMN STATS ACCURATE from properties

Csaba Ringhofer (Code Review) Fri, 09 Aug 2019 12:00:35 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14037 )


Change subject: IMPALA-8839: Remove COLUMN_STATS_ACCURATE from properties
......................................................................


Patch Set 3:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/14037/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14037/3//COMMIT_MSG@10
PS3, Line 10: Remove COLUMN_STATS_ACCURATE
Can you add some comments here (and probably in the code too) about why this is 
needed? E.g. "COLUMN_STATS_ACCURATE is used by Hive to decide if a stats can be 
trusted enough to answer queries like count(*). Impala currently does not set 
property "numRows", so it should be only treated as an estimate".


http://gerrit.cloudera.org:8080/#/c/14037/3//COMMIT_MSG@14
PS3, Line 14: Implemented the stats change for both acid/non-acid tables.
I needed a bit of time to think about what "stats change" means here - I think 
that "stats changes above" would make it clearer.


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java
File fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java:

http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@110
PS3, Line 110: dbName,
             :      String tblName
These arguments could be taken from tbl


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@144
PS3, Line 144: getValidWriteIdInTxn
Needing a writeIdList here seems weird, as I cannot express what those writeIds 
actually mean.

I understand that it is needed to call the API for transactional tables, but a 
comment about this being a hack would make the intention clearer.


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@460
PS3, Line 460:   private static String getValidWriteIdInTxn(IMetaStoreClient 
client, String dbName,
name: writeID suggests a single number to me. I would prefer 
getValidWriteIdListInTxn


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/compat-hive-3/java/org/apache/impala/compat/MetastoreShim.java@462
PS3, Line 462: InvalidOperationException, MetaException, TException
The pattern in this file seems to be to catch the Hive/thrift related exception 
and wrap them into a TransactionException. For functions that are not 
transaction related we could use ImpalaRuntimeException. The good things about 
Impala exceptions is that they often do not have to be wrapped in try/catch in 
the caller function, as many functions throw these already.


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3711
PS3, Line 3711:
nit: extra line


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3762
PS3, Line 3762:               tableWriteId = MetastoreShim.allocateTableWriteId(
              :                   msClient.getHiveClient(), transactionId,
              :                   table.getDb().getName(), table.getName());
Do we need to allocate a new writeId here, or the writeId used for the insert 
could be reused? (provided that somehow it becomes available here) Allocating 2 
writeIds for the same transaction + table seems a bit weird to me.

I am ok with the current solution, but some comments about this decision would 
be useful here.


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3770
PS3, Line 3770: errorMessages
I think that errorMessages was created specifically for HDFS caching related 
errors - these are non-fatal, and error messages contain instructions for the 
user on how to solve the issue.

So for COLUMN_STATS_ACCURATE related errors I think we should not catch the 
exceptions and let them fail the query.


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3890
PS3, Line 3890:
optional, refactor: Moving the next block to a function like 
UnsetPartitionColStatAccurate() could make updateCatalog() shorter.


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@3942
PS3, Line 3942:
nit: extra line


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4314
PS3, Line 4314:    * Update table properties to remove the 
COLUMN_STATS_ACCURATE entry if it exist.
nit: exists


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4316
PS3, Line 4316: UnsetTableColStats
consistency: lower U

naming: I think that unsetTableColStatsAccurate would be a better name, as 
unsetting colStats suggests me something like DROP STATS.


http://gerrit.cloudera.org:8080/#/c/14037/3/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@4321
PS3, Line 4321: DO_NOT_UPDATE_STATS
I think that this needs some explanation, see 
https://issues.apache.org/jira/browse/HIVE-15653

My understanding is that in the Hive 2 world, we did not want Hive to discard 
stats, even if they became inaccurate, as they could still be useful estimates.

I am not sure that the same is true in Hive 3, as the Table object passed 
between Impala and HMS now contains colStats (added in HIVE-21078), so 
alter_table would not reset it even if DO_NOT_UPDATE_STATS is not there.

Adding a comment like
// See HIVE-15653 for details.
// TODO: is this still needed in Hive 3? (see HIVE-21078)
could give some directions for the future reader.


http://gerrit.cloudera.org:8080/#/c/14037/3/tests/query_test/test_acid.py
File tests/query_test/test_acid.py:

http://gerrit.cloudera.org:8080/#/c/14037/3/tests/query_test/test_acid.py@104
PS3, Line 104: test_ext_statschg
optional, code size reduction: the two .test files seem very similar, the only 
difference I see is in table creation.

I see two ways to merge the 2 .test files (besides using common table names):
- move table creation to python code
- create the tables with Impala instead of Hive, and use query option 
DEFAULT_TRANSACTIONAL_TYPE to switch between non-ACID and insert only ACID 
tables. This can be done with 
vector.get_value('exec_option')['default_transactional_type'] = 'insert_only'.



--
To view, visit http://gerrit.cloudera.org:8080/14037
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I13f4a77022a7112e10a07314359f927eae083deb
Gerrit-Change-Number: 14037
Gerrit-PatchSet: 3
Gerrit-Owner: Yongzhi Chen <yc...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Aug 2019 18:59:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8839: Remove COLUMN STATS ACCURATE from properties

Reply via email to