Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/11193 )
Change subject: IMPALA-7425: Change incremental stats to pull from catalogd. ...................................................................... Patch Set 2: (10 comments) only got partially through but have some meetings for the rest of the morning and early afternoon so figured i'd send along what i got so far http://gerrit.cloudera.org:8080/#/c/11193/2/common/thrift/CatalogService.thrift File common/thrift/CatalogService.thrift: http://gerrit.cloudera.org:8080/#/c/11193/2/common/thrift/CatalogService.thrift@379 PS2, Line 379: statistcs typo http://gerrit.cloudera.org:8080/#/c/11193/2/common/thrift/CatalogService.thrift@382 PS2, Line 382: 1: required CatalogObjects.TTableName table_name : 2: required list<i64> partition_ids it seems risky to use partition IDs without being tied to a particular version number of the table. If there is a concurrent refresh or reload of the table, we'd end up pulling stats for the wrong partitions or incorrectly deciding that there were no stats available when there were. Partition names are a more stable identifier http://gerrit.cloudera.org:8080/#/c/11193/2/common/thrift/CatalogService.thrift@387 PS2, Line 387: statistcs typo http://gerrit.cloudera.org:8080/#/c/11193/2/common/thrift/CatalogService.thrift@388 PS2, Line 388: // If there was an error, an empty partition_stats mapping is returned along with - seems like it would be easier to just make the returned map optional so youu dont need to return an empty one in the error case. - can you clarify what happens when you request stats for a partition that does not have them? Do you get a map entry or just a "missing" entry? http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@336 PS2, Line 336: LOG.info("analyzing compute stats"); probably didn't mean to leave this in at INFO level http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@451 PS2, Line 451: !RuntimeEnv.INSTANCE.isTestEnv() why not isTestEnv? http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@456 PS2, Line 456: fetchedPartitionStats = fetchPartitionStats(analyzer, partStatsToFetch); it seems some of this code ends up duplicated in a few places. Did you consider adding a flag to 'loadPartitions' and 'loadAllPartitions' which says 'NEED_INCREMENTAL_STATS'? Or maybe this code could end up being a little cleaner if you always follow the same code path to "fetch stats" into 'fetchedPartitionStats', but in the "old" code path, you just "fetch" them by copying the reference from the partition object? I think that could reduce the branching factor of the code below and allow you to consolidate all the code into 'fetchPartitionStats' http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@415 PS2, Line 415: // Table is not loaded. : if (!table.isLoaded()) { : throw new CatalogException("Partition statistics not available for not yet " : + " loaded table: " + request.table_name); can this be the case? I thought if it's FeFsTable then it's guaranteed to be loaded http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@428 PS2, Line 428: tableName.getDb_name() + "." + tableName.getTable_name() nit: fsTable.getFullName() http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/11193/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@524 PS2, Line 524: may not be serialized and sent from catalogd to impalads. I find this usage of the word "may" unclear -- not sure if you mean "it's not permitted" or "it is possible that". Can you rephrase to something like "Depending on configuration, it is possible that incremental stats are not sent" or somesuch? -- To view, visit http://gerrit.cloudera.org:8080/11193 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9d564808ca5157afe4e091909ca6cdac76e60d6e Gerrit-Change-Number: 11193 Gerrit-PatchSet: 2 Gerrit-Owner: Vuk Ercegovac <[email protected]> Gerrit-Reviewer: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tianyi Wang <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Fri, 24 Aug 2018 16:30:33 +0000 Gerrit-HasComments: Yes
