Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11193 )

Change subject: IMPALA-7425: Change incremental stats to pull from catalogd.
......................................................................


Patch Set 4:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/11193/4/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

http://gerrit.cloudera.org:8080/#/c/11193/4/be/src/common/global-flags.cc@210
PS4, Line 210: If used, the flag must "
             :     "be set on both catalogd and all impalad coordinators.");
Do we have any way to emit warnings if this setting ends up inconsistent in a 
deployment? eg could we include the setting as a bool in the TCatalog object 
that the catalog broadcasts and issue a warning on the impalads if they see 
that it doesn't match their own setting? Afraid of people setting this on one 
role but not the other and then seeing either incorrect results or none of the 
expected statestore topic size benefits.

If this seems like it'll be a big hassle maybe not worth the effort and we can 
just make sure it's documented clearly.


http://gerrit.cloudera.org:8080/#/c/11193/4/be/src/service/fe-support.cc
File be/src/service/fe-support.cc:

http://gerrit.cloudera.org:8080/#/c/11193/4/be/src/service/fe-support.cc@549
PS4, Line 549: extern "C" JNIEXPORT jbyteArray JNICALL
side node gripe: we should figure out some way to refactor all of these 
almost-identical methods in this file some day!


http://gerrit.cloudera.org:8080/#/c/11193/4/be/src/service/fe-support.cc@606
PS4, Line 606:     {const_cast<char*>("NativeFeTestInit"), 
const_cast<char*>("()V"),
nit: the reformatting of all the unrelated code is going to make this a pain in 
the butt to backport (guaranteed conflict)


http://gerrit.cloudera.org:8080/#/c/11193/4/common/thrift/CatalogService.thrift
File common/thrift/CatalogService.thrift:

http://gerrit.cloudera.org:8080/#/c/11193/4/common/thrift/CatalogService.thrift@380
PS4, Line 380: // identified by their id. The caller's table_version_number is 
used by the catalog
per offline chat earlier, I think it makes sense to just always fetch all 
partitions, unless we found some use case where we don't end up needing them 
all.

On a semi-related note, should we consider having the catalogd merge the HLLs 
before sending them back? Do we actually need partition-level HLLs or do we 
just need the aggregate? Or perhaps the aggregate while _excluding_ the ones 
currently being recomputed? Definitely not to do in this patch, just thought 
about it.


http://gerrit.cloudera.org:8080/#/c/11193/4/common/thrift/CatalogService.thrift@385
PS4, Line 385:   // The CatalogService service ID this request came from.
but the request came from an impalad.. maybe clearer to be a little more 
specific about the purpose that this is the service from which the below 
version number was generated, so that we can detect if the version and such are 
out of date?

Though maybe this is no longer relevant if we move to just fetching all 
partitinos of a table, by name, and we don't need the version anymore


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@a513
PS4, Line 513:
nit: some unrelated removal of blank lines might make merging conflicts hard


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@449
PS4, Line 449: new HashSet<Long>()
use Collections.emptySet() here to emphasize that it's only an in-parameter? 
When I first read this call site I thought this was some kind of "initial 
output set" or something. Might be worth adding a /*excludedPartitions=*/ 
named-parameter style comment to make that more clear.


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@452
PS4, Line 452: partitionStats != null
why not make partitionStats always non-null and just return an empty map in the 
case that we dont have any?


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@497
PS4, Line 497: relevant
should this be _other_ partitions? ie we need to fetch stats from the ones that 
aren't being recomputed.


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@595
PS4, Line 595: getPartitionStats
maybe rename to "getOrFetchPartitionStats" to emphasize that this one is the 
higher-level one?

btw can this method and the other both be static, perhaps?


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@600
PS4, Line 600: maxNumStats
this is more of an "expected" than actually a hard max, right?


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@601
PS4, Line 601: checkState
if this precondition fails, it's actually a bad argument, not a bad state, 
right?


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@634
PS4, Line 634: ps
I think if you return an empty map here it'll avoid some null checks up above


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@644
PS4, Line 644: Exception
this will also catch the above AnalysisException, resulting in a doubly-wrapped 
AnalysisException. Perhaps change this one to be more specifically TException? 
Or add a Throwables.propagateIfInstanceOf(e, AnalysisException.class); into the 
catch block?


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@645
PS4, Line 645: :
nit: no need for the ':' in the exception message itself. When printed it will 
already say someting like: "Caused by:" with the chained exception following.


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java:

http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@392
PS4, Line 392: catalog_service_id != catalogServiceId_
this should be a .equals() rather than reference-equals, no?


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@393
PS4, Line 393: ()
%s not ()


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@406
PS4, Line 406:    // Table could be null if it does not exist anymore.
             :     if (table == null) return stats;
should we throw an exception in that case? Otherwise, if you ran COMPUTE STATS 
concurrent with a drop, the impalad would schedule stats computation over _all_ 
partitions that it had cached at the time you issued the statement (thinking 
none of them had stats). I think you'd probably prefer a failure saying table 
was deleted.


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@419
PS4, Line 419:           + " loaded table: " + request.table_name);
maybe include the 'cause' of the IncompleteTable?


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/service/FeSupport.java
File fe/src/main/java/org/apache/impala/service/FeSupport.java:

http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/service/FeSupport.java@333
PS4, Line 333:   LOG.info(
             :         String.format
nit: weird formatting

Instead you can also just use LOG.info("with {} placeholders", obj1, ...)
which is a bit shorter and standard


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/service/JniCatalog.java
File fe/src/main/java/org/apache/impala/service/JniCatalog.java:

http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/service/JniCatalog.java@276
PS4, Line 276: Lists.<String>newArrayList
nit: Collections.emptyList


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/service/JniCatalog.java@279
PS4, Line 279: Lists.newArrayList
ImmutableList.of(e.getMessage())


http://gerrit.cloudera.org:8080/#/c/11193/4/fe/src/main/java/org/apache/impala/service/JniCatalog.java@280
PS4, Line 280:       throw e;
don't you _not_ want to re-throw, so you can serialize it back into the 
response to the backend? Or, if you throw, then don't bother setting it in the 
response and rely on the backend catching the java exception?


http://gerrit.cloudera.org:8080/#/c/11193/4/tests/custom_cluster/test_pull_stats.py
File tests/custom_cluster/test_pull_stats.py:

http://gerrit.cloudera.org:8080/#/c/11193/4/tests/custom_cluster/test_pull_stats.py@30
PS4, Line 30:   
@CustomClusterTestSuite.with_args("--pull_incremental_statistics=true",
this would read better if you also use the impalad_args= kwarg even though it 
might be optional



--
To view, visit http://gerrit.cloudera.org:8080/11193
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9d564808ca5157afe4e091909ca6cdac76e60d6e
Gerrit-Change-Number: 11193
Gerrit-PatchSet: 4
Gerrit-Owner: Vuk Ercegovac <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tianyi Wang <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Tue, 28 Aug 2018 03:17:00 +0000
Gerrit-HasComments: Yes

Reply via email to