Philip Zeyliger has posted comments on this change. ( )

Change subject: IMPALA-5152: Introduce metadata loading phase

Patch Set 3:


I was curious how this works, so took a look through. It certainly makes a ton 
of sense to me to explicitly load the metadata as one phase of the query.
Commit Message:
PS3, Line 62: Testing:
            : - A core/hdfs run succeeded
            : - No new tests were added because the existing functional tests
            :   provide good coverage of various metadata loading scenarios.
            : - The issue reported in IMPALA-5152 is basically impossible now.
            :   Adding FE unit tests for that bug specifically would require
            :   ugly changes to the new code to enable such testing.
Your view of these tests is obviously much deeper than mine. We expect that for 
most queries, the number of round trips (on a clean "cache") is exactly one, 
but for views involving views, it's one plus the number of views (or fewer). 
Does it make sense to explicitly count round trips (perhaps expose it as a 
metric of "number of rounds of prioritized loading" in the profile) and concoct 
one of these deliberately?
File fe/src/main/java/org/apache/impala/analysis/
PS3, Line 284:   public static List<TableName> getCandidateTables(List<String> 
path, String sessionDb) {
This seems easy to unit test. Would it make sense to do so?
File fe/src/main/java/org/apache/impala/analysis/
PS3, Line 61:    * Subclasses should override this method as necessary.
Is it possible that someone will forget to do so in the future in an unpleasant 
way? (You could make this abstract and force implementors to provide the empty 
function definition explicitly.)
File fe/src/main/java/org/apache/impala/planner/
PS3, Line 231:     stmt.collectTableRefs(tblRefs, true);
I don't know what the usual style here is, but you might want to consider:

"true /* only from clause */" or even creating an enum or creating a method 
called collectTableRefsFromClauseOnly().

Basically, boolean arguments are really hard to remember.
File fe/src/main/java/org/apache/impala/service/
PS1, Line 964:       if (tbl == null) continue;
Perhaps %d requested and %d loaded tables? I think the fraction (e.g., "10/8") 
would be meaningless without the source.
File fe/src/main/java/org/apache/impala/service/
PS3, Line 827:   public static final class StmtTableCache {

Is this actually a cache?
PS3, Line 866:     final long debugLoggingFrequency = 10;
nit: rename to include units? (e.g., debugLoggingFrequencyMillis).

In a previous life, I've made these difficult-but-configurable via Java system 


Up to you.
PS3, Line 894:    // Loop until all the missing tables are loaded in the 
Impalad's catalog cache.
             :     // In every iteration of this loop we wait for one catalog 
update to arrive.
             :     while (!missingTbls.isEmpty()) {
Is there a limit to the number of iterations we're willing to do? Similarly, 
can we check if the query has been cancelled?

I worry that a bug here will cause an infinite loop, whereas in practice, if 
catalogUpdateCount > 10, something is wrong?
PS3, Line 906:         LOG.warn("Catalog restart detected while waiting for 
table metadata.");
For debugging, would it be useful to log the old and new versions of the 
catalog? Those don't seem to be obviously exposed right now (but could maybe be 
exposed via a toString() implementation.)
PS3, Line 928:       // 3) Periodically retry to avoid a hang due to 
anomalies/bugs, e.g.,
Do you feel that this case requires more explicit logging?
PS3, Line 959:         tbl = catalog.getTable(tblName.getDb(), 
There's a "throw on error" argument to getTable(). Should that be used instead 
of the try/catch?
PS3, Line 960:       } catch (CatalogException e) {
             :         Preconditions.checkState(e instanceof 
is this just

catch (DatabaseNotFoundException ignored) {
  // ...

i.e., if you get anything else, it'll rethrow via the Precondition.

Might be nice to have a comment about why, though.
PS3, Line 964:       if (tbl == null) continue;
Is there a chance that not recording tbl==null causes extra round trips?

For example, let's say we resolved a view, and that view refers to table "x". 
getCandidateTables() might generate "y.x" which isn't loaded because it doesn't 
exist, and perhaps we already knew that, because "y.x" was generated in a 
previous pass as well.

I've not thought this all the way through, but seems like it's possible.

To view, visit
To unsubscribe, visit

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I68d32d5acd4a6f6bc6cedb05e6cc5cf604d24a55
Gerrit-Change-Number: 8958
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Bharath Vissapragada <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Philip Zeyliger <>
Gerrit-Reviewer: Vuk Ercegovac <>
Gerrit-Comment-Date: Mon, 12 Feb 2018 20:20:05 +0000
Gerrit-HasComments: Yes

Reply via email to