Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11227 )

Change subject: IMPALA-7047. Refreshing partitions should not make an RPC per 
file
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/11227/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11227/4//COMMIT_MSG@12
PS4, Line 12: since the prior load, it was instead calling it on every file.
> General question. In HDFS, we can tell if a file was added or removed. Once
In terms of files "changing", we do have to worry about a file being replaced 
with the same name as a pre-existing file. I think that's why we compare mtimes.

In terms of balancing blocks, we detect on the backend if we did a remote read 
of a block, and if so, we issue a SQL warning which tells the user to issue an 
INVALIDATE METADATA for that table.

Personally I think this sucks -- what we should probably do is have the backend 
notify the catalogd that the block locations for that file need to be 
refreshed, and do so automatically.

That said, since we're eventually hoping to move more towards a fine-grained 
caching with TTL, any such incorrect locality info would fall out of the cache 
within a bounded period of time, so maybe not a big deal anyway.


http://gerrit.cloudera.org:8080/#/c/11227/4//COMMIT_MSG@28
PS4, Line 28: I also tested this by pointing my dev box at a remote filesystem 
that
it's interesitng that there was a "reverse logic" bug in my patch, but I did 
test it and saw the desired effects. We should make sure to circle back and 
re-run a similar test to make sure I saw what I thought I was seeing.


http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@568
PS4, Line 568:         stats = refreshFileMetadata(partDir, 
Collections.singletonList(partition));
> Question about the implementation of refresh (that code is not in this patc
I think Java supports arrays up to 2B elements, but HDFS only supports really 
up to a few hundred million files. Even so, I think many other parts of Impala 
are likely to fall over way before this (probably at 100k files in a partition 
lots of things start to break).

I seem to recall that, in terms of the underlying HDFS RPC itself, it already 
splits it into smaller critical sections of the NS lock on the NN, and maybe 
even into multiple RPCs.

All that to say, I'm not too concerned about this point.


http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@371
PS4, Line 371:     assertEquals(0L, 
(long)opsCounts.getLong("op_get_file_block_locations"));
> Something is odd. This test did not fail for the "reverse polarity" bug tha
This part of the test is for "REFRESH" on the table level -- does that trigger 
the code path where there was the bug?


http://gerrit.cloudera.org:8080/#/c/11227/4/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@379
PS4, Line 379:     assertEquals(0L, 
(long)opsCounts.getLong("op_get_file_block_locations"));
this is the one that I'm surprised didn't fail with the bug. Perhaps we can add 
some log statements or use the debugger to step through the test and understand 
better what's happening.



--
To view, visit http://gerrit.cloudera.org:8080/11227
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2051b96599206164aaa06ecbdf64374c46eda956
Gerrit-Change-Number: 11227
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <par0...@yahoo.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
Gerrit-Comment-Date: Tue, 27 Nov 2018 19:29:15 +0000
Gerrit-HasComments: Yes

Reply via email to