[Impala-ASF-CR] IMPALA-5429: Multi threaded block metadata loading

Alex Behm (Code Review) Mon, 16 Oct 2017 16:40:52 -0700

Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8235 )


Change subject: IMPALA-5429: Multi threaded block metadata loading
......................................................................


Patch Set 6:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc
File be/src/catalog/catalog.cc:

http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc@38
PS6, Line 38: DEFINE_int32(max_hdfs_parts_parallel_load, 5,
Let's spell out "parts" into "partitions" to avoid confusion.


http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc@40
PS6, Line 40:     "tables. Due to HDFS architectural limitations, it is 
unlikely to get a linear "
In the sentence "Due to HDFS architectural..." what does the "it" after the 
comma refer to? Better to spell it out, e.g.,

"the response time of block metadata loading is unlikely to improve beyond 5 
threads."


http://gerrit.cloudera.org:8080/#/c/8235/6/be/src/catalog/catalog.cc@42
PS6, Line 42: DEFINE_int32(max_s3_parts_parallel_load, 20,
Does ADLS fall into this or the HDFS bucket? What about other filesystems?


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@216
PS6, Line 216:   private class PathStorageMdLoadingStats {
Naming seems a little weird. I think "Storage" is too generic and "Block" is 
too specific. How about we settle on "FileMetadata", so we'd have:

FileMetadataLoadStats
FileMetadataLoadRequest


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@240
PS6, Line 240:     // If set to true, reloads the block metadata only when the 
underlying file
Please clarify "the underlying file"


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@413
PS6, Line 413:         ++loadStats.ignoredFiles;
I think we should distinguish the hidden file and already-up-to-date file 
cases, e.g. split ignoredFiles into 'hiddenFiles' and 'skippedFreshFiles' or 
similar


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@773
PS6, Line 773:    * Returns the thread pool size to load the block metadata of 
this table.
Command on how we distinguish between HDFS, S3 and other filesystems (using the 
table root path?), and what the behavior is for mixed-filesystem tables.


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@793
PS6, Line 793:     return Math.max(Math.min(numPaths, threadPoolSize), 1);
For my understanding, why is the max() needed here? Is it not be a precondition 
that numPaths > 0 and threadPoolSize > 0?


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@800
PS6, Line 800:    * metadata is reloaded, else it is loaded from scratch.
Difference between "reloaded" and "loaded from scratch" is not very clear. 
Maybe say "incrementally refreshed" vs. "reloaded from scratch"?


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@804
PS6, Line 804:     int numPathsToLoad = partsByPath.keySet().size();
partsByPath.size()?


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@812
PS6, Line 812:       for (Path p: partsByPath.keySet()) {
The tasks are ordered randomly. I wonder if submitting the tasks in a clustered 
fashion would be better or worse. Not relevant to this patch, but might be 
interesting as a follow-on investigation.


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@821
PS6, Line 821:         } catch (ExecutionException | InterruptedException e) {
We still consider the load successful if one of these fails. What do you think 
the right behavior should be? Is a partially loaded table an acceptable outcome?


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@822
PS6, Line 822:           LOG.error("Encountered an error loading block metadata 
for table: " +
Let's also dump the path


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@898
PS6, Line 898:    * the newly created HdfsPartitions in parallel.
remove "the"


http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/util/ListMap.java
File fe/src/main/java/org/apache/impala/util/ListMap.java:

http://gerrit.cloudera.org:8080/#/c/8235/6/fe/src/main/java/org/apache/impala/util/ListMap.java@38
PS6, Line 38:   private List<T> list_ = Collections.synchronizedList(new 
ArrayList<T>());
The CC requirements and behavior are not clear to me. Why is it not sufficient 
to make all methods synchronized?



--
To view, visit http://gerrit.cloudera.org:8080/8235
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 6
Gerrit-Owner: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Mon, 16 Oct 2017 23:40:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5429: Multi threaded block metadata loading

Reply via email to