Dimitris Tsirogiannis has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8235 )

Change subject: IMPALA-5429: Multi threaded block metadata loading
......................................................................


Patch Set 5:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@215
PS5, Line 215:   // Block metadata loading stats for a single HDFS path.
nit: File/block (since we're also loading/refreshing files). Also, you may want 
to change the name of the private class to reflect that, e.g. 
PathMetadataLoadingStats or something like that.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@217
PS5, Line 217: loadedFiles_
You may want to add a comment here. What is loaded vs refreshed? Is the one 
superset of the other. Good to clarify to avoid confusion.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@218
PS5, Line 218: _
I believe the convention is that we don't use '_' for public members.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@231
PS5, Line 231: Runnable
I don't know how this is used later on, but alternatively you can make 
PathBlockMetadataLoadRequest implement Callable<PathBLockMdLoadingStats>, hence 
returning the stats when calling call(). Now you seem to access the stats only 
through the debugString() function.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@247
PS5, Line 247: Blocks on the loadBlockMetadata() call.
Not following this comment. run() either calls refreshBlockMetadata() or 
loadBlockMetadata(), so I can't really interpret what this comment is saying.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@333
PS5, Line 333: loadBlockMetadata
I know I am guilty for some of these names but maybe rename this to 
"resetAndLoadBlockMetadata"? Then it is more clear what the differences are 
between this function and rerfreshBlockMetadata().


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@363
PS5, Line 363: numUnknownDiskIds
Are you overriding or incrementing the value of numUnknownDiskIds in the 
create() function?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@368
PS5, Line 368: newFileDescs
Hm, that doesn't seem particularly safe (i.e. using the same list for every 
partition). Are we certain that any other partition modification operation 
(e.g. alter partition set location) will not try to override the file 
descriptors, thereby affecting all the other partitions?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@380
PS5, Line 380: changed
             :    * mtime
It's not just the changed mtime that we're looking for in order to determiner 
modification, so you may want to either remove this or mention all the criteria 
we use. I prefer the former.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@383
PS5, Line 383: The initial table load still uses the listFiles()
             :    * on the data directory that fetches both the FileStatus as 
well as BlockLocations in
             :    * a single call.
remove


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@398
PS5, Line 398: get(0)
Comment why we take the file descriptors of one partition and that it doesn't 
really matter which one we choose.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@399
PS5, Line 399: public String apply(FileDescriptor desc) {
             :               return desc.getFileName();
             :             }
Add @Override and make it a single line (if it fits)


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@448
PS5, Line 448: (fd == null) || (fd.getFileLength() != status.getLen()) ||
             :       (fd.getModificationTime() != status.getModificationTime());
I think we were also checking if the partition was cached. Isn't this check 
needed anymore?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@455
PS5, Line 455: refreshFileMetadata
Maybe call it refreshPartitionStorageMetadata? Overall, it may make sense to 
replace "File/Block" with "Storage" whenever it makes sense. Thoughts?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@694
PS5, Line 694: Exception
Why this change?


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@805
PS5, Line 805: ExecutorService partitionLoadingPool = 
Executors.newFixedThreadPool(threadPoolSize);
What is the benefit of having a one thread pool per loaded table instead of a 
global thread pool that all tables can use? In the former approach, it will be 
hard to impose any constraints on the maximum number of concurrent load 
operations. Also, there is some overhead in initializing and tearing down the 
thread pool which you pay every time a table is refreshed.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@821
PS5, Line 821: debugString
This will get information from loadingStats but how do you know that 
loadingStats is not null? I think it may be safer to check and call debugString 
in the catch section of run() and don't make any assumptions about the internal 
state of PathBlockMetadataLoadRequest here. If you do that, then you don't even 
need a Map in L807, only a collection of Futures suffices.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1237
PS5, Line 1237: HashMap
nit: Map


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1303
PS5, Line 1303: Returns the HdfsPartition objects associated with the specified 
list of partition
              :    * names.
Update comment.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java@626
PS5, Line 626: private Table addHdfsPartitions(Table tbl, List<Partition> 
partitions)
             :       throws CatalogException {
             :     Preconditions.checkNotNull(tbl);
             :     Preconditions.checkNotNull(partitions);
             :     if (!(tbl instanceof HdfsTable)) {
             :       throw new CatalogException("Table " + tbl.getFullName() + 
" is not an HDFS table");
             :     }
             :     HdfsTable hdfsTable = (HdfsTable) tbl;
             :     List<HdfsPartition> hdfsPartitions = 
hdfsTable.createAndLoadPartitions(partitions);
             :     for (HdfsPartition hdfsPartition: hdfsPartitions) {
             :       catalog_.addPartition(hdfsPartition);
             :     }
             :     return hdfsTable;
             :   }
Lot's of duplication with the previous one. How expensive would it be to simply 
have something like:
addHdfsPartition(Table tbl, Partition partition) {
 addHdfsPartitions(tbl, List.newArrayList(partition));
}


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java
File fe/src/main/java/org/apache/impala/util/ListMap.java:

http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java@59
PS5, Line 59: synchronized (list_) {
Why synchronized on list_? This confuses the locking scheme. If you want you 
can simply do synchronized (this). Same in L76.


http://gerrit.cloudera.org:8080/#/c/8235/5/fe/src/main/java/org/apache/impala/util/ListMap.java@78
PS5, Line 78: list_ = Collections.synchronizedList(list);
Hm, I think this new implementation is way too complicated. I'd start with 
simply making the getIndex and populate methods synchronized. If we see this 
becoming a bottleneck, then try to optimize it.



--
To view, visit http://gerrit.cloudera.org:8080/8235
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Gerrit-Change-Number: 8235
Gerrit-PatchSet: 5
Gerrit-Owner: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Jim Apple <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Wed, 11 Oct 2017 22:10:27 +0000
Gerrit-HasComments: Yes

Reply via email to