[Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

Alex Behm (Code Review) Mon, 21 Nov 2016 21:31:18 -0800

Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................



Patch Set 2:

(18 comments)

High-level comments. Let's get through them first and then I can dig in deeper.

http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 27:  * This class encapsulates the logic required for mapping HDFS
Much fluff, simplify to: Maps HDFS storage-UUIDs to per-host 0-based, 
sequential disk ids required by Impala backends.


Line 29:  * by the backend.
required by Impala backends


Line 31: public class DiskIdMapper {
Make this a singleton to clarify the intended use.


Line 36:     // This is intentionally implemented as a static object to share 
the storageIDs across
move this part to the class comment


Line 38:     // consistent across all the tables, without which, we can 
potentially confuse the
also mention that the static solution requires the least amount of memory


Line 48:     private static ConcurrentHashMap<String, AtomicInteger>
Don't think we need a ConcurrentHashMap or AtomicInteger here, because whenever 
we update this entry, we also need to protect storageIdToInt, so we need a 
coarser-grained lock.


Line 53:      * TODO (Bharath): This method is still prone to some race 
conditions. Fix it before
Correct. I don't think the ConcurrentHashMaps will help us because we need to 
atomically update both maps sometimes.


Line 71:         storageIdGenerator.put(host, new AtomicInteger(diskId));
Couple of races here like you already pointed out.


http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 294:   // Keeps track of all the parition paths that are filtered to add 
block metadata.
this comment style for classes and functions

/**
 *
 */


Line 368:         if 
(!pathMd.filteredParitionPaths.contains(fileStatus.getPath())) continue;
Looks like this contains() and the indexOf() below can become very expensive.

Instead of the 3 arrays in pathMd we should consider using a HashMap from path 
to a class that contains the relevant metadata (FD and format).


Line 423:     for (int i=0; i < fileMd.pathFds.size(); ++i) {
i = 0 (add spaces)


Line 458:       // Only build the storage IDs fs instances that support the 
BlockLocation API.
... for fs instances ...


Line 720:   private void loadAllPartitions(
After reading through this a few times I'm still confused about the high-level 
code flow. It would be great to summarize the high-level steps in the function 
comment.

My understanding is that we do something like this:
1. createPartition() for every partition; that populates pathMd; internally 
createPartition() lists all files in the corresponding partition directory
2. load the block location metadata based on the contents of pathMd

The process above seems a bit backwards to me. In the first step we list all 
files under the partition's directory... and then we do the same thing again in 
step 2.

Intuitively, I was thinking it should look something like this:

1. collect all dirs to call DFS.listFiles() on
2. for each dir, call listFiles(), examine the contents and map them to an 
msPartition. Based on that complete info do some filtering/checks and finally 
createPartition().

Another chat would probably help me understand the ideas here better.


Line 777:         if (dirsToLoad.contains(partDir) ||
positive is more readable to me

if (!dirsToLoad.contains() && isChildPath()) {
  // Partition with a non-default data directory.
  dirsToLoad.add();
}


http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 275:    * Returns true if the filesystem supports BlockLocation API.
What API is it specifically? LocatedFileStatus.getBlockLocations()?


Line 277:   public static boolean supportsBlockLocationAPI(FileSystem fs) {
supportsBlockLocationApi


Line 406:   public static boolean isPathChild(Path p1, Path p2) {
isChildPath(Path p, Path parent)


Line 408:     Path parentDir = p1.getParent();
I think this can be improved by using the Path.getDepth(). You can do 
getParent() a fixed number of times and then do one equals() comparison.

(Basically you walk up the path starting at p1 until the parentDir is at the 
same depth as p2)


-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4172/IMPALA-3653: Improvements to block metadata loading

Reply via email to