Alex Behm has posted comments on this change. Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading ......................................................................
Patch Set 2: (18 comments) High-level comments. Let's get through them first and then I can dig in deeper. http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java: Line 27: * This class encapsulates the logic required for mapping HDFS Much fluff, simplify to: Maps HDFS storage-UUIDs to per-host 0-based, sequential disk ids required by Impala backends. Line 29: * by the backend. required by Impala backends Line 31: public class DiskIdMapper { Make this a singleton to clarify the intended use. Line 36: // This is intentionally implemented as a static object to share the storageIDs across move this part to the class comment Line 38: // consistent across all the tables, without which, we can potentially confuse the also mention that the static solution requires the least amount of memory Line 48: private static ConcurrentHashMap<String, AtomicInteger> Don't think we need a ConcurrentHashMap or AtomicInteger here, because whenever we update this entry, we also need to protect storageIdToInt, so we need a coarser-grained lock. Line 53: * TODO (Bharath): This method is still prone to some race conditions. Fix it before Correct. I don't think the ConcurrentHashMaps will help us because we need to atomically update both maps sometimes. Line 71: storageIdGenerator.put(host, new AtomicInteger(diskId)); Couple of races here like you already pointed out. http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: Line 294: // Keeps track of all the parition paths that are filtered to add block metadata. this comment style for classes and functions /** * */ Line 368: if (!pathMd.filteredParitionPaths.contains(fileStatus.getPath())) continue; Looks like this contains() and the indexOf() below can become very expensive. Instead of the 3 arrays in pathMd we should consider using a HashMap from path to a class that contains the relevant metadata (FD and format). Line 423: for (int i=0; i < fileMd.pathFds.size(); ++i) { i = 0 (add spaces) Line 458: // Only build the storage IDs fs instances that support the BlockLocation API. ... for fs instances ... Line 720: private void loadAllPartitions( After reading through this a few times I'm still confused about the high-level code flow. It would be great to summarize the high-level steps in the function comment. My understanding is that we do something like this: 1. createPartition() for every partition; that populates pathMd; internally createPartition() lists all files in the corresponding partition directory 2. load the block location metadata based on the contents of pathMd The process above seems a bit backwards to me. In the first step we list all files under the partition's directory... and then we do the same thing again in step 2. Intuitively, I was thinking it should look something like this: 1. collect all dirs to call DFS.listFiles() on 2. for each dir, call listFiles(), examine the contents and map them to an msPartition. Based on that complete info do some filtering/checks and finally createPartition(). Another chat would probably help me understand the ideas here better. Line 777: if (dirsToLoad.contains(partDir) || positive is more readable to me if (!dirsToLoad.contains() && isChildPath()) { // Partition with a non-default data directory. dirsToLoad.add(); } http://gerrit.cloudera.org:8080/#/c/5148/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java: Line 275: * Returns true if the filesystem supports BlockLocation API. What API is it specifically? LocatedFileStatus.getBlockLocations()? Line 277: public static boolean supportsBlockLocationAPI(FileSystem fs) { supportsBlockLocationApi Line 406: public static boolean isPathChild(Path p1, Path p2) { isChildPath(Path p, Path parent) Line 408: Path parentDir = p1.getParent(); I think this can be improved by using the Path.getDepth(). You can do getParent() a fixed number of times and then do one equals() comparison. (Basically you walk up the path starting at p1 until the parentDir is at the same depth as p2) -- To view, visit http://gerrit.cloudera.org:8080/5148 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath Vissapragada <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Bharath Vissapragada <[email protected]> Gerrit-HasComments: Yes
