Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................


Patch Set 6:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 37: 
> why not:
Done


Line 50:     // the storage ID of a particular disk is unique across all the 
nodes in the cluster.
> make members and methods non-static since we are using a singleton
Done


Line 63:      * the initial metadata load. Figure out ways to fix it using 
finer locking scheme.
> storageUuid
Done


Line 72:       // amount of data loading is done and storageIdtoInt is 
populated with storage IDs
> I did some more reading and I'm afraid this "double checked locking" scheme
Done


http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 120: import java.io.IOException;
> I think these might need cleanup.
My editor is messing up these imports. Let me try fixing them by hand.


Line 351:       Map<String, Map<String, FileDescriptor>> newFileDescMap = 
Maps.newHashMap();
> What's the benefit of populating this temporary map and then merging it int
I did that initially but the problem happens when one of the files is deleted. 
For example, consider the following sequence of actions.

- create partition p1 with file f1
- perPartitionFileDescMap_ contains f1
- insert overwrite p1 select * from random_table, say this overwrites f1 and 
creates f2.
- In such a case, we loop over p1's files and add f2 to perPartitionFileDescMap_, 
but the stale f1 entry remains at the end.
- So we need to scan all entries in perPartitionFileDescMap_ and figure out the 
difference with listStatus() and remove them. This is a costly operation.
- Easier way is to build a newFdMap with the current snapshot of listStatus() 
and replace it in perPartitionFileDescMap_

I also feel the code is more readable this way, though that's arguable.
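To make the snapshot-and-replace idea concrete, here is a minimal sketch. All names are illustrative (plain Strings stand in for FileDescriptor, and the map is a simplified stand-in for perPartitionFileDescMap_); this is not the actual HdfsTable code:

```java
import java.util.HashMap;
import java.util.Map;

public class FileDescMapDemo {
  // Stand-in for perPartitionFileDescMap_: partition dir -> (file name -> fd).
  static Map<String, Map<String, String>> perPartitionFileDescMap = new HashMap<>();

  // Snapshot-and-replace: build a fresh map from the current listStatus()
  // snapshot and swap it in, so files deleted since the last load (like f1
  // after an INSERT OVERWRITE) vanish without diffing against the old map.
  static void reloadPartition(String partDir, Map<String, String> listing) {
    perPartitionFileDescMap.put(partDir, new HashMap<>(listing));
  }

  public static void main(String[] args) {
    Map<String, String> initial = new HashMap<>();
    initial.put("f1", "fd1");
    reloadPartition("p1", initial);

    // INSERT OVERWRITE: f1 is replaced by f2 on disk.
    Map<String, String> overwritten = new HashMap<>();
    overwritten.put("f2", "fd2");
    reloadPartition("p1", overwritten);

    // Stale f1 is gone; no scan for deletions was needed.
    System.out.println(perPartitionFileDescMap.get("p1").keySet());  // prints [f2]
  }
}
```

The alternative (merging into the existing map) would require a second pass to find and remove entries absent from the new listing, which is the costly diff described above.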


Line 382:           // Loop over all blocks in the file.
> comment doesn't add anything, remove
Done


Line 384:             Preconditions.checkNotNull(loc);
> Not your change, but this looks weird, why would it be null? Remove?
Done


Line 386:             String[] blockHostPorts = loc.getNames();
> Let's just use loc.getNames() and loc.getHosts() inside the loop in L403. I
We use them in the Preconditions check on L396, in the loop condition on L403, 
and for index-based access on L405 and L408. Let me know if you still think we 
should remove them.


Line 395:                     
Sets.newHashSet(Arrays.asList(loc.getCachedHosts()));
> I think we can get rid of the asList() here
Nice catch. Done.


Line 399:             // to hostMap_/hostList_. The host ID (index in to the 
hostList_) for each
> update comment, "hostMap_" doesn't seem to exist
Remnants of old clean-up. This logic is now merged into hostIndex_. Updated 
comment accordingly.


Line 413:           // Remember the THdfsFileBlocks and corresponding 
BlockLocations.  Once all the
> is this comment still accurate?
Yes, we basically add all these blocks to perFsFileBlocks and run getDiskId() 
on them in bulk.
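For reference, the batching shape looks roughly like this. It's a minimal sketch: Strings stand in for the real FsKey/FileBlocksInfo types, and groupBlocksByFs is an illustrative name, not a method in the patch:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFsBlocksDemo {
  // Group block names by a filesystem key so that disk IDs can be resolved
  // with one bulk call per filesystem rather than one call per block.
  static Map<String, List<String>> groupBlocksByFs(String[][] blocks) {
    Map<String, List<String>> perFsFileBlocks = new LinkedHashMap<>();
    for (String[] b : blocks) {
      // b[0] is the filesystem key, b[1] the block name.
      perFsFileBlocks.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b[1]);
    }
    return perFsFileBlocks;
  }

  public static void main(String[] args) {
    String[][] blocks = {
        {"hdfs://ns1", "blk1"}, {"hdfs://ns1", "blk2"}, {"hdfs://ns2", "blk3"}};
    Map<String, List<String>> perFs = groupBlocksByFs(blocks);
    // One bulk pass (stand-in for the bulk getDiskId() work) per filesystem.
    System.out.println(perFs.size());  // prints 2
  }
}
```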


Line 431:           numHdfsFiles_ += fileDescs.size();
> Why do the number of files and bytes always keep increasing?
I think it's due to the current design. Reload is implemented as 
dropPartition() + createPartition(). We also call resetPartitions() in a few 
places to zero the counters. Basically we never delete entries from it, we just 
overwrite them with new values.


Line 458:                 fileStatus.getModificationTime());
> indent 4 (and same issue elsewhere)
Fixed my editor settings now. Done.


Line 468:         // All the partitions corresponding to the same partition 
path, have the same
> I don't think this is necessarily true, but you can state that we assume it
Ok, modified the comment.


Line 483:         for (HdfsPartition partition: partitions) {
> This function looks much more expensive than before. We loop over all parti
Although it looks like O(N^2), it isn't. 'partitions' contains only the 
partitions that map to a single location, so if we flatten 
List<List<Partition>> we get each partition of the table exactly once. 
Basically we don't loop like a conventional nested loop ("for each partition, 
for each fileStatus"); we loop through the pruned partition list, i.e., the 
partitions corresponding to that path.
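To make the linearity concrete, a minimal sketch (illustrative names; plain strings stand in for HdfsPartition and Path):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartsByPathDemo {
  // Group partition names by their location; a stand-in for partsByPath.
  static Map<String, List<String>> groupByPath(String[][] partitions) {
    Map<String, List<String>> partsByPath = new LinkedHashMap<>();
    for (String[] p : partitions) {
      // p[0] is the partition name, p[1] its location.
      partsByPath.computeIfAbsent(p[1], k -> new ArrayList<>()).add(p[0]);
    }
    return partsByPath;
  }

  public static void main(String[] args) {
    String[][] partitions = {
        {"p1", "/tbl/a"}, {"p2", "/tbl/a"}, {"p3", "/tbl/b"}};
    Map<String, List<String>> partsByPath = groupByPath(partitions);
    // The outer loop visits each path once; the inner loop visits only the
    // partitions mapped to that path, so total inner iterations equal the
    // number of partitions (3), not paths * partitions.
    int visited = 0;
    for (List<String> parts : partsByPath.values()) visited += parts.size();
    System.out.println(visited);  // prints 3
  }
}
```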


Line 780:     Map<FsKey, FileBlocksInfo> blocksToLoad = Maps.newHashMap();
> unused?
Done


Line 784:     HashMap<Path, List<HdfsPartition>> filteredParts = 
Maps.newHashMap();
> The var name filteredParts is a little confusing to me. How about partsByPa
Done


Line 830:         if (filteredParts.containsKey(partDir)) 
filteredParts.get(partDir).add(partition);
> use {}
Done


Line 833:         // Check if the partition directory's is a child dir of 
hdfsBaseDir_
> typo: directory's -> directory
Done


Line 835:             !FileSystemUtil.isChildPath(partDir, 
getHdfsBaseDirPath())) {
> use tblLocation, that's clearer (and in comment above)
Done


Line 836:           // Add the partition to the list of paths to load block MD.
> Comment only describes the code below, maybe say something like:
Done


Line 868:       loadBlockMetadata(fs, location, filteredParts, 
fileBlocksToLoad);
> Having separate loadBlockMetadata() and loadDiskIds() doesn't seem to make 
Yeah, it doesn't make sense. I wanted to refactor that but forgot. Thanks for 
pointing it out.


Line 934:     boolean isMarkedCached = isMarkedCached_;
> looks unused, are you sure nothing got broken here?
Yep, I wanted to discuss this. I'll ping you offline.


Line 953:       try {
> not your change, but you can do this instead:
Done


Line 1583:     boolean isMarkedCached = isMarkedCached_;
> looks unused, sure nothing is broken here?
Same as above. I wanted to discuss this offline.


Line 1628:         totalHdfsBytes_ += hdfsPart.getSize();
> Why can't this stay inside addPartition()? In the loading case the size and
Yeah, now that I think about it, it can be in addPartition(). Moved them.


http://gerrit.cloudera.org:8080/#/c/5148/6/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 347:       LOG.debug("Loading block md for " + name_ + " directory " + 
dirPath.toString());
> Please add If(LOG.isDebugEnabled()) to avoid calling the dirPath.toString()
Done. Added it in a couple of places that can potentially spam.
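For reference, the reason the guard helps: without it, the message string (and any toString() calls) is built even when debug logging is off. A minimal sketch with a counter and flag standing in for a real logger (the project's actual Logger class is not shown here):

```java
public class DebugGuardDemo {
  // Counts how many times the message string is actually constructed.
  static int messagesBuilt = 0;
  static boolean debugEnabled = false;  // stand-in for LOG.isDebugEnabled()

  static String buildMessage(String name, String dir) {
    messagesBuilt++;  // the concatenation/toString() cost we want to avoid
    return "Loading block md for " + name + " directory " + dir;
  }

  static void logDebug(String name, String dir) {
    // Guarding with the enabled check skips message construction entirely
    // when debug logging is off, avoiding churn on hot paths.
    if (debugEnabled) {
      System.out.println(buildMessage(name, dir));
    }
  }

  public static void main(String[] args) {
    logDebug("tbl", "/warehouse/tbl");
    System.out.println(messagesBuilt);  // prints 0: nothing built while disabled
  }
}
```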


http://gerrit.cloudera.org:8080/#/c/5148/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 423:     if (p.isRoot()) return false;
> isn't the equals() below sufficient?
Wouldn't equals() alone return true for isChildPath(root, root)?
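The guard matters because the parent walk bottoms out at root, so root would otherwise match itself. A minimal sketch of the logic with plain strings (the real code uses Hadoop Path; names here are illustrative, not the actual FileSystemUtil implementation):

```java
public class ChildPathDemo {
  // Walk up p's ancestors and compare each against 'parent'. The root
  // guard stops isChildPath(root, root) from returning true, and also
  // terminates the walk when no ancestor matches.
  static boolean isChildPath(String p, String parent) {
    if (p.equals("/")) return false;  // the isRoot() guard in question
    String up = parentOf(p);
    if (parent.equals(up)) return true;
    return isChildPath(up, parent);
  }

  static String parentOf(String p) {
    int i = p.lastIndexOf('/');
    return (i <= 0) ? "/" : p.substring(0, i);
  }

  public static void main(String[] args) {
    System.out.println(isChildPath("/a/b", "/a"));  // prints true
    System.out.println(isChildPath("/", "/"));      // prints false, via the guard
  }
}
```

Without the guard, parentOf("/") yields "/" again, so isChildPath("/", "/") would compare "/" against "/" and wrongly return true.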


-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Bharath Vissapragada <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-HasComments: Yes
