Alex Behm has posted comments on this change.

Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading
......................................................................


Patch Set 11:

(40 comments)

http://gerrit.cloudera.org:8080/#/c/5148/11/fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java
File fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java:

Line 29:  * ids. This is internally implemented as a static object to share the 
storageIDs across
Several sentences start with "This"; it's sometimes not clear what "This" 
references.

Would be nice to list the reasons in bullet points, something like this:

The storage UUID to disk id mapping is global (shared across tables) for the 
following reasons:
1. Mention consistent scan range to disk thread assignment
2. Mention memory benefit
3. any more?


Line 44:     // sequential 0-based integer disk id used by the hdfs scanners. 
This assumes that
used by the BE scanners.


Line 46:     private ConcurrentHashMap<String, Integer> storageIdToInt =
storageUuidToDiskId


Line 50:     // with the corresponding latest 0-based integer ID. Given a 
storage UUID, we first
with -> to


Line 51:     // look in 'storageIdtoInt' map if it already exists, else we 
generate a new ID for
no need for this second sentence, that's what we do in getDiskId()


Line 64:       if (Strings.isNullOrEmpty(storageUuid)) return diskId;
let's move this check to the caller and require the storageUuid to be non-null 
and non-empty in here


Line 68:       // across the cluster, it is possible that we have a good hit 
rate.
it is possible -> we expect to have a good hit rate.


Line 72:         // At this point, it could be possible due to race, that 
storageId might've been
// Mapping might have been added by another thread that entered the 
synchronized block first.


Line 76:         // No mapping exists, create a new disk ID for 'storageId'
storageUuid


Line 80:           // Host hasn't been populated yet. Start with a 0-based id.
// First disk id of this host.
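For illustration, the per-host 0-based assignment these comments describe could be sketched as below. This is a toy (class and field names are made up, coarse `synchronized` instead of the `ConcurrentHashMap` plus finer-grained locking the real DiskIdMapper uses), just to show the intended UUID-to-id semantics:

```java
import java.util.HashMap;
import java.util.Map;

public class DiskIdMapperSketch {
    // Global mapping from storage UUID to a 0-based disk id.
    private final Map<String, Integer> storageUuidToDiskId = new HashMap<>();
    // Next unassigned 0-based disk id, tracked per host.
    private final Map<String, Integer> nextDiskIdPerHost = new HashMap<>();

    synchronized int getDiskId(String host, String storageUuid) {
        Integer diskId = storageUuidToDiskId.get(storageUuid);
        if (diskId != null) return diskId;  // already mapped
        // First disk id of this host starts at 0.
        int next = nextDiskIdPerHost.getOrDefault(host, 0);
        nextDiskIdPerHost.put(host, next + 1);
        storageUuidToDiskId.put(storageUuid, next);
        return next;
    }

    public static void main(String[] args) {
        DiskIdMapperSketch m = new DiskIdMapperSketch();
        System.out.println(m.getDiskId("hostA", "uuid-1"));  // 0
        System.out.println(m.getDiskId("hostA", "uuid-2"));  // 1
        System.out.println(m.getDiskId("hostB", "uuid-3"));  // 0
        System.out.println(m.getDiskId("hostA", "uuid-1"));  // 0, cached
    }
}
```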


http://gerrit.cloudera.org:8080/#/c/5148/11/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 289:    * given directory 'dirPath'. Only blocks corresponding to paths 
from 'partsByPath'
Mention that the block metadata is effectively dropped and re-created.

Please provide a brief summary of the steps involved in this function, 
something like:
1. Clear block metadata of partitions
2. Call HDFS to list all files
3. etc.


Line 307:       RemoteIterator<LocatedFileStatus> fileStatusIter = 
fs.listFiles(dirPath, true);
move right before the while loop


Line 308:       // Reset the current state of partitions under dirPath from 
perPartitionFileDescMap_
Reset -> Clear


Line 310:       for (Path partDir: partsByPath.keySet()) {
Iterate over the entrySet(); you are doing an unnecessary get(partDir) below.
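The difference in a standalone toy (map contents and types are illustrative, not the actual partsByPath):

```java
import java.util.HashMap;
import java.util.Map;

public class EntrySetDemo {
    public static void main(String[] args) {
        Map<String, Integer> partsByPath = new HashMap<>();
        partsByPath.put("/warehouse/p1", 3);
        partsByPath.put("/warehouse/p2", 5);

        // Iterating keySet() forces a second hash lookup per key:
        //   for (String partDir : partsByPath.keySet()) {
        //     Integer v = partsByPath.get(partDir);  // extra lookup
        //   }
        // Iterating entrySet() hands you key and value together:
        int total = 0;
        for (Map.Entry<String, Integer> e : partsByPath.entrySet()) {
            total += e.getValue();  // no second lookup
        }
        System.out.println(total);
    }
}
```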


Line 321:         String fileName = fileStatus.getPath().getName();
move closer to where it's used, i.e. right above L336


Line 322:         if (!FileSystemUtil.isValidPartitionFile(fileStatus)) 
continue;
isValidDataFile()


Line 323:         // At this point, file status points to a valid file under 
the directory. Now we
// Find the partition that this file belongs (if any).


Line 325:         Path partPathDir = fileStatus.getPath().getParent();
move closer to where it's used, i.e. L357


Line 330:         // Skip if this file doesn't correspond to any of the 
existing partitions.
Skip if this file does not belong to any known partition.


Line 332:           LOG.debug("File " + fileStatus.getPath().toString() + " 
doesn't correspond " +
add if to check log level
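The suggested guard, sketched with java.util.logging standing in for the log4j-style LOG in the Impala code (the point is the same: skip the string concatenation when the level is disabled):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogGuardDemo {
    private static final Logger LOG = Logger.getLogger("demo");

    public static void main(String[] args) {
        String path = "/warehouse/p1/file0";  // illustrative path
        // Only build the message string if the level is actually enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("File " + path + " doesn't belong to a known partition.");
        }
        // At the default (INFO) level the guard is false:
        System.out.println(LOG.isLoggable(Level.FINE));
    }
}
```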


Line 333:               " to a valid partition. Skipping metadata load for this 
file.");
to a known partition


Line 361:         
Preconditions.checkState(perPartitionFileDescMap_.containsKey(partPathDirName));
avoid duplicate lookup with get() + Preconditions.checkNotNull()
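A minimal sketch of the single-lookup pattern, with java.util.Objects.requireNonNull standing in for Guava's Preconditions.checkNotNull (map contents are made up):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class LookupDemo {
    public static void main(String[] args) {
        Map<String, String> perPartitionFileDescMap = new HashMap<>();
        perPartitionFileDescMap.put("/warehouse/p1", "fd1");

        // Instead of containsKey() followed by get() (two hash lookups),
        // call get() once and assert the result is non-null.
        String fd = Objects.requireNonNull(
            perPartitionFileDescMap.get("/warehouse/p1"));
        System.out.println(fd);
    }
}
```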


Line 363:         // Update the partitions metadata corresponding to this file 
descriptor
partitions' metadata that this file belongs to.


Line 371:         LOG.warn("Unknown disk id count for filesystem " + fs + ":" + 
unknownDiskIdCount);
add if to check log level


Line 374:       throw new RuntimeException("Error loading block md for 
directory "
block metadata


Line 380:    * Loads the disk IDs for BlockLocation 'location' and its 
corresponding file block.
Comment should mention the storage UUID to disk id mapping.


Line 386:     String[] hosts;
Do you know if these are the unique hosts? I'm wondering if storageIds and 
hosts are guaranteed to have the same length.


Line 394:     for (int i =0; i < storageIds.length; ++i) {
int i = 0; (space after "=")


Line 410:     for (Path partPath: partsByPath.keySet()) {
Iterate over Map.Entry; you are doing a get() in the next line.


Line 412:       // For each valid file in the partPath, synthesize the block 
metadata.
For each data file in the partPath...


Line 427:         // We assume all the partitions corresponding to the same 
partition path, have
For the purpose of synthesizing block metadata, we assume that all partitions 
with the same location have the same file format.


Line 447:           perPartitionFileDescMap_.put(partition.getLocation(), 
fileDescMap);
move put() inside the if (fileDescMap == null) block


Line 673:    * Create HdfsPartition objects corresponding to 'msPartitions' and 
add them to this
Briefly list the high-level steps, in particular that we first create the 
partition objects based on the HMS partitions, and then in a second step 
populate the partitions' file metadata.


Line 742:         // Check if the partition directory is a child dir of tbl 
directory location.
remove, comment only states what the code does


Line 752:     // Load the block metadata
remove comment (doesn't add anything)


Line 764:       throw new CatalogException("Error loading block location md for 
partition " +
md -> metadata


Line 827:   public HdfsPartition createPartition(StorageDescriptor 
storageDescriptor,
change this to createAndLoadPartition()? Then we can get rid of the flag in 
createPartition(), and this function becomes createPartition() + 
loadMetadataAndDiskIds().


http://gerrit.cloudera.org:8080/#/c/5148/11/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 276:    * Returns true if the file corresponding to 'fileStatus' is a 
valid file as per
is a valid data file


Line 278:    * directory or a hidden file starting with . or _ or if it is a 
LZO index file.
hidden file is enough, no need to mention specific prefixes


Line 291:   public static boolean supportsBlockLocationApi(FileSystem fs) {
supportsBlockLocations()?

The API part is confusing because we are not really referencing a specific API. 
If you prefer to reference the API, then let's reference a specific method.


-- 
To view, visit http://gerrit.cloudera.org:8080/5148
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-HasComments: Yes
