Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-4172/IMPALA-3653: Improvements to block metadata loading ......................................................................
IMPALA-4172/IMPALA-3653: Improvements to block metadata loading This patch improves the block metadata loading (locations and disk storage IDs) for partitioned and un-partitioned tables in the Catalog server. Without this patch: ------------------ We loop through each and every file in the table/partition directories and call getFileBlockLocations() on it to obtain the block metadata. This results in large number of RPC calls to the Namenode, especially for tables with large no. of files/partitions. With this patch: --------------- We move the block metadata querying to use listStatus() call which accepts a directory as input and fetches the 'BlockLocation' objects for every file recursively in that directory. This improves the metadata loading in the following ways. - For non-partitioned tables, we query all the BlockLocations in a single RPC call in the base table directory and load the corresponding disk IDs. - For partitioned tables, we query the BlockLocations for all the partitions residing under the base table directories in a single RPC and then load every partition with non-default partition directory separately. - REFRESH on a table reloads the block metadata from scratch for every data file every time. So it can be used as a replacement for invalidate in situations like HDFS block rebalancing which needs block metadata update. Also, this patch does away with VolumeIds returned by the HDFS NN and uses the new StorageIDs returned by the BlockLocation class. These StorageIDs are UUID strings and hence are mapped to a per-node 0-based index as expected by the backend. In the upcoming versions of Hadoop APIs, getFileBlockStorageLocations() is deprecated and instead the listStatus() returns BlockLocations with storage IDs embedded. This patch makes use of this improvement to reduce an additional RPC to the NN to fetch the storage locations. Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26 Reviewed-on: http://gerrit.cloudera.org:8080/5148 Reviewed-by: Bharath Vissapragada <bhara...@cloudera.com> Tested-by: Internal Jenkins --- M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java A fe/src/main/java/org/apache/impala/catalog/DiskIdMapper.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java 6 files changed, 367 insertions(+), 341 deletions(-) Approvals: Bharath Vissapragada: Looks good to me, approved Internal Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/5148 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ie127658172e6e70dae441374530674a4ac9d5d26 Gerrit-PatchSet: 20 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com> Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>