Csaba Ringhofer has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/20434 )

Change subject: IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()
......................................................................

IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()

computeScanRangeLocations() could be very slow for tables
with a large number of partitions. This patch tries to minimize
the use of two expensive function calls:
1. HdfsPartition.getLocation()
  - This looks like a simple property but actually decompresses
    the location string.
  - Was often called indirectly through getFsType().
  - After the patch it is only called once per partition.
2. hadoop.fs.FileSystem.getFileSystem()
  - Hadoop caches the FileSystem object but the key contains
    UserGroupInformation which is obtained with
    UserGroupInformation.getCurrentUser(), making the call costly.
  - As the user is always the same in Impala, we can cache it simply
    by the scheme + authority part of the location URI. After the patch
    getFileSystem() is called only if the scheme/authority differs from
    that of the previous partition, leading to a single call for most
    tables.
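
A minimal sketch of the caching idea in point 2, not the code in this
patch: the expensive lookup is passed in as a function, so no Hadoop
classes are needed. The names FsCacheSketch, resolveFileSystems and
expensiveLookup are hypothetical and exist only for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical illustration; not the actual HdfsScanNode implementation.
class FsCacheSketch {
  /**
   * Resolves one handle per location, calling expensiveLookup only when the
   * scheme or authority differs from that of the previous location.
   */
  static <F> List<F> resolveFileSystems(
      List<String> locations, Function<URI, F> expensiveLookup) {
    List<F> resolved = new ArrayList<>(locations.size());
    String prevScheme = null;
    String prevAuthority = null;
    F prevFs = null;
    boolean first = true;
    for (String location : locations) {
      // Parse the location once and reuse the result for this partition.
      URI uri = URI.create(location);
      boolean sameFs = !first
          && Objects.equals(uri.getScheme(), prevScheme)
          && Objects.equals(uri.getAuthority(), prevAuthority);
      if (!sameFs) {
        // Only this branch pays for the expensive lookup.
        prevFs = expensiveLookup.apply(uri);
        prevScheme = uri.getScheme();
        prevAuthority = uri.getAuthority();
        first = false;
      }
      resolved.add(prevFs);
    }
    return resolved;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    List<String> locations = List.of(
        "hdfs://nn1:8020/warehouse/t/p=1",
        "hdfs://nn1:8020/warehouse/t/p=2",
        "s3a://bucket/warehouse/t/p=3");
    // Stand-in for the real lookup: counts calls and returns a label.
    List<String> fs = resolveFileSystems(locations, uri -> {
      calls[0]++;
      return uri.getScheme() + "://" + uri.getAuthority();
    });
    System.out.println(fs + " lookups=" + calls[0]);
  }
}
```

Remembering only the previous partition's key keeps the sketch free of
any map; if partitions on different file systems are interleaved, the
lookup simply runs again on each switch.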

Note that caching these values in HdfsPartition could also help,
but I preferred to avoid increasing the size of that class.

The patch also changes how we count the number of partitions
per file system (to avoid the extra calls to getFsType()). This
makes class SampledPartitionMetadata unnecessary and reverts some
of the changes in https://gerrit.cloudera.org/#/c/12282/
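
As a rough illustration of counting partitions per file system in a
single pass, assuming the file system kind can be derived from the URI
scheme (an assumption of this sketch, not something the patch states).
PartitionFsCountSketch and countByScheme are hypothetical names.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the actual Impala code.
class PartitionFsCountSketch {
  /** Counts locations per URI scheme while iterating over them once. */
  static Map<String, Integer> countByScheme(List<String> locations) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String location : locations) {
      String scheme = URI.create(location).getScheme();
      // Locations without a scheme are grouped under a placeholder key.
      counts.merge(scheme == null ? "(none)" : scheme, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(countByScheme(List.of(
        "hdfs://nn1:8020/warehouse/t/p=1",
        "hdfs://nn1:8020/warehouse/t/p=2",
        "s3a://bucket/warehouse/t/p=3")));
  }
}
```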

Benchmarks:
Measured using tpcds.store_sales (1824 partitions)
union all'd 256 times:
explain select * from tpcds_parquet.store_sales256;
Before patch: 8.8s
After patch: 1.1s

The improvement is also visible on the full TPC-DS benchmark:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCDS(2) | parquet / none / none | 0.53    | -8.99%     | 0.29       | -10.78%        |
+----------+-----------------------+---------+------------+------------+----------------+
The effect is less significant on higher scale factors.

Testing:
- ran core tests

Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
---
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
6 files changed, 82 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/20434/3
--
To view, visit http://gerrit.cloudera.org:8080/20434
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
Gerrit-Change-Number: 20434
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <[email protected]>
