Hello Daniel Becker, Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/20434

to look at the new patch set (#5).

Change subject: IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()
......................................................................

IMPALA-12408: Optimize HdfsScanNode.computeScanRangeLocations()

computeScanRangeLocations() could be very slow for tables
with a large number of partitions. This patch tries to minimize
the use of two expensive function calls:
1. HdfsPartition.getLocation()
  - This looks like a simple property but actually decompresses
    the location string.
  - Was often called indirectly through getFsType().
  - After the patch it is only called once per partition.
2. hadoop.fs.FileSystem.getFileSystem()
  - Hadoop caches the FileSystem object but the key contains
    UserGroupInformation which is obtained with
    UserGroupInformation.getCurrentUser(), making the call costly.
  - As the user is always the same in Impala, we can simply cache it
    by the scheme + authority part of the location URI. After the patch
    getFileSystem() is called only if the scheme/authority differs from
    that of the previous partition, leading to a single call for most
    tables.
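
The caching idea in point 2 can be sketched roughly as follows. This is a
minimal, standalone illustration and not the code in this patch: the class
FsLookupSketch, the methods expensiveLookup() and resolveAll(), and the use
of a String as the file system handle are all hypothetical stand-ins, chosen
so the example runs without Hadoop on the classpath.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Illustrative sketch only; all names here are hypothetical, not Impala's.
final class FsLookupSketch {
    // Counts how often the stand-in "expensive" lookup runs.
    static int lookups = 0;

    // Hypothetical stand-in for the costly file system lookup.
    static String expensiveLookup(URI uri) {
        lookups++;
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    // Resolves a handle per location, repeating the lookup only when the
    // scheme or authority differs from that of the previous location.
    static List<String> resolveAll(List<String> locations) {
        List<String> handles = new ArrayList<>();
        String prevScheme = null;
        String prevAuthority = null;
        String cached = null;
        for (String location : locations) {
            URI uri = URI.create(location);
            boolean sameFs = cached != null
                && Objects.equals(uri.getScheme(), prevScheme)
                && Objects.equals(uri.getAuthority(), prevAuthority);
            if (!sameFs) {
                cached = expensiveLookup(uri);
                prevScheme = uri.getScheme();
                prevAuthority = uri.getAuthority();
            }
            handles.add(cached);
        }
        return handles;
    }

    public static void main(String[] args) {
        List<String> handles = resolveAll(Arrays.asList(
            "hdfs://nn1:8020/t/p=1",
            "hdfs://nn1:8020/t/p=2",
            "s3a://bucket/t/p=3"));
        System.out.println(handles + " resolved with " + lookups + " lookups");
    }
}
```

Remembering only the previous scheme/authority pair keeps the bookkeeping
to a few local variables. If partitions on different file systems were
interleaved, the lookup would repeat at every switch; a map keyed by
scheme + authority would avoid that at the cost of extra state.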

Note that caching these values in HdfsPartition could also help,
but I preferred to avoid increasing the size of that class.

The patch also changes how we count the number of partitions per
file system (to avoid the extra calls to getFsType()). This makes
class SampledPartitionMetadata unnecessary and reverts some of the
changes in https://gerrit.cloudera.org/#/c/12282/

Benchmarks:
Measured using tpcds.store_sales (1824 partitions)
union all'd 256 times:
explain select * from tpcds_parquet.store_sales256;
Before patch: 8.8s
After patch: 1.1s

The improvement is also visible on full tpcds benchmark:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCDS(2) | parquet / none / none | 0.53    | -8.99%     | 0.29       | -10.78%        |
+----------+-----------------------+---------+------------+------------+----------------+
The effect is less significant on higher scale factors.

Testing:
- ran core tests

Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
---
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java
6 files changed, 80 insertions(+), 85 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/20434/5
--
To view, visit http://gerrit.cloudera.org:8080/20434
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf3e9c169d65c15df6a6762cc68fbb477fe64a7c
Gerrit-Change-Number: 20434
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
