[
https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088145#comment-18088145
]
Joe McDonnell commented on IMPALA-13966:
----------------------------------------
I tested this scenario on a c7a.12xlarge (48 non-hyperthreaded cores) with the
[http://issues.apache.org/jira/browse/IMPALA-14900] patch to allow running with
aggressive decommit off and the
https://issues.apache.org/jira/browse/IMPALA-14702 patch to support Google
TCMalloc. Here is what I see:
||Configuration||Query Time for Normal Pages (s)||Query Time for 1MB Pages (s)||
|Default upstream master|1.95|5.04|
|Aggressive decommit disabled|1.46|1.67|
|Google TCMalloc|1.45|1.67|
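To make the table easier to compare, the slowdown of 1MB pages relative to normal pages can be computed per configuration. This is a minimal sketch that uses only the timings in the table above; the dictionary and function names are illustrative and not part of Impala.

```python
# Query times in seconds, copied from the table above:
# (normal pages, 1MB pages) for each configuration.
QUERY_TIMES_S = {
    "Default upstream master": (1.95, 5.04),
    "Aggressive decommit disabled": (1.46, 1.67),
    "Google TCMalloc": (1.45, 1.67),
}


def large_page_slowdown(normal_s, large_s):
    """Return how many times slower the 1MB-page query is than the normal-page query."""
    return large_s / normal_s


if __name__ == "__main__":
    for config, (normal_s, large_s) in QUERY_TIMES_S.items():
        ratio = large_page_slowdown(normal_s, large_s)
        print(f"{config}: {ratio:.2f}x")
```

With these numbers, the 1MB-page query is about 2.6x slower than the normal-page query on default upstream master, and about 1.15x slower in the other two configurations.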
Some minor differences from the original description:
* Used 10 insert statements rather than 4, so the tables each have 66013365
rows
* Used long_polling_time_ms=100
* Used a single Impalad (which increases the lock contention over 3 impalads
with separate locks)
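The row count above is consistent with one CREATE TABLE AS SELECT followed by 10 inserts. The sketch below checks that arithmetic, assuming tpch.lineitem is the TPC-H scale factor 1 table with 6001215 rows; the scale factor is an assumption, as it is not stated in this comment.

```python
# Assumption: tpch.lineitem is TPC-H scale factor 1, which has 6001215 rows.
LINEITEM_ROWS_SF1 = 6_001_215


def table_rows(num_inserts, rows_per_statement=LINEITEM_ROWS_SF1):
    """Rows after one CREATE TABLE AS SELECT plus num_inserts INSERT ... SELECT statements."""
    return (1 + num_inserts) * rows_per_statement


if __name__ == "__main__":
    print(table_rows(10))  # 66013365, the row count used in this test
    print(table_rows(4))   # 30006075, the row count from the original description
```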
The ParquetDataPagePoolAllocDuration metric shows a major improvement with
either aggressive decommit off or Google TCMalloc (and minimal differences
between those two):
{noformat}
Default upstream master:
Normal pages:
- ParquetDataPagePoolAllocBytes: (Avg: 63.85 KB (65378) ; Min: 4.00 KB
(4096) ; Max: 65.31 KB (66876) ; Sum: 16.29 GB (17493565239) ; Number of
samples: 267573)
- ParquetDataPagePoolAllocDuration: (Avg: 6.689us ; Min: 20.000ns ; Max:
2.318ms ; Sum: 1s789ms ; Number of samples: 267573)
Large pages:
- ParquetDataPagePoolAllocBytes: (Avg: 940.01 KB (962570) ; Min: 9.02 KB
(9237) ; Max: 1.00 MB (1048583) ; Sum: 16.29 GB (17491827653) ; Number of
samples: 18172)
- ParquetDataPagePoolAllocDuration: (Avg: 2.257ms ; Min: 40.000ns ; Max:
32.247ms ; Sum: 41s016ms ; Number of samples: 18172)
Aggressive decommit disabled:
Normal pages:
- ParquetDataPagePoolAllocBytes: (Avg: 63.85 KB (65378) ; Min: 4.00 KB
(4096) ; Max: 65.31 KB (66876) ; Sum: 16.29 GB (17493565239) ; Number of
samples: 267573)
- ParquetDataPagePoolAllocDuration: (Avg: 1.072us ; Min: 20.000ns ; Max:
693.267us ; Sum: 286.929ms ; Number of samples: 267573)
Large pages:
- ParquetDataPagePoolAllocBytes: (Avg: 940.01 KB (962570) ; Min: 9.02 KB
(9237) ; Max: 1.00 MB (1048583) ; Sum: 16.29 GB (17491827653) ; Number of
samples: 18172)
- ParquetDataPagePoolAllocDuration: (Avg: 1.064us ; Min: 40.000ns ; Max:
220.682us ; Sum: 19.345ms ; Number of samples: 18172)
Google TCMalloc:
Normal pages:
- ParquetDataPagePoolAllocBytes: (Avg: 63.85 KB (65378) ; Min: 4.00 KB
(4096) ; Max: 65.31 KB (66876) ; Sum: 16.29 GB (17493565239) ; Number of
samples: 267573)
- ParquetDataPagePoolAllocDuration: (Avg: 1.067us ; Min: 20.000ns ; Max:
83.130us ; Sum: 285.663ms ; Number of samples: 267573)
Large pages:
- ParquetDataPagePoolAllocBytes: (Avg: 940.01 KB (962570) ; Min: 9.02 KB
(9237) ; Max: 1.00 MB (1048583) ; Sum: 16.29 GB (17491827653) ; Number of
samples: 18172)
- ParquetDataPagePoolAllocDuration: (Avg: 1.278us ; Min: 40.000ns ; Max:
27.120us ; Sum: 23.236ms ; Number of samples: 18172){noformat}
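To compare total allocation time across configurations, the Sum values for large pages can be converted to seconds and divided. This is a rough sketch: parse_duration is a hypothetical helper that handles only the duration forms appearing in the output above (such as 41s016ms, 19.345ms, 6.689us), not every format Impala may print.

```python
import re

# Seconds per unit for the duration suffixes seen in the output above.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_duration(text):
    """Convert a duration such as '41s016ms' or '19.345ms' to seconds.

    Hypothetical helper: covers only the forms that appear in the profile output above.
    """
    parts = re.findall(r"(\d+(?:\.\d+)?)(ns|us|ms|s)", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


# (Sum, Number of samples) of ParquetDataPagePoolAllocDuration for large pages,
# copied from the output above.
LARGE_PAGE_ALLOC = {
    "Default upstream master": ("41s016ms", 18172),
    "Aggressive decommit disabled": ("19.345ms", 18172),
    "Google TCMalloc": ("23.236ms", 18172),
}

if __name__ == "__main__":
    baseline_s = parse_duration(LARGE_PAGE_ALLOC["Default upstream master"][0])
    for config, (sum_text, samples) in LARGE_PAGE_ALLOC.items():
        total_s = parse_duration(sum_text)
        mean_us = total_s / samples * 1e6
        print(f"{config}: total {total_s:.6f}s, mean {mean_us:.3f}us, "
              f"default/this = {baseline_s / total_s:.0f}x")
```

Dividing each Sum by its sample count reproduces the reported Avg values (for example, 41.016s over 18172 samples is about 2.257ms), and the total large-page allocation time drops by roughly three orders of magnitude in both alternative configurations.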
There is still a query-time difference between normal pages and 1MB pages
(about 1.45s versus 1.67s), but it doesn't seem related to malloc.
> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
> Key: IMPALA-13966
> URL: https://issues.apache.org/jira/browse/IMPALA-13966
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.5.0
> Reporter: Michael Smith
> Priority: Major
>
> When reading Parquet tables with large average page size under heavy scan
> concurrency, we see performance significantly slow down.
> Impala writes Iceberg tables with its default page size of 64KB, unless
> {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library
> itself defaults to 1MB, and other tools - such as Spark - may use that
> default when writing tables.
> I was able to distill an example that demonstrates a substantial difference
> in memory allocation performance for Parquet reads when using 1MB page sizes
> that is not present for 64KB pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long
> string (hundreds of characters) with a large page size; it's also helpful to
> create a version with the smaller page size. I used the following, once with
> {{write.parquet.page-size-bytes}} specified (iceberg_large_page) and once
> without it (iceberg_small_page):
> {code:sql}
> create table iceberg_large_page stored by iceberg
> tblproperties('write.parquet.page-size-bytes'='1048576') as select *,
> repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;{code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase
> read parallelism. The SSD should be able to handle it. The goal is to have
> as many scanners as possible attempting to load and decompress data at the
> same time, ideally with concurrent memory allocation on every thread.
> # Run a query that doesn't process much data outside the scan and that forces
> Impala to read every entry in the long string column:
> {code}
> select _c1 from (select _c1, l_shipdate from iceberg_small_page where _c1
> like "%toad%" UNION ALL select _c1, l_shipdate from iceberg_small_page where
> _c1 like "%frog%") x ORDER BY l_shipdate LIMIT 10
> {code}
> I also added IMPALA-13487 to display ParquetDataPagePoolAllocDuration, which
> makes slow allocation performance easier to identify. One query was enough to
> show a difference in performance, given sufficient scanner threads to fully
> utilize all DiskIoMgr threads. The small page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max:
> 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ;
> Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max:
> 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ;
> Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize shows the difference in page sizes.
> Our theory is that this represents thread contention attempting to access the
> global pool in tcmalloc. TCMalloc maintains per-thread pools for small
> amounts of memory - up to 256KB - but for larger chunks malloc goes to a
> global pool. If that's right, some possible options that could help are
> 1. Try to re-use buffers more across parquet reads, so we don't need to
> allocate memory as frequently.
> 2. Consider a different memory allocator for larger allocations.
> This likely only impacts very high parallelism read-heavy queries. If each
> buffer is used in more processing, the cost of allocation should become a
> smaller part of the query time.