[jira] [Commented] (IMPALA-13966) Heavy scan concurrency on Parquet tables with large page size is slow

Joe McDonnell (Jira) Wed, 10 Jun 2026 18:11:18 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088145#comment-18088145
 ]


Joe McDonnell commented on IMPALA-13966:
----------------------------------------

I tested this scenario on a c7a.12xlarge (48 non-hyperthreaded cores) with two 
patches applied: [http://issues.apache.org/jira/browse/IMPALA-14900], which 
allows running with aggressive decommit off, and 
[https://issues.apache.org/jira/browse/IMPALA-14702], which adds support for 
Google TCMalloc. Here is what I see:
||Configuration||Query Time for Normal Pages (s)||Query Time for 1MB Pages (s)||
|Default upstream master|1.95|5.04|
|Aggressive decommit disabled|1.46|1.67|
|Google TCMalloc|1.45|1.67|

Some minor differences from the original description:
 * Used 10 insert statements rather than 4, so the tables each have 66013365 
rows
 * Used long_polling_time_ms=100
 * Used a single Impalad (which increases lock contention compared with 3 
impalads, each with separate locks)

The ParquetDataPagePoolAllocDuration metric shows a major improvement with 
either aggressive decommit off or Google TCMalloc (and minimal differences 
between those two):
{noformat}
Default upstream master:
  Normal pages:
   - ParquetDataPagePoolAllocBytes: (Avg: 63.85 KB (65378) ; Min: 4.00 KB 
(4096) ; Max: 65.31 KB (66876) ; Sum: 16.29 GB (17493565239) ; Number of 
samples: 267573)
   - ParquetDataPagePoolAllocDuration: (Avg: 6.689us ; Min: 20.000ns ; Max: 
2.318ms ; Sum: 1s789ms ; Number of samples: 267573)

  Large pages:
   - ParquetDataPagePoolAllocBytes: (Avg: 940.01 KB (962570) ; Min: 9.02 KB 
(9237) ; Max: 1.00 MB (1048583) ; Sum: 16.29 GB (17491827653) ; Number of 
samples: 18172)
   - ParquetDataPagePoolAllocDuration: (Avg: 2.257ms ; Min: 40.000ns ; Max: 
32.247ms ; Sum: 41s016ms ; Number of samples: 18172)

Aggressive decommit disabled:
  Normal pages:
   - ParquetDataPagePoolAllocBytes: (Avg: 63.85 KB (65378) ; Min: 4.00 KB 
(4096) ; Max: 65.31 KB (66876) ; Sum: 16.29 GB (17493565239) ; Number of 
samples: 267573)
   - ParquetDataPagePoolAllocDuration: (Avg: 1.072us ; Min: 20.000ns ; Max: 
693.267us ; Sum: 286.929ms ; Number of samples: 267573)

  Large pages:
   - ParquetDataPagePoolAllocBytes: (Avg: 940.01 KB (962570) ; Min: 9.02 KB 
(9237) ; Max: 1.00 MB (1048583) ; Sum: 16.29 GB (17491827653) ; Number of 
samples: 18172)
   - ParquetDataPagePoolAllocDuration: (Avg: 1.064us ; Min: 40.000ns ; Max: 
220.682us ; Sum: 19.345ms ; Number of samples: 18172)

Google TCMalloc:
  Normal pages:
   - ParquetDataPagePoolAllocBytes: (Avg: 63.85 KB (65378) ; Min: 4.00 KB 
(4096) ; Max: 65.31 KB (66876) ; Sum: 16.29 GB (17493565239) ; Number of 
samples: 267573)
   - ParquetDataPagePoolAllocDuration: (Avg: 1.067us ; Min: 20.000ns ; Max: 
83.130us ; Sum: 285.663ms ; Number of samples: 267573)

  Large pages:
   - ParquetDataPagePoolAllocBytes: (Avg: 940.01 KB (962570) ; Min: 9.02 KB 
(9237) ; Max: 1.00 MB (1048583) ; Sum: 16.29 GB (17491827653) ; Number of 
samples: 18172)
   - ParquetDataPagePoolAllocDuration: (Avg: 1.278us ; Min: 40.000ns ; Max: 
27.120us ; Sum: 23.236ms ; Number of samples: 18172){noformat}
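There is still a difference in query time between normal pages and 1MB pages 
(about 1.46s vs. 1.67s), but it doesn't seem related to malloc: with either 
change, large-page allocation time sums to only about 19-23ms.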
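
As a rough cross-check of the counters above, the following sketch recomputes the average allocation time from each Sum and sample count, and compares the summed allocation time against default upstream master. The durations and counts are copied from the profiles in this comment; the names PROFILES, avg_alloc_us, and reduction_factor are illustrative only and are not part of Impala.

```python
# Sum of ParquetDataPagePoolAllocDuration (in seconds) and number of samples,
# copied from the profiles above. Keys are (configuration, page kind).
PROFILES = {
    ("Default upstream master", "normal"): (1.789, 267573),
    ("Default upstream master", "large"): (41.016, 18172),
    ("Aggressive decommit disabled", "normal"): (0.286929, 267573),
    ("Aggressive decommit disabled", "large"): (0.019345, 18172),
    ("Google TCMalloc", "normal"): (0.285663, 267573),
    ("Google TCMalloc", "large"): (0.023236, 18172),
}


def avg_alloc_us(total_seconds, samples):
    """Average time per allocation, in microseconds."""
    return total_seconds / samples * 1e6


def reduction_factor(config, page_kind, baseline="Default upstream master"):
    """How many times smaller the summed allocation time is than the baseline's."""
    return PROFILES[(baseline, page_kind)][0] / PROFILES[(config, page_kind)][0]


if __name__ == "__main__":
    for (config, kind), (total, samples) in PROFILES.items():
        print(f"{config:30s} {kind:6s} "
              f"avg={avg_alloc_us(total, samples):10.3f}us "
              f"reduction={reduction_factor(config, kind):8.1f}x")
```

The recomputed averages should agree with the reported Avg values up to rounding of the printed Sum. For large pages the summed allocation time drops by about three orders of magnitude with either change, while for normal pages the drop is closer to 6x.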

> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-13966
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13966
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.5.0
>            Reporter: Michael Smith
>            Priority: Major
>
> When reading Parquet tables with large average page size under heavy scan 
> concurrency, performance slows down significantly.
> Impala writes Iceberg tables with its default page size of 64KB, unless 
> {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library 
> itself defaults to 1MB, and other tools - such as Spark - may use that 
> default when writing tables.
> I was able to distill an example that demonstrates a substantial difference 
> in memory allocation performance for parquet reads when using 1MB page sizes, 
> that is not present for 64KB pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long 
> string (hundreds of characters) with a large page size; it's also helpful to 
> create a version with the smaller page size. I used the following, once with 
> {{write.parquet.page-size-bytes}} specified (iceberg_large_page) and once 
> without it (iceberg_small_page):
> {code:sql}
> create table iceberg_large_page stored by iceberg 
> tblproperties('write.parquet.page-size-bytes'='1048576') as select *, 
> repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;{code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase 
> read parallelism. The SSD should be able to handle it. The goal is to have as 
> many scanners as possible loading and decompressing data at the same time, 
> ideally with concurrent memory allocation on every thread.
> # Run a query that doesn't process much data outside the scan and forces 
> Impala to read every entry in the long string column (shown here against 
> iceberg_small_page; substitute iceberg_large_page for the large page run):
> {code}
> select _c1 from (select _c1, l_shipdate from iceberg_small_page where _c1 
> like "%toad%" UNION ALL select _c1, l_shipdate from iceberg_small_page where 
> _c1 like "%frog%") x ORDER BY l_shipdate LIMIT 10
> {code}
> I also added IMPALA-13487 to display ParquetDataPagePoolAllocDuration, which 
> makes slow allocation performance easier to identify. A single query was 
> enough to show a difference in performance, given enough scanner threads to 
> fully utilize all DiskIoMgr threads. The small page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max: 
> 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ; 
> Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max: 
> 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ; 
> Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize shows the difference in page sizes.
> Our theory is that this reflects thread contention on the global pool in 
> tcmalloc. TCMalloc maintains per-thread pools for small allocations (up to 
> 256KB), but larger allocations go to a global pool. If that's right, some 
> possible options that could help are:
> 1. Try to re-use buffers more across parquet reads, so we don't need to 
> allocate memory as frequently.
> 2. Consider a different memory allocator for larger allocations.
> This likely only impacts very high parallelism read-heavy queries. If each 
> buffer is used in more processing, the cost of allocation should become a 
> smaller part of the query time.
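
Option 1 in the description above (re-using buffers across Parquet reads) could look roughly like the sketch below. It is a hypothetical illustration, not Impala's implementation: the PageBufferPool class, the 64KB rounding granule, and the per-bucket cache limit are all assumptions made for the example. It also assumes one pool per scanner thread, so the pool itself needs no locking.

```python
from collections import defaultdict

GRANULE = 64 * 1024  # assumed rounding granule; not taken from Impala


class PageBufferPool:
    """Hypothetical per-thread free list of page buffers, bucketed by size."""

    def __init__(self, max_cached_per_bucket=4):
        self._free = defaultdict(list)  # bucket size -> list of idle buffers
        self._max_cached = max_cached_per_bucket
        self.allocations = 0  # requests that needed a fresh allocation
        self.reuses = 0       # requests served from the free list

    @staticmethod
    def _bucket(size):
        # Round up to a multiple of GRANULE so that pages of similar size
        # can share buffers; a zero-byte request still gets one granule.
        return max(1, -(-size // GRANULE)) * GRANULE

    def acquire(self, size):
        """Return a buffer of at least `size` bytes, reusing one if possible."""
        bucket = self._bucket(size)
        free = self._free[bucket]
        if free:
            self.reuses += 1
            return free.pop()
        self.allocations += 1
        return bytearray(bucket)

    def release(self, buf):
        """Return a buffer obtained from acquire(); cache it if there is room."""
        free = self._free[len(buf)]
        if len(free) < self._max_cached:
            free.append(buf)
        # Otherwise drop the buffer and let it be freed normally.
```

With a pool like this, a scanner that decompresses one page after another would call acquire() before each page and release() after it, so most large allocations never reach the allocator. The cost is extra memory held in the free lists, which is why the sketch caps the number of cached buffers per bucket.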



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
