[jira] [Commented] (IMPALA-13966) Heavy scan concurrency on Parquet tables with large page size is slow

ASF subversion and git services (Jira) Tue, 23 Jun 2026 19:40:08 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091053#comment-18091053
 ]


ASF subversion and git services commented on IMPALA-13966:
----------------------------------------------------------

Commit 048b951f9dcc5cf646773d5f52f2d77c5e497096 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=048b951f9 ]

IMPALA-14900: Add support for turning off aggressive decommit

Impala has used TCMalloc's aggressive decommit setting
for several years, but it increases the OS allocation /
deallocation rate and can lead to contention on TCMalloc's
central structures. There are many pieces of code that
still rely on malloc for their memory, including
performance sensitive pieces of query execution. Retaining
some malloc memory can accelerate those codepaths by
avoiding OS allocation / deallocation cycles. TCMalloc
holds a lock while allocating and deallocating memory,
and retaining memory can also avoid extreme cases with
high lock contention. For example, in IMPALA-13966, we
saw performance issues using 1MB Parquet data pages,
because the large allocations bypass the thread caches
and come directly from the central structures.

There is a long history behind this setting:
 - As of late 2015 / Impala 2.3, Impala would let tcmalloc
   retain memory. It had two mechanisms for releasing
   memory. The first was a periodic check to see if the
   overhead of tcmalloc exceeded the memory used. The
   second was a garbage collection function that ran
   when hitting the process memory limit. Both mechanisms
   would free ALL excess tcmalloc heap memory via a single
   call to ReleaseFreeMemory(). TCMalloc holds a lock for
   the duration of this call, which can stall other work
   until it completes. Freeing dozens of GBs could hold
   the lock for 15 seconds. This issue was reported via
   IMPALA-2800.
 - In IMPALA-3162, Impala moved to gperftools 2.5, which
   had aggressive decommit enabled by default. This frees
   memory immediately, so the mechanisms to free memory
   had nothing to do. This solved IMPALA-2800. The obsolete
   code for the periodic check and garbage collection
   function was removed in IMPALA-5220.
 - Gperftools only had aggressive decommit enabled by
   default for a short period of time. It was enabled by
   default in 2.4 and was disabled by default in 2.6.
 - When Impala upgraded gperftools later, we added code
   to manually set aggressive decommit.

This adds back an option to turn off aggressive decommit.
The shape is similar to the old mechanisms: there is a
background thread doing a periodic check to manage the
memory overhead and a garbage collection function that
gets called when hitting the process memory limit. This
has been redesigned to avoid the issue from IMPALA-2800
(based on an early approach to IMPALA-2800 by Todd Lipcon):
 - Both enforcement locations free a specific amount
   of memory rather than all accumulated memory (i.e. they
   call ReleaseToSystem() with a target amount of memory
   to free). The background thread keeps the overhead
   within the limit specified by the tcmalloc_max_free_bytes
   startup option. This can be an absolute value or a
   percentage of the process memory limit. It defaults to
   5% of the process memory limit. The garbage collection
   function frees enough memory to avoid hitting the
   process memory limit, plus a bit extra (512MB) to avoid
   calling the GC function too frequently.
 - Both enforcement locations free memory in small chunks
   to avoid holding the lock for extended periods of time.
   The chunk size is specified by the tcmalloc_garbage_collection_chunk_size
   startup option and defaults to 10MB.
 - Compared to the old mechanisms, the implementation
   retains significantly less memory and frees it without
   holding the lock for extended periods of time.
 - Other things have changed since then: The buffer pool
   retains memory and frees it gradually over time. This also
   reduces the need for freeing a large amount of memory
   immediately.
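
The sizing and chunking rules above can be sketched as follows.
This is an illustration in Python, not the code from this commit.
The helper names, the "5%" string syntax, and the exact GC
arithmetic are assumptions; release_to_system stands in for
TCMalloc's ReleaseToSystem().

```python
from typing import Callable

MB = 1024 * 1024


def parse_max_free_bytes(value: str, process_mem_limit: int) -> int:
    """Interpret a setting given as absolute bytes or as a percentage.

    The "5%" syntax is an assumption made for this sketch.
    """
    value = value.strip()
    if value.endswith("%"):
        return int(process_mem_limit * float(value[:-1]) / 100)
    return int(value)


def background_target(free_bytes: int, max_free_bytes: int) -> int:
    """Bytes the periodic check would release to get back under the cap."""
    return max(0, free_bytes - max_free_bytes)


def gc_target(bytes_needed: int, extra: int = 512 * MB) -> int:
    """Bytes the GC function would release: the shortfall plus some slack."""
    return bytes_needed + extra


def release_in_chunks(bytes_to_free: int, chunk_size: int,
                      release_to_system: Callable[[int], None]) -> int:
    """Release bytes_to_free through repeated small calls.

    Each call is short, so the allocator lock is never held for long.
    Returns the number of calls made.
    """
    calls = 0
    remaining = bytes_to_free
    while remaining > 0:
        step = min(chunk_size, remaining)
        release_to_system(step)  # hypothetical stand-in for ReleaseToSystem()
        remaining -= step
        calls += 1
    return calls
```

With a 10MB chunk size, releasing 25MB takes three calls of 10MB,
10MB and 5MB, so other threads can take the lock between calls.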

Turning off aggressive decommit is currently incompatible with
the madvise_huge_pages=true startup option. This modifies the
startup check so that aggressive decommit can be false if
madvise_huge_pages is false. A future change may provide a
way to mmap huge buffers to allow these to work together.
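
The rule enforced by the startup check can be sketched like this.
The function name and error text are hypothetical; only the
combination that is rejected comes from the description above.

```python
def validate_tcmalloc_flags(aggressive_decommit: bool,
                            madvise_huge_pages: bool) -> None:
    """Reject the one unsupported combination of the two settings.

    Hypothetical helper: aggressive decommit may be off only when
    madvise_huge_pages is also off.
    """
    if madvise_huge_pages and not aggressive_decommit:
        raise ValueError(
            "aggressive decommit cannot be disabled while "
            "madvise_huge_pages=true")
```

The other three combinations pass the check.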

This adds the --tcmalloc_aggressive_decommit option to
bin/start-impala-cluster.py to make it easier to start
a cluster with this setting. The default value is
determined by the IMPALA_TCMALLOC_AGGRESSIVE_DECOMMIT
environment variable, so this makes it possible to run
cluster tests with this option.
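
One way an option can take its default from an environment
variable is sketched below with argparse. This is not the code
in bin/start-impala-cluster.py: the helper names, the accepted
boolean spellings, and the fallback to "true" when the variable
is unset are all assumptions.

```python
import argparse
import os


def str_to_bool(text: str) -> bool:
    """Parse a boolean flag value (accepted spellings are an assumption)."""
    lowered = text.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise argparse.ArgumentTypeError("expected true or false, got %r" % text)


def build_parser(environ=None) -> argparse.ArgumentParser:
    """Build a parser whose default comes from the environment."""
    environ = os.environ if environ is None else environ
    # Assumption: treat an unset variable as "true".
    env_default = environ.get("IMPALA_TCMALLOC_AGGRESSIVE_DECOMMIT", "true")
    parser = argparse.ArgumentParser()
    parser.add_argument("--tcmalloc_aggressive_decommit",
                        type=str_to_bool,
                        default=str_to_bool(env_default))
    return parser
```

An explicit --tcmalloc_aggressive_decommit=true on the command
line overrides the environment default in this sketch.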

Testing:
 - Added a custom cluster test to run TPC-DS with tcmalloc
   aggressive decommit off
 - Ran a core job with IMPALA_TCMALLOC_AGGRESSIVE_DECOMMIT=false
 - Ran the scenario from IMPALA-13966 and verified that turning
   off aggressive decommit avoids the issues.

Change-Id: If6022f14093f362a5de9a854f4f4496c90b049b8
Reviewed-on: http://gerrit.cloudera.org:8080/24402
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-13966
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13966
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.5.0
>            Reporter: Michael Smith
>            Priority: Major
>
> When reading Parquet tables with a large average page size under heavy scan 
> concurrency, performance slows down significantly.
> Impala writes Iceberg tables with its default page size of 64KB, unless 
> {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library 
> itself defaults to 1MB, and other tools - such as Spark - may use that 
> default when writing tables.
> I was able to distill an example that demonstrates a substantial difference 
> in memory allocation performance for parquet reads when using 1MB page sizes, 
> that is not present for 64KB pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long 
> string (hundreds of characters) with a large page size; it's also helpful to 
> create a version with the smaller page size. I used the following, once with 
> {{write.parquet.page-size-bytes}} specified (iceberg_large_page) and once 
> without it (iceberg_small_page)
> {code:java}
> create table iceberg_large_page stored by iceberg 
> tblproperties('write.parquet.page-size-bytes'='1048576') as select *, 
> repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;{code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase 
> read parallelism. The SSD should be able to handle it. The goal is to ensure 
> we have as many scanners attempting to load and decompress data at the same 
> time, with ideally concurrent memory allocation on every thread.
> # Run a query that doesn't process much data outside the scan, and forces 
> Impala to read every entry in the long string column
> {code}
> select _c1 from (select _c1, l_shipdate from iceberg_small_page where _c1 
> like "%toad%" UNION ALL select _c1, l_shipdate from iceberg_small_page where 
> _c1 like "%frog%") x ORDER BY l_shipdate LIMIT 10
> {code}
> I also added IMPALA-13487 to display ParquetDataPagePoolAllocDuration to 
> simplify identifying slow allocation performance. One query was sufficient to 
> show some difference in performance, with sufficient scanner threads to fully 
> utilize all DiskIoMgr threads. The small page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max: 
> 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ; 
> Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max: 
> 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ; 
> Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize shows the difference in page sizes.
> Our theory is that this represents thread contention attempting to access the 
> global pool in tcmalloc. TCMalloc maintains per-thread pools for small 
> amounts of memory - up to 256KB - but for larger chunks malloc goes to a 
> global pool. If that's right, some possible options that could help are
> 1. Try to re-use buffers more across parquet reads, so we don't need to 
> allocate memory as frequently.
> 2. Consider a different memory allocator for larger allocations.
> This likely only impacts very high parallelism read-heavy queries. If each 
> buffer is used in more processing, the cost of allocation should become a 
> smaller part of the query time.
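
As a quick check on the ParquetDataPagePoolAllocDuration counters
quoted above, the averages can be recomputed from the reported
sums and sample counts. The variable names are illustrative, and
the inputs are the rounded values from the profile, so the results
match the reported averages only approximately.

```python
def mean_seconds(total_seconds: float, samples: int) -> float:
    """Average duration per allocation, in seconds."""
    return total_seconds / samples


# Sum and sample count from the small page (64KB) profile.
small_avg = mean_seconds(2.802, 139620)
# Sum and sample count from the large page (1MB) profile.
large_avg = mean_seconds(30.346, 11022)

slowdown = large_avg / small_avg
print("small page avg: %.2f us" % (small_avg * 1e6))
print("large page avg: %.3f ms" % (large_avg * 1e3))
print("per-allocation slowdown: %.0fx" % slowdown)
```

Each large page allocation takes on the order of a hundred times
longer than a small page allocation, even though the large page
query makes far fewer allocations over the same 6.14 GB of data.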



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
