[
https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091053#comment-18091053
]
ASF subversion and git services commented on IMPALA-13966:
----------------------------------------------------------
Commit 048b951f9dcc5cf646773d5f52f2d77c5e497096 in impala's branch
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=048b951f9 ]
IMPALA-14900: Add support for turning off aggressive decommit
Impala has used TCMalloc's aggressive decommit setting
for several years, but it increases the OS allocation /
deallocation rate and can lead to contention on TCMalloc's
central structures. There are many pieces of code that
still rely on malloc for their memory, including
performance sensitive pieces of query execution. Retaining
some malloc memory can accelerate those codepaths by
avoiding OS allocation / deallocation cycles. TCMalloc
holds a lock while allocating and deallocating memory,
and retaining memory can also avoid extreme cases with
high lock contention. For example, in IMPALA-13966, we
saw performance issues using 1MB Parquet data pages,
because the large allocations bypass the thread caches
and come directly from the central structures.
There is a long history behind this setting:
- As of late 2015 / Impala 2.3, Impala would let tcmalloc
retain memory. It had two mechanisms for releasing
memory. The first was a periodic check to see if the
overhead of tcmalloc exceeded the memory used. The
second was a garbage collection function that ran
when hitting the process memory limit. Both mechanisms
would free ALL excess tcmalloc heap memory via a single
call to ReleaseFreeMemory(). TCMalloc holds a lock
during this call, which can stall other work until it
completes. Freeing dozens of GBs this way could hold
the lock for 15 seconds. This issue was reported via
IMPALA-2800.
- In IMPALA-3162, Impala moved to gperftools 2.5, which
had aggressive decommit enabled by default. This frees
memory immediately, so the mechanisms to free memory
had nothing to do. This solved IMPALA-2800. The obsolete
code for the periodic check and garbage collection
function was removed in IMPALA-5220.
- Gperftools only had aggressive decommit enabled by
default for a short period of time. It was enabled by
default in 2.4 and was disabled by default in 2.6.
- When Impala upgraded gperftools later, we added code
to manually set aggressive decommit.
This adds back an option to turn off aggressive decommit.
The shape is similar to the old mechanisms: there is a
background thread doing a periodic check to manage the
memory overhead and a garbage collection function that
gets called when hitting the process memory limit. This
has been redesigned to avoid the issue from IMPALA-2800
(based on an early approach to IMPALA-2800 by Todd Lipcon):
- Both enforcement locations free a specific amount of
memory rather than all accumulated memory (i.e. they
call ReleaseToSystem() with a target amount of memory
to free). The background thread maintains an overhead
specified by the tcmalloc_max_free_bytes startup option.
This can be an absolute value or a percentage of the
process memory limit. It defaults to 5% of the process
memory limit. The garbage collection function frees
enough memory to avoid hitting the process memory
limit, plus a bit extra (512MB) to avoid calling the GC
function too frequently.
- Both enforcement locations free memory in small chunks
to avoid holding the lock for extended periods of time.
The chunk size is specified by the tcmalloc_garbage_collection_chunk_size
startup option and defaults to 10MB.
- The implementation retains significantly less memory and
frees it without holding the lock for extended periods of
time.
- Other things have changed since then: The buffer pool
retains memory and frees it gradually over time. This also
reduces the need for freeing a large amount of memory
immediately.
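The release logic described in the list above can be sketched roughly as
follows. This is a Python illustration only, not the C++ code from the
change: the function names, the "5%" string format for
tcmalloc_max_free_bytes, and the release_to_system callback (standing in
for TCMalloc's ReleaseToSystem()) are all assumptions made for the sketch.
It also assumes each call releases exactly the requested number of bytes.

```python
from typing import Callable

MB = 1024 * 1024
GC_EXTRA_BYTES = 512 * MB        # extra amount freed by the GC function
DEFAULT_CHUNK_SIZE = 10 * MB     # default chunk size from the commit message


def resolve_max_free_bytes(setting: str, process_mem_limit: int) -> int:
    """Interpret the setting as a percentage of the limit ("5%") or as bytes.

    The string format is hypothetical; the real flag parsing may differ.
    """
    setting = setting.strip()
    if setting.endswith("%"):
        return int(process_mem_limit * float(setting[:-1]) / 100)
    return int(setting)


def background_release_target(free_bytes: int, max_free_bytes: int) -> int:
    """Bytes the periodic check should release to get back under the cap."""
    return max(0, free_bytes - max_free_bytes)


def gc_release_target(bytes_needed: int) -> int:
    """Bytes the GC function should release: the shortfall plus some slack."""
    return bytes_needed + GC_EXTRA_BYTES


def release_in_chunks(target: int, chunk_size: int,
                      release_to_system: Callable[[int], None]) -> int:
    """Release `target` bytes in pieces of at most `chunk_size` bytes.

    Each callback invocation models one short hold of the allocator lock.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    released = 0
    while released < target:
        chunk = min(chunk_size, target - released)
        release_to_system(chunk)
        released += chunk
    return released
```

For example, release_in_chunks(25 * MB, DEFAULT_CHUNK_SIZE, callback) would
invoke the callback three times (10MB, 10MB, 5MB) instead of once with the
full amount.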
Turning off aggressive decommit is currently incompatible with
the madvise_huge_pages=true startup option. This modifies the
startup check so that aggressive decommit can be false if
madvise_huge_pages is false. A future change may provide a
way to mmap huge buffers to allow these to work together.
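As a rough illustration of the modified startup check, the only rejected
combination is aggressive decommit off with madvise_huge_pages on. The
function name and error text below are hypothetical; the real check lives
in Impala's C++ startup code.

```python
def check_decommit_flags(aggressive_decommit: bool,
                         madvise_huge_pages: bool) -> None:
    """Reject the one flag combination described as incompatible."""
    if madvise_huge_pages and not aggressive_decommit:
        raise ValueError(
            "madvise_huge_pages=true requires aggressive decommit to be enabled")
```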
This adds the --tcmalloc_aggressive_decommit option to
bin/start-impala-cluster.py to make it easier to start up
the cluster. The default value is determined by the
IMPALA_TCMALLOC_AGGRESSIVE_DECOMMIT environment variable,
so this makes it possible to run cluster tests with this
option.
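One way such an environment-driven default could be wired up is sketched
below. This is not the actual code in bin/start-impala-cluster.py: the use
of argparse, the helper names, the accepted boolean spellings, and the
fallback to "true" when the variable is unset are all assumptions.

```python
import argparse
import os


def parse_bool(value: str) -> bool:
    """Parse "true"/"false" (case-insensitive) into a bool."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError("expected true or false, got %r" % value)


def build_parser(environ=None) -> argparse.ArgumentParser:
    """Build a parser whose default comes from the environment variable."""
    environ = os.environ if environ is None else environ
    # Assumed fallback: aggressive decommit stays on when the variable is unset.
    default = parse_bool(
        environ.get("IMPALA_TCMALLOC_AGGRESSIVE_DECOMMIT", "true"))
    parser = argparse.ArgumentParser()
    parser.add_argument("--tcmalloc_aggressive_decommit",
                        type=parse_bool, default=default)
    return parser
```

With this shape, an explicit command-line value overrides the environment,
while test jobs can flip the default by exporting the variable.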
Testing:
- Added a custom cluster test to run TPC-DS with tcmalloc
aggressive decommit off
- Ran a core job with IMPALA_TCMALLOC_AGGRESSIVE_DECOMMIT=false
- Ran the scenario from IMPALA-13966 and verified that turning
off aggressive decommit avoids the issues.
Change-Id: If6022f14093f362a5de9a854f4f4496c90b049b8
Reviewed-on: http://gerrit.cloudera.org:8080/24402
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
> Key: IMPALA-13966
> URL: https://issues.apache.org/jira/browse/IMPALA-13966
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.5.0
> Reporter: Michael Smith
> Priority: Major
>
> When reading Parquet tables with large average page size under heavy scan
> concurrency, we see performance slow down significantly.
> Impala writes Iceberg tables with its default page size of 64KB, unless
> {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library
> itself defaults to 1MB, and other tools - such as Spark - may use that
> default when writing tables.
> I was able to distill an example that demonstrates substantially slower
> memory allocation for Parquet reads with 1MB page sizes than with 64KB
> pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long
> string (hundreds of characters) with a large page size; it's also helpful to
> create a version with the smaller page size. I used the following, once with
> {{write.parquet.page-size-bytes}} specified and once without it
> (iceberg_small_page):
> {code:java}
> create table iceberg_large_page stored by iceberg
> tblproperties('write.parquet.page-size-bytes'='1048576') as select *,
> repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;{code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase
> read parallelism. The SSD should be able to handle it. The goal is to ensure
> we have as many scanners as possible attempting to load and decompress data
> at the same time, ideally with concurrent memory allocation on every thread.
> # Run a query that doesn't process much data outside the scan, and forces
> Impala to read every entry in the long string column
> {code}
> select _c1 from (select _c1, l_shipdate from iceberg_small_page where _c1
> like "%toad%" UNION ALL select _c1, l_shipdate from iceberg_small_page where
> _c1 like "%frog%") x ORDER BY l_shipdate LIMIT 10
> {code}
> I also added IMPALA-13487 to display ParquetDataPagePoolAllocDuration to
> simplify identifying slow allocation performance. One query was sufficient to
> show some difference in performance, with sufficient scanner threads to fully
> utilize all DiskIoMgr threads. The small page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max:
> 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ;
> Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max:
> 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ;
> Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize shows the difference in page sizes.
> Our theory is that this represents thread contention attempting to access the
> global pool in tcmalloc. TCMalloc maintains per-thread pools for small
> amounts of memory - up to 256KB - but for larger chunks malloc goes to a
> global pool. If that's right, some possible options that could help are
> 1. Try to re-use buffers more across parquet reads, so we don't need to
> allocate memory as frequently.
> 2. Consider a different memory allocator for larger allocations.
> This likely only impacts very high parallelism read-heavy queries. If each
> buffer is used in more processing, the cost of allocation should become a
> smaller part of the query time.