[
https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18091611#comment-18091611
]
ASF subversion and git services commented on IMPALA-13966:
----------------------------------------------------------
Commit afc7224bd5a4257c9ce1a6f35141772fd3139838 in impala's branch
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=afc7224bd ]
IMPALA-14702: Add ability to build against Google Tcmalloc
Impala currently uses Gperftools TCMalloc, which was originally
developed by Google but is now its own open source community.
Google continued development internally and created a new open
source project with their improved version. The biggest changes
are:
- Google TCMalloc uses Linux RSEQ functionality to use CPU
caches rather than thread caches. This avoids stranding memory
in inactive threads. It also avoids work when threads start
and stop.
- Google TCMalloc adds native huge page support. It backs most
allocations with huge pages, which can reduce TLB misses.
There are many other changes across many other areas, including
profiling and NUMA support.
This adds support for building against Google TCMalloc. It is
currently controlled by the IMPALA_MALLOC_IMPL environment
variable, which defaults to "gperftools". When set to
"googletcmalloc", it builds against Google TCMalloc. This is
using a custom CMake build of Google TCMalloc with a couple of
patches to make it work. Unlike the regular Google TCMalloc,
this uses madvise() with MADV_HUGEPAGE to allow it to function
on systems with only madvise huge page support. Google TCMalloc
requires Abseil, so this adds an Abseil dependency.
Google TCMalloc retains unused memory, and Impala uses the same
integration points as gperftools with aggressive decommit off.
We start a background thread that periodically releases memory.
Unlike gperftools, Google TCMalloc provides a
MallocExtension::ProcessBackgroundActions() function that does
various maintenance actions and releases memory periodically
to control the memory overhead. Rather than implementing our
own logic, we use that logic and rely on its decisions about
retaining memory. We also register a garbage collection function
to free memory immediately when hitting the process memory limit.
Since Google TCMalloc is aware of huge pages, this changes the
buffer pool's madvise_huge_page to avoid using madvise() when
the malloc implementation natively supports huge pages.
Google TCMalloc's per-CPU caches rely on RSEQ support, and
its use of RSEQ currently conflicts with glibc's use of
RSEQ. This disables glibc's use of RSEQ by setting
GLIBC_TUNABLES=glibc.pthread.rseq=0 when using Google TCMalloc
in the dev environment.
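
As an illustration of the two environment settings described above, here is a
minimal Python sketch. The helper names (with_glibc_tunable, malloc_build_env)
are hypothetical and are not part of the Impala build scripts; the only facts
taken from the commit are the IMPALA_MALLOC_IMPL values and the
glibc.pthread.rseq=0 tunable. It relies on GLIBC_TUNABLES being a
colon-separated list of name=value pairs, so an existing value is extended
rather than overwritten.

```python
def with_glibc_tunable(existing: str, name: str, value: str) -> str:
    """Return a GLIBC_TUNABLES string with `name` set to `value`.

    GLIBC_TUNABLES is a colon-separated list of name=value pairs. Any
    previous setting of `name` is dropped; other tunables are kept.
    """
    kept = [t for t in existing.split(":")
            if t and t.split("=", 1)[0] != name]
    return ":".join(kept + [f"{name}={value}"])


def malloc_build_env(base_env: dict, malloc_impl: str) -> dict:
    """Hypothetical helper: build an environment for the chosen malloc."""
    env = dict(base_env)
    env["IMPALA_MALLOC_IMPL"] = malloc_impl
    if malloc_impl == "googletcmalloc":
        # Assumption for this sketch: glibc's RSEQ registration is only
        # disabled when Google TCMalloc is selected.
        env["GLIBC_TUNABLES"] = with_glibc_tunable(
            env.get("GLIBC_TUNABLES", ""), "glibc.pthread.rseq", "0")
    return env
```

Calling malloc_build_env(os.environ, "googletcmalloc") would yield a copy of
the current environment with both variables set, leaving any other tunables
already present in GLIBC_TUNABLES in place.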
There will be future changes to package this properly.
Testing:
- Ran a core job with IMPALA_MALLOC_IMPL=googletcmalloc
- Tested the scenario from IMPALA-13966 (performance issues with
1MB Parquet data pages) and verified that Google TCMalloc
does not see this issue.
Change-Id: I5a84eacb66eb0a216bfb2159542a0d7e4ddf8ec2
Reviewed-on: http://gerrit.cloudera.org:8080/24403
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
> Key: IMPALA-13966
> URL: https://issues.apache.org/jira/browse/IMPALA-13966
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.5.0
> Reporter: Michael Smith
> Priority: Major
>
> When reading Parquet tables with large average page size under heavy scan
> concurrency, we see performance significantly slow down.
> Impala writes Iceberg tables with its default page size of 64KB, unless
> {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library
> itself defaults to 1MB, and other tools - such as Spark - may use that
> default when writing tables.
> I was able to distill an example that demonstrates a substantial difference
> in memory allocation performance for parquet reads when using 1MB page sizes,
> that is not present for 64KB pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long
> string (hundreds of characters) with a large page size; it's also helpful to
> create a version with the smaller page size. I used the following, both with
> {{write.parquet.page-size-bytes}} specified and without it (iceberg_small_page):
> {code:java}
> create table iceberg_large_page stored by iceberg
> tblproperties('write.parquet.page-size-bytes'='1048576') as select *,
> repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from
> tpch.lineitem;{code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase
> read parallelism. The SSD should be able to handle it. The goal is to ensure
> we have as many scanners attempting to load and decompress data at the same
> time, with ideally concurrent memory allocation on every thread.
> # Run a query that doesn't process much data outside the scan, and forces
> Impala to read every entry in the long string column
> {code}
> select _c1 from (select _c1, l_shipdate from iceberg_small_page where _c1
> like "%toad%" UNION ALL select _c1, l_shipdate from iceberg_small_page where
> _c1 like "%frog%") x ORDER BY l_shipdate LIMIT 10
> {code}
> I also added IMPALA-13487 to display ParquetDataPagePoolAllocDuration to
> simplify identifying slow allocation performance. One query was sufficient to
> show some difference in performance, with sufficient scanner threads to fully
> utilize all DiskIoMgr threads. The small page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max:
> 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ;
> Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max:
> 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ;
> Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize shows the difference in page sizes.
> Our theory is that this represents thread contention attempting to access the
> global pool in tcmalloc. TCMalloc maintains per-thread pools for small
> amounts of memory - up to 256KB - but for larger chunks malloc goes to a
> global pool. If that's right, some possible options that could help are
> 1. Try to re-use buffers more across parquet reads, so we don't need to
> allocate memory as frequently.
> 2. Consider a different memory allocator for larger allocations.
> This likely only impacts very high parallelism read-heavy queries. If each
> buffer is used in more processing, the cost of allocation should become a
> smaller part of the query time.
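
The ParquetDataPagePoolAllocDuration counters quoted above can be compared
directly. The following is a small Python sketch that only does arithmetic on
the reported values; the variable and function names are illustrative, and
nothing here measures Impala itself.

```python
# Counter values copied from the two query profiles quoted above.
small_page = {"avg_us": 20.075, "sum_s": 2.802, "samples": 139620}
large_page = {"avg_us": 2753.0, "sum_s": 30.346, "samples": 11022}


def mean_alloc_us(profile: dict) -> float:
    """Recompute the mean allocation time (microseconds) from Sum / samples."""
    return profile["sum_s"] * 1e6 / profile["samples"]


# How much slower a single allocation is with 1MB pages.
avg_ratio = large_page["avg_us"] / small_page["avg_us"]
# How much more total time is spent allocating with 1MB pages.
sum_ratio = large_page["sum_s"] / small_page["sum_s"]
# How many more allocations the 64KB-page query performs.
sample_ratio = small_page["samples"] / large_page["samples"]

print(f"per-allocation slowdown: {avg_ratio:.1f}x")
print(f"total allocation time:   {sum_ratio:.1f}x")
print(f"fewer allocations:       {sample_ratio:.1f}x")
```

With these values, an average allocation is roughly 137 times slower with 1MB
pages, and total allocation time is roughly 10.8 times higher even though the
large page query makes about 12.7 times fewer allocations.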