[
https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944467#comment-17944467
]
Michael Smith commented on IMPALA-13966:
----------------------------------------
[~csringhofer] [~stigahuang] [~joemcdonnell] I think you've all looked at
similar problems to this. Any other or more specific ideas about how to
approach it? Or tickets I've overlooked?
> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
> Key: IMPALA-13966
> URL: https://issues.apache.org/jira/browse/IMPALA-13966
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.5.0
> Reporter: Michael Smith
> Priority: Major
>
> When reading Parquet tables with a large average page size under heavy scan
> concurrency, performance slows down significantly.
> Impala writes Iceberg tables with its default page size of 64KB, unless
> {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library
> itself defaults to 1MB, and other tools - such as Spark - may use that
> default when writing tables.
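> As an aside, if a table written by another tool has inherited the 1MB default,
> the property can be set on the existing Iceberg table so that subsequent
> Impala writes use smaller pages. A minimal sketch, assuming the usual ALTER
> TABLE ... SET TBLPROPERTIES syntax (the table name is hypothetical, and the
> change only affects files written after it):
> {code:java}
> -- Hypothetical table name; setting the property only affects future writes.
> alter table some_db.some_iceberg_table
>   set tblproperties('write.parquet.page-size-bytes'='65536');
> {code}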
> I was able to distill an example that demonstrates a substantial difference
> in memory allocation performance for Parquet reads when using 1MB pages that
> is not present with 64KB pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long
> string (hundreds of characters) and a large page size; it's also helpful to
> create a version with the smaller page size. I used the following with
> {{write.parquet.page-size-bytes}} specified, and the same statements without
> it for {{iceberg_small_page}} (a sketch of that variant follows these steps).
> {code:java}
> create table iceberg_large_page stored by iceberg
>   tblproperties('write.parquet.page-size-bytes'='1048576')
>   as select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
> {code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase
> read parallelism; the SSD should be able to handle it. The goal is to have as
> many scanners as possible loading and decompressing data at the same time,
> ideally with concurrent memory allocation on every thread.
> # Run a query that doesn't process much data outside the scan and forces
> Impala to read every entry in the long string column, for example
> {code}
> select _c1 from (
>   select _c1, l_shipdate from iceberg_small_page where _c1 like "%toad%"
>   UNION ALL
>   select _c1, l_shipdate from iceberg_small_page where _c1 like "%frog%"
> ) x ORDER BY l_shipdate LIMIT 10
> {code}
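> For reference, here is a sketch of the small-page variant mentioned in step 2.
> The statements are assumed to be the same except that
> {{write.parquet.page-size-bytes}} is omitted, so Impala's 64KB default applies;
> the table name {{iceberg_small_page}} matches the query above.
> {code:java}
> -- Small-page variant: no page-size property, so Impala writes 64KB pages.
> create table iceberg_small_page stored by iceberg
>   as select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_small_page select *, repeat(l_comment, 10) from tpch.lineitem;
> -- Repeat the insert to match the row count of iceberg_large_page.
> {code}
> The query above is then run once against each table (substituting
> {{iceberg_large_page}} for the large-page run), producing the profiles below.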
> I also added {{ParquetDataPagePoolAllocDuration}} via IMPALA-13487 to make
> slow allocation performance easier to identify. One query was sufficient to
> show a difference in performance, given enough scanner threads to fully
> utilize all DiskIoMgr threads. The small page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max: 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ; Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max: 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ; Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize shows the difference in page sizes.
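> Working through the numbers above: both queries decompress roughly the same
> ~6.14 GB of page data, but the large-page query spends about 30.3s in page
> allocation versus 2.8s for the small-page query (roughly 11x), even though it
> performs about 12.7x fewer allocations (11022 vs 139620 samples); per
> allocation that is roughly 137x slower on average (2.753ms vs 20.075us).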
> Our theory is that this reflects contention between threads accessing the
> global pool in TCMalloc. TCMalloc maintains per-thread caches for small
> allocations - up to 256KB - but larger allocations go to a global pool. If
> that's right, some options that could help are
> 1. Re-use buffers more across Parquet reads, so we don't need to allocate
> memory as frequently.
> 2. Consider a different memory allocator for larger allocations.
> This likely only impacts read-heavy queries with very high parallelism. If
> each buffer is used in more processing, the cost of allocation should become
> a smaller fraction of the query time.