[
https://issues.apache.org/jira/browse/IMPALA-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772625#comment-17772625
]
Riza Suminto commented on IMPALA-11068:
---------------------------------------
IIRC, once a scanner thread is launched, it is committed to reading that scan
range. There is no way to retry or put the failed scan range back into the queue.
> Query hit OOM under high decompression activity
> -----------------------------------------------
>
> Key: IMPALA-11068
> URL: https://issues.apache.org/jira/browse/IMPALA-11068
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Riza Suminto
> Assignee: Riza Suminto
> Priority: Major
>
> A customer reported a query hitting OOM over a wide table with heavy
> decompression activity. The Impala cluster was running with scanner thread
> parallelism (MT_DOP=0).
> The following is the error message shown:
> {code:java}
> Errors: Memory limit exceeded: ParquetColumnChunkReader::InitDictionary()
> failed to allocate 969825 bytes for dictionary.
> HDFS_SCAN_NODE (id=0) could not allocate 947.09 KB without exceeding limit.
> Error occurred on backend [redacted]:22000 by fragment
> d346730dc3a3771e:c24e3ccf00000008
> Memory left in process limit: 233.77 GB
> Memory left in query limit: 503.51 KB
> Query(d346730dc3a3771e:c24e3ccf00000000): Limit=4.13 GB Reservation=3.30 GB
> ReservationLimit=3.30 GB OtherMemory=849.17 MB Total=4.13 GB Peak=4.13 GB
> Fragment d346730dc3a3771e:c24e3ccf00000008: Reservation=3.30 GB
> OtherMemory=849.59 MB Total=4.13 GB Peak=4.13 GB{code}
>
> I looked at the corresponding profile of the fragment and noticed some key
> counters, as follows:
> {code:java}
> Instance d346730dc3a3771e:c24e3ccf00000008 (host=[redacted]:22000)
> ...
> HDFS_SCAN_NODE (id=0)
> ...
> - AverageHdfsReadThreadConcurrency: 8.00 (8.0)
> - AverageScannerThreadConcurrency: 23.00 (23.0)
> - BytesRead: 2.4 GiB (2619685502)
> ...
> - NumScannerThreadMemUnavailable: 1 (1)
> - NumScannerThreadReservationsDenied: 0 (0)
> - NumScannerThreadsStarted: 23 (23)
> - NumScannersWithNoReads: 12 (12)
> - NumStatsFilteredPages: 4,032 (4032)
> - NumStatsFilteredRowGroups: 1 (1)
> - PeakMemoryUsage: 4.1 GiB (4431745197)
> - PeakScannerThreadConcurrency: 23 (23)
> - PerReadThreadRawHdfsThroughput: 842.1 MiB/s (882954163)
> - RemoteScanRanges: 11 (11)
> - RowBatchBytesEnqueued: 1.1 GiB (1221333486)
> - RowBatchQueueGetWaitTime: 1.83s (1833499080)
> - RowBatchQueuePeakMemoryUsage: 599.3 MiB (628430704)
> - RowBatchQueuePutWaitTime: 1ms (1579356)
> - RowBatchesEnqueued: 124 (124)
> - RowsRead: 2,725,888 (2725888)
> - RowsReturned: 0 (0){code}
>
> Based on these counters, I assume the following scenario happened:
> # The concurrent scanner thread count peaked at 23 (NumScannerThreadsStarted,
> PeakScannerThreadConcurrency).
> # The scanner node seems to have tried to schedule a 24th thread, but the
> backend denied it, as indicated by NumScannerThreadMemUnavailable=1.
> # The running threads have been producing output row batches
> (RowBatchesEnqueued=124), but the exec node above has not fetched any
> yet (RowsReturned=0). So the active scanner threads have been consuming their
> memory reservation, including for the decompression activity that happens in
> [parquet-column-chunk-reader.cc|https://github.com/apache/impala/blob/df42225/be/src/exec/parquet/parquet-column-chunk-reader.cc#L155-L177].
> # Just before the scanner node failed, it had consumed Reservation=3.30 GB
> and OtherMemory=849.59 MB, which works out to roughly Reservation=146.92 MB
> and OtherMemory=36.94 MB per thread. This is close to, but slightly higher
> than, the planner's initial mem-reservation=128.00 MB for the scanner node
> plus the 32 MB of
> [hdfs_scanner_thread_max_estimated_bytes|https://github.com/apache/impala/blob/df42225/be/src/exec/hdfs-scan-node.cc#L57-L63]
> for decompression usage per thread.
> Note that the 32 MB of hdfs_scanner_thread_max_estimated_bytes is
> non-reserved memory: it is only allocated as needed during column chunk
> decompression, on the assumption that most cases won't require more than
> 32 MB.
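The per-thread arithmetic in step 4 can be sketched as a quick back-of-the-envelope check. The constants below come from the profile snippet above, not from Impala source; `PerThreadMb` is a hypothetical helper for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Average per-thread memory usage, given a total in MB and a thread count.
inline double PerThreadMb(double total_mb, int num_threads) {
  return total_mb / num_threads;
}

// Reservation=3.30 GB across 23 threads -> ~146.92 MB per thread,
// versus the planner's 128 MB initial mem-reservation.
// OtherMemory=849.59 MB across 23 threads -> ~36.94 MB per thread,
// versus the 32 MB hdfs_scanner_thread_max_estimated_bytes.
inline bool UsageExceedsEstimate() {
  const double per_thread_reservation = PerThreadMb(3.30 * 1024.0, 23);
  const double per_thread_other = PerThreadMb(849.59, 23);
  return per_thread_reservation > 128.0 && per_thread_other > 32.0;
}
```

Both averages land slightly above the planner's per-thread estimates, which is consistent with one thread's overshoot exhausting the query limit.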
> From these insights, I suspect that when the scanner node scheduled the 23rd
> thread, the memory reservation left just barely fit the per-thread
> consumption estimate (128.00 MB + 32 MB), so the backend allowed it to start.
> As decompression progressed, one of the scanner threads tried to allocate
> more memory than was left in the reservation at
> ParquetColumnChunkReader::InitDictionary(). If the 23rd thread had not been
> launched, we might have had enough memory to serve the decompression
> requirement.
> One solution to avoid this OOM is to change the per-thread memory estimation
> in
> [scanner-mem-limiter.cc|https://github.com/apache/impala/blob/df42225/be/src/runtime/scanner-mem-limiter.cc#L59].
> Maybe we should deny the reservation once the spare memory capacity cannot
> fit two threads' allocations (i.e., always leave headroom of one thread's
> allocation).
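The proposed headroom rule could look something like the sketch below. This is NOT the actual ScannerMemLimiter code; the function name and parameters are hypothetical, and it only illustrates the "leave one thread's worth of headroom" idea.

```cpp
#include <cstdint>

// Hypothetical admission check for starting one more scanner thread.
// Instead of admitting a thread whenever spare capacity covers its own
// estimated consumption, require room for TWO threads' estimates, so that
// one thread's estimate is always left as headroom for decompression
// overshoot (e.g. a dictionary page larger than expected).
bool CanStartScannerThread(int64_t spare_capacity_bytes,
                           int64_t est_reservation_per_thread,
                           int64_t est_non_reserved_per_thread) {
  const int64_t per_thread =
      est_reservation_per_thread + est_non_reserved_per_thread;
  return spare_capacity_bytes >= 2 * per_thread;
}
```

With the estimates from this issue (128 MB + 32 MB = 160 MB per thread), a thread would only be admitted while at least 320 MB of spare capacity remains, which would have denied the marginal 23rd thread in the scenario above.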
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]