[ https://issues.apache.org/jira/browse/IMPALA-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773835#comment-17773835 ]

ASF subversion and git services commented on IMPALA-11068:
----------------------------------------------------------

Commit fd0d88d8dcd0beca668af2c89c77ab1a30db79d0 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=fd0d88d8d ]

IMPALA-11068: Add query option to reduce scanner thread launch.

Under a heavy decompression workload, Impala running with scanner thread
parallelism (MT_DOP=0) can still hit an OOM error due to launching too
many threads too soon. We have logic in ScannerMemLimiter to limit the
number of scanner threads by calculating each thread's memory
requirement and estimating the memory growth rate of all threads.
However, it does not prevent a scanner node from quickly launching many
threads and immediately reaching the memtracker's spare capacity. Even
after ScannerMemLimiter rejects a new thread launch, some existing
threads might continue increasing their non-reserved memory for
decompression work until the memory limit is exceeded.
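
A rough illustration of that gap (a minimal sketch with hypothetical
names and illustrative numbers; the real check in
be/src/runtime/scanner-mem-limiter.cc also models the memory growth rate
of running threads):

{code:cpp}
#include <cstdint>
#include <cstdio>

// Hypothetical, simplified model of the admission check: a thread is
// admitted if its *estimated* footprint fits in the spare capacity.
struct MemLimiterSketch {
  int64_t spare_capacity;  // bytes left under the query memory limit

  bool ClaimMemory(int64_t estimated_thread_mem) {
    if (estimated_thread_mem > spare_capacity) return false;
    spare_capacity -= estimated_thread_mem;
    return true;
  }
};

int main() {
  MemLimiterSketch limiter{4LL * 1024 * 1024 * 1024};  // 4 GB, illustrative
  const int64_t est = 160LL * 1024 * 1024;  // 128 MB reserved + 32 MB estimate
  int admitted = 0;
  while (limiter.ClaimMemory(est)) ++admitted;
  printf("admitted %d threads\n", admitted);  // 25 with these numbers
  // The gap: an admitted thread's actual non-reserved usage (e.g.
  // decompression buffers) can later grow past its 32 MB estimate.
  return 0;
}
{code}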

IMPALA-7096 adds the hdfs_scanner_thread_max_estimated_bytes flag as a
heuristic to account for non-reserved memory growth. Increasing this
flag's value can help reduce the thread count, but it might severely
regress other queries that do not have heavy decompression
characteristics. The same applies to lowering the NUM_SCANNER_THREADS
query option.

This patch adds one more query option, HDFS_SCANNER_NON_RESERVED_BYTES,
as an alternative way to mitigate the OOM. It is intended to offer the
same control as hdfs_scanner_thread_max_estimated_bytes, but as a query
option so that tuning can be done at per-query granularity. If this
query option is unset, set to 0, or set to a negative value, the backend
falls back to the value of the hdfs_scanner_thread_max_estimated_bytes
flag.
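
A minimal sketch of that fallback behavior (the variables below stand in
for the real query-option and gflag plumbing, which this sketch does not
reproduce; the 32 MB default comes from the flag's documented value in
the issue description):

{code:cpp}
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins: 0 models an unset query option.
int64_t query_opt_non_reserved_bytes = 0;               // HDFS_SCANNER_NON_RESERVED_BYTES
int64_t flag_max_estimated_bytes = 32LL * 1024 * 1024;  // hdfs_scanner_thread_max_estimated_bytes

// An unset, zero, or negative query option reverts to the flag value.
int64_t EffectiveNonReservedBytes() {
  return query_opt_non_reserved_bytes > 0 ? query_opt_non_reserved_bytes
                                          : flag_max_estimated_bytes;
}

int main() {
  printf("%lld\n", (long long)EffectiveNonReservedBytes());  // 33554432 (32 MB)
  query_opt_non_reserved_bytes = 64LL * 1024 * 1024;         // per-query override
  printf("%lld\n", (long long)EffectiveNonReservedBytes());  // 67108864 (64 MB)
  return 0;
}
{code}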

Testing:
- Add test case in query-options-test.cc and
  TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling.

Change-Id: I03cadf1230eed00d69f2890c82476c6861e37466
Reviewed-on: http://gerrit.cloudera.org:8080/18126
Reviewed-by: Csaba Ringhofer <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Query hit OOM under high decompression activity
> -----------------------------------------------
>
>                 Key: IMPALA-11068
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11068
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Riza Suminto
>            Assignee: Riza Suminto
>            Priority: Major
>             Fix For: Impala 4.4.0
>
>
> A customer reported a query hitting OOM on a wide table with heavy
> decompression activity. The Impala cluster was running with scanner thread
> parallelism (MT_DOP=0).
> The following is the error message shown:
> {code:java}
> Errors: Memory limit exceeded: ParquetColumnChunkReader::InitDictionary() 
> failed to allocate 969825 bytes for dictionary.
> HDFS_SCAN_NODE (id=0) could not allocate 947.09 KB without exceeding limit.
> Error occurred on backend [redacted]:22000 by fragment 
> d346730dc3a3771e:c24e3ccf00000008
> Memory left in process limit: 233.77 GB
> Memory left in query limit: 503.51 KB
> Query(d346730dc3a3771e:c24e3ccf00000000): Limit=4.13 GB Reservation=3.30 GB 
> ReservationLimit=3.30 GB OtherMemory=849.17 MB Total=4.13 GB Peak=4.13 GB
> Fragment d346730dc3a3771e:c24e3ccf00000008: Reservation=3.30 GB 
> OtherMemory=849.59 MB Total=4.13 GB Peak=4.13 GB{code}
>  
> I looked at the corresponding profile of the fragment and noticed some key
> counters, as follows:
> {code:java}
>       Instance d346730dc3a3771e:c24e3ccf00000008 (host=[redacted]:22000)
>       ...
>           HDFS_SCAN_NODE (id=0)
>           ...
>             - AverageHdfsReadThreadConcurrency: 8.00 (8.0)
>             - AverageScannerThreadConcurrency: 23.00 (23.0)
>             - BytesRead: 2.4 GiB (2619685502)
>             ...
>             - NumScannerThreadMemUnavailable: 1 (1)
>             - NumScannerThreadReservationsDenied: 0 (0)
>             - NumScannerThreadsStarted: 23 (23)
>             - NumScannersWithNoReads: 12 (12)
>             - NumStatsFilteredPages: 4,032 (4032)
>             - NumStatsFilteredRowGroups: 1 (1)
>             - PeakMemoryUsage: 4.1 GiB (4431745197)
>             - PeakScannerThreadConcurrency: 23 (23)
>             - PerReadThreadRawHdfsThroughput: 842.1 MiB/s (882954163)
>             - RemoteScanRanges: 11 (11)
>             - RowBatchBytesEnqueued: 1.1 GiB (1221333486)
>             - RowBatchQueueGetWaitTime: 1.83s (1833499080)
>             - RowBatchQueuePeakMemoryUsage: 599.3 MiB (628430704)
>             - RowBatchQueuePutWaitTime: 1ms (1579356)
>             - RowBatchesEnqueued: 124 (124)
>             - RowsRead: 2,725,888 (2725888)
>             - RowsReturned: 0 (0){code}
>  
> Based on these counters, I assume the following scenario happened:
>  # The concurrent scanner thread count peaked at 23 (NumScannerThreadsStarted, 
> PeakScannerThreadConcurrency).
>  # The scanner node seems to have tried to schedule a 24th thread, but the 
> backend denied it, as indicated by NumScannerThreadMemUnavailable=1.
>  # The running threads had been producing output row batches 
> (RowBatchesEnqueued=124), but the exec node above had not fetched any yet 
> (RowsReturned=0). So the active scanner threads had been consuming their 
> memory reservation, including for the decompression activity happening in 
> [parquet-column-chunk-reader.cc|https://github.com/apache/impala/blob/df42225/be/src/exec/parquet/parquet-column-chunk-reader.cc#L155-L177].
>  # Just before the scanner node failed, it had consumed Reservation=3.30 GB 
> and OtherMemory=849.59 MB. Per thread, that is around Reservation=146.92 MB 
> and OtherMemory=36.94 MB, which is close to, but slightly higher than, the 
> planner's initial mem-reservation=128.00 MB for the scanner node plus the 
> 32 MB of 
> [hdfs_scanner_thread_max_estimated_bytes|https://github.com/apache/impala/blob/df42225/be/src/exec/hdfs-scan-node.cc#L57-L63]
>  for decompression usage per thread (a worked version of this arithmetic 
> follows below).
> Note that the 32 MB of hdfs_scanner_thread_max_estimated_bytes is 
> non-reserved memory: it is only allocated as needed during column chunk 
> decompression, and we think that in most cases a thread won't require more 
> than 32 MB.
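> A worked version of the per-thread arithmetic from point 4 (plain C++ just
> to make the numbers explicit; all inputs come from the profile snippet
> above):
> {code:cpp}
> #include <cstdio>
> 
> int main() {
>   // Total scanner-node consumption just before failure, spread over
>   // the 23 running threads.
>   const double reservation_gb = 3.30;     // Reservation=3.30 GB
>   const double other_memory_mb = 849.59;  // OtherMemory=849.59 MB
>   const int threads = 23;                 // PeakScannerThreadConcurrency
> 
>   printf("reservation/thread = %.2f MB\n",
>          reservation_gb * 1024 / threads);  // ~146.92 MB
>   printf("other/thread       = %.2f MB\n",
>          other_memory_mb / threads);        // ~36.94 MB
>   // Compare against the per-thread estimate used at admission time:
>   // 128 MB initial mem-reservation + 32 MB non-reserved estimate.
>   return 0;
> }
> {code}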
> From these insights, I suspect that when the scanner node scheduled the 23rd 
> thread, the memory reservation left just barely fit the per-thread 
> consumption estimate (128.00 MB + 32 MB), so the backend allowed it to 
> start. As the decompression work proceeded, one of the scanner threads tried 
> to allocate more memory than what was left in the reservation at 
> ParquetColumnChunkReader::InitDictionary(). If the 23rd thread had not been 
> launched, we might have had enough memory to serve the decompression 
> requirement.
> One solution to avoid this OOM is to change our per-thread memory estimation 
> in 
> [scanner-mem-limiter.cc|https://github.com/apache/impala/blob/df42225/be/src/runtime/scanner-mem-limiter.cc#L59].
>  Maybe we should deny a reservation once the spare memory capacity cannot 
> fit 2 threads' allocations (i.e., always leave headroom of 1 thread's 
> allocation), as sketched below.
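> A minimal sketch of that headroom rule, using the same hypothetical
> simplified admission model as earlier (not the actual ScannerMemLimiter
> code, and the numbers in main() are illustrative):
> {code:cpp}
> #include <cstdint>
> #include <cstdio>
> 
> // Proposed rule: admit a new scanner thread only if the spare capacity
> // fits TWO threads' estimated allocations, i.e. always keep one
> // thread's worth of headroom for non-reserved growth (such as
> // decompression buffers) of already-running threads.
> bool AdmitScannerThread(int64_t spare_mb, int64_t est_thread_mb) {
>   return spare_mb >= 2 * est_thread_mb;
> }
> 
> int main() {
>   // Roughly the 23rd-thread situation: spare just above one estimate.
>   // The old check (spare >= estimate) would admit; this one denies.
>   printf("%s\n", AdmitScannerThread(165, 160) ? "admit" : "deny");  // deny
>   return 0;
> }
> {code}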


