[
https://issues.apache.org/jira/browse/IMPALA-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fifteen updated IMPALA-10342:
-----------------------------
Description:
Hi, when encountering errors, both `get_json_object()` and
`DecimalOperators::IntToDecimalVal` raise a warning per affected row.
Due to their stateless nature, the resulting warning flood can easily overwhelm
the cluster's processing capacity.
Specifically, we have observed these bottlenecks:
*Exchange Receiver*: the default value of `rpc_max_message_size` is 50MB.
The flood of warning messages carried by ReportExecStatusPB may exceed that
limit, causing profile-less status reports. Even when the report size stays
under the limit, the bandwidth consumption is non-trivial.
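To illustrate how quickly row-level warnings can hit that limit, here is a
back-of-the-envelope sketch (the ~100-byte average warning size is an assumed
figure for illustration, not measured from Impala):

```cpp
#include <cstdint>

// Rows needed before a status report carrying one warning per row
// would exceed the RPC message limit on its own.
// avg_warning_bytes is an assumed figure, not measured from Impala.
int64_t RowsToOverflow(int64_t rpc_max_bytes, int64_t avg_warning_bytes) {
  return rpc_max_bytes / avg_warning_bytes;
}

// With the 50MB default and ~100-byte warnings, roughly half a million
// bad rows are already enough: 50 * 1024 * 1024 / 100 = 524288.
```

A single scan over a moderately dirty table can easily touch that many rows.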
*Storage:* as in IMPALA-5256, flooding warnings produce huge log files, since
`stdout/stderr` are not redirected when glog rolls its logs. Under these
circumstances we repeatedly had to clear log files and restart executors.
*Coordinator*: runtime profiles are serialized to thrift and stored in the
Coordinator's memory. The warning flood makes `Untracked Memory` rise rapidly.
I took a heap profile (with pprof) and found that most of the memory was used
by RuntimeProfile and Strings.
!image-2020-11-23-09-57-49-840.png!
*A preliminary solution:*
We suffered a lot from this problem and came up with a preliminary solution:
# Mute AddWarning() by default, which is straightforward.
# Introduce a query option to re-enable the warnings when needed.
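A minimal sketch of the gating we applied (the class shape and the
`disable_row_warnings` option name are simplified stand-ins for illustration,
not Impala's actual code):

```cpp
#include <mutex>
#include <string>
#include <vector>

// Simplified stand-in for Impala's RuntimeState; the real class and the
// query-option name (disable_row_warnings) are hypothetical here.
class RuntimeState {
 public:
  explicit RuntimeState(bool disable_row_warnings)
      : disable_row_warnings_(disable_row_warnings) {}

  // Muted by default via the query option; warnings are only collected
  // when the user re-enables them for debugging.
  void AddWarning(const std::string& msg) {
    if (disable_row_warnings_) return;  // drop row-level warnings early
    std::lock_guard<std::mutex> l(lock_);
    warnings_.push_back(msg);
  }

  size_t num_warnings() {
    std::lock_guard<std::mutex> l(lock_);
    return warnings_.size();
  }

 private:
  const bool disable_row_warnings_;
  std::mutex lock_;
  std::vector<std::string> warnings_;
};
```

Dropping the warning before it is ever stored keeps it out of the
RuntimeProfile, the status reports, and the logs at once.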
*Testing:*
With warning messages muted, the burden on the Coordinator nodes is greatly
alleviated, and heap profiles are no longer dominated by RuntimeProfile.
We also encountered a similar crash with a `get_json_object()` query: each time
the query is submitted, the Coordinator crashes.
!image-2021-04-28-20-20-45-798.png!
Crash info:
```
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000002c64dca, pid=3633220, tid=0x00007eff73308700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build
1.8.0_181-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode linux-amd64
)
# Problematic frame:
# C [impalad+0x2864dca]
tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
unsigned long, int)+0x13a
#
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /run/cloudera-scm-agent/process/10376-impala-IMPALAD/hs_err_pid3633220.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
d. The connection had 2 associated session(s).
I0427 13:43:02.920620 3840090 impala-server.cc:1314] GetSessionState(): Invalid
session id: e442483652f4c884:53795d98ca8811ad
I0427 13:43:03.589779 3856085 query-exec-mgr.cc:192] ReleaseQueryState():
deleted query_id=d542f9361df1e550:08c99b0000000000
I0427 13:43:03.729714 3634227 ImpaladCatalog.java:206] Received large catalog
object(>100mb): TABLE:beacon_olap.t_od_light_olap_other_info is 326827833bytes
I0427 13:43:03.729748 3634227 ImpaladCatalog.java:209] Adding:
TABLE:beacon_olap.t_od_light_olap_other_info version: 14104731 size: 326827833
I0427 13:43:03.749281 3634227 ImpaladCatalog.java:209] Adding:
TABLE:beacon_olap.now_nowzb_ods_tencent_live_beacon_sdk_now_data_rt version:
14104726 size: 4573393
I0427 13:43:03.907536 3853145 status.cc:126] Couldn't serialize thrift object:
std::bad_alloc
@ 0xbf4ef9
@ 0x1352d5f
@ 0x1352eaf
@ 0x11986de
@ 0x122516c
@ 0x1225515
@ 0x137ee36
@ 0x13801a0
@ 0x139682f
@ 0x139915a
@ 0x1399784
@ 0x7f34791e0e24
@ 0x7f3475dd835c
```
> A way to alleviate congestion caused by row-level warnings
> -----------------------------------------------------------
>
> Key: IMPALA-10342
> URL: https://issues.apache.org/jira/browse/IMPALA-10342
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Fifteen
> Priority: Minor
> Attachments: image-2020-11-19-17-30-22-918.png,
> image-2020-11-23-09-57-49-840.png, image-2021-04-28-20-20-45-798.png,
> impalad-ram-profile.pdf
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)