[
https://issues.apache.org/jira/browse/IMPALA-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405489#comment-17405489
]
Wenzhe Zhou edited comment on IMPALA-10342 at 8/26/21, 10:21 PM:
-----------------------------------------------------------------
We saw the same issue in a customer case. There, warning messages emitted from
two lines of code in the Frontend flooded the impala.long.WARNING files and
caused the coordinator to crash while generating the runtime profile. The call
stacks are the same as the ones reported in this Jira.
Could we add a rate limit to suppress warning messages? We cannot turn warning
messages off, but a rate limit would keep a few call sites from flooding the
logs.
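
To make the rate-limit suggestion concrete, here is a minimal sketch of a fixed-window limiter. It is an illustration only: `WarningRateLimiter`, `ShouldLog()` and `suppressed_total()` are hypothetical names, not existing Impala code, and the Frontend itself is Java, so the same idea would have to be ported there.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical helper (not part of Impala): allows at most `max_per_window`
// warnings per fixed time window and counts the ones it suppresses.
class WarningRateLimiter {
 public:
  using Clock = std::chrono::steady_clock;

  WarningRateLimiter(int64_t max_per_window, std::chrono::milliseconds window)
    : max_per_window_(max_per_window), window_(window) {
    assert(max_per_window > 0);
  }

  // Returns true if the caller may emit a warning at time `now`.
  bool ShouldLog(Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> l(lock_);
    if (!started_ || now - window_start_ >= window_) {
      // Start a new window and reset the per-window budget.
      started_ = true;
      window_start_ = now;
      emitted_in_window_ = 0;
    }
    if (emitted_in_window_ < max_per_window_) {
      ++emitted_in_window_;
      return true;
    }
    ++suppressed_total_;
    return false;
  }

  // Total number of warnings rejected so far, across all windows.
  int64_t suppressed_total() {
    std::lock_guard<std::mutex> l(lock_);
    return suppressed_total_;
  }

 private:
  const int64_t max_per_window_;
  const std::chrono::milliseconds window_;
  std::mutex lock_;
  bool started_ = false;
  Clock::time_point window_start_;
  int64_t emitted_in_window_ = 0;
  int64_t suppressed_total_ = 0;
};
```

A call site would guard its log statement with `if (limiter.ShouldLog()) { ... }` and could periodically report `suppressed_total()` so that the suppression itself stays visible.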
was (Author: wzhou):
We saw the same issue in a customer case. There, two warning messages from the
frontend flooded the impala.long.WARNING files and caused the coordinator to
crash while generating the runtime profile. The call stacks are the same as the
ones reported in this Jira.
Could we add a rate limit to suppress warning messages?
> Flooding of UDF warnings crash the coordinator
> ----------------------------------------------
>
> Key: IMPALA-10342
> URL: https://issues.apache.org/jira/browse/IMPALA-10342
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Fifteen
> Assignee: Fifteen
> Priority: Minor
> Attachments: image-2020-11-19-17-30-22-918.png,
> image-2020-11-23-09-57-49-840.png, image-2021-04-28-20-20-45-798.png,
> impalad-ram-profile.pdf
>
>
> Hi, when encountering an error, both `get_json_object()` and
> `DecimalOperators::IntToDecimalVal` raise a warning.
> Due to their stateless nature, the resulting flood of warnings can easily
> overwhelm the cluster's processing capacity.
> Specifically, we have observed these bottlenecks:
> *Exchange Receiver*: the default value for `rpc_max_message_size` is 50MB.
> The flood of warning messages carried by ReportExecStatusPB may exceed that
> limit, resulting in status reports without profiles. Even if the report
> stays under the limit, the bandwidth consumption is non-trivial.
> *Storage*: as in IMPALA-5256, flooding warnings produce huge log files
> because `stdout/stderr` is not redirected when glog rolls logs. As a result,
> we repeatedly had to clear log files and restart executors.
> *Coordinator*: runtime profiles are serialized to thrift and stored in the
> Coordinator's memory. The warning flood makes `Untracked Memory` rise
> rapidly. I took a heap profile (with pprof) and found that most of the memory
> was used by RuntimeProfile and Strings.
> !image-2020-11-23-09-57-49-840.png!
>
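
One way to keep the report size and the coordinator's memory bounded, whatever the warning rate, is to cap the bytes of warning text kept per query and only count the rest. The sketch below is an assumption-laden illustration: `BoundedWarningLog` and its methods are hypothetical and do not describe how Impala currently builds ReportExecStatusPB.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container (not Impala code): keeps warning messages until a
// byte budget is exhausted, then only counts further messages.
class BoundedWarningLog {
 public:
  explicit BoundedWarningLog(size_t max_bytes) : max_bytes_(max_bytes) {
    assert(max_bytes > 0);
  }

  // Stores `msg` if it still fits in the budget; otherwise drops it.
  void Add(const std::string& msg) {
    if (bytes_used_ + msg.size() > max_bytes_) {
      ++dropped_;
      return;
    }
    bytes_used_ += msg.size();
    messages_.push_back(msg);
  }

  // Returns the stored messages, followed by a one-line summary if any
  // message was dropped.
  std::vector<std::string> Snapshot() const {
    std::vector<std::string> out = messages_;
    if (dropped_ > 0) {
      out.push_back(std::to_string(dropped_) + " more warning(s) suppressed");
    }
    return out;
  }

  size_t bytes_used() const { return bytes_used_; }
  size_t dropped() const { return dropped_; }

 private:
  const size_t max_bytes_;
  size_t bytes_used_ = 0;
  size_t dropped_ = 0;
  std::vector<std::string> messages_;
};
```

With a budget well below `rpc_max_message_size`, the warnings alone could no longer push a status report over the limit, and the summary line still tells the user that more warnings occurred.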
> *A preliminary solution:*
> We suffered a lot from this problem and have come up with a preliminary
> solution.
> # Mute AddWarning() by default.
> # Introduce a query option to re-enable the warnings when needed.
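
The two steps above can be sketched roughly as follows. This is not the actual patch: `QueryOptions`, `enable_udf_warnings` and `WarningContext` are invented stand-ins used only to show the gating logic.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for per-query options; `enable_udf_warnings` is an
// invented name, not an actual Impala query option.
struct QueryOptions {
  bool enable_udf_warnings = false;  // warnings are muted by default
};

// Hypothetical stand-in for the context through which a UDF reports warnings.
class WarningContext {
 public:
  explicit WarningContext(const QueryOptions* opts) : opts_(opts) {
    assert(opts != nullptr);
  }

  // Stores the warning only if the query opted in; otherwise just counts it.
  // Returns true if the message was stored.
  bool AddWarning(const std::string& msg) {
    if (!opts_->enable_udf_warnings) {
      ++muted_count_;
      return false;
    }
    warnings_.push_back(msg);
    return true;
  }

  const std::vector<std::string>& warnings() const { return warnings_; }
  int64_t muted_count() const { return muted_count_; }

 private:
  const QueryOptions* opts_;
  std::vector<std::string> warnings_;
  int64_t muted_count_ = 0;
};
```

Keeping a count of muted warnings costs a single integer per query and still lets the profile show that something was suppressed.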
> *Testing:*
> With warning messages muted, the load on the coordinator nodes is greatly
> reduced and heap profiles are no longer dominated by RuntimeProfile.
>
> *Update*
> We encountered a similar crash with a `get_json_object()` query: each time
> the query is submitted, the Coordinator crashes.
> !image-2021-04-28-20-20-45-798.png!
> Log:
> {code:java}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x0000000002c64dca, pid=3633220, tid=0x00007eff73308700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_181-b13) (build
> 1.8.0_181-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.181-b13 mixed mode
> linux-amd64 )
> # Problematic frame:
> # C [impalad+0x2864dca]
> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> unsigned long, int)+0x13a
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /run/cloudera-scm-agent/process/10376-impala-IMPALAD/hs_err_pid3633220.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> d. The connection had 2 associated session(s).
> I0427 13:43:03.907536 3853145 status.cc:126] Couldn't serialize thrift object:
> std::bad_alloc
> @ 0xbf4ef9
> @ 0x1352d5f
> @ 0x1352eaf
> @ 0x11986de
> @ 0x122516c
> @ 0x1225515
> @ 0x137ee36
> @ 0x13801a0
> @ 0x139682f
> @ 0x139915a
> @ 0x1399784
> @ 0x7f34791e0e24
> @ 0x7f3475dd835c
> {code}
> StackTrace:
> {code:java}
> Stack: [0x00007eff72b08000,0x00007eff73309000], sp=0x00007eff733006b0, free
> space=8161k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
> code)
> C [impalad+0x2864dca]
> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> unsigned long, int)+0x13a
> C [impalad+0x286519f] tcmalloc::ThreadCache::Scavenge()+0x3f
> C [impalad+0x29a211a] operator delete(void*)+0x32a
> C [impalad+0xae94d9]
> impala::TRuntimeProfileNode::~TRuntimeProfileNode()+0x289
> C [impalad+0xae4987]
> impala::TRuntimeProfileTree::~TRuntimeProfileTree()+0x47
> C [impalad+0xf5280a] impala::RuntimeProfile::Compress(std::vector<unsigned
> char, std::allocator<unsigned char> >*) const+0x3aa
> C [impalad+0xf52eb0]
> impala::RuntimeProfile::SerializeToArchiveString(std::basic_stringstream<char,
> std::char_traits<char>, std::allocator<char> >*) const+0x40
> C [impalad+0xd986df]
> impala::ImpalaServer::GetRuntimeProfileOutput(impala::TUniqueId const&,
> std::string const&, impala::TRuntimeProfileFormat::type,
> std::basic_stringstream<char, std::char_traits<char>, std::allocator<char>
> >*, impala::TRuntimeProfileTree*,
> rapidjson::GenericDocument<rapidjson::UTF8<char>,
> rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>,
> rapidjson::CrtAllocator>*)+0x5bf
> C [impalad+0xe2516d]
> impala::ImpalaHttpHandler::QueryProfileHelper(kudu::WebCallbackRegistry::WebRequest
> const&, rapidjson::GenericDocument<rapidjson::UTF8<char>,
> rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>,
> rapidjson::CrtAllocator>*, impala::TRuntimeProfileFormat::type)+0x4ed
> C [impalad+0xe25516]
> impala::ImpalaHttpHandler::QueryProfileEncodedHandler(kudu::WebCallbackRegistry::WebRequest
> const&, rapidjson::GenericDocument<rapidjson::UTF8<char>,
> rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>,
> rapidjson::CrtAllocator>*)+0x16
> C [impalad+0xf7ee37] impala::Webserver::RenderUrlWithTemplate(sq_connection
> const*, kudu::WebCallbackRegistry::WebRequest const&,
> impala::Webserver::UrlHandler const&, std::basic_stringstream<char,
> std::char_traits<char>, std::allocator<char> >*, impala::ContentType*)+0x177
> C [impalad+0xf801a1]
> impala::Webserver::BeginRequestCallback(sq_connection*,
> sq_request_info*)+0x951
> C [impalad+0xf96830] kudu::StringGauge::~StringGauge()+0x100
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)