[
https://issues.apache.org/jira/browse/IMPALA-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173940#comment-17173940
]
Aman Sinha commented on IMPALA-10063:
-------------------------------------
I added the counters to be dumped to the log when DCHECK is encountered..here
is the output:
{noformat}
system-state-info.cc:146] Check failed: total_tics > 0 (-5824961 vs. 0)
cur_sum: 5370721, old_sum: 11195682
{noformat}
I looked at the code and the /proc/stat output to understand why the current
snapshot of the cpu counter is lower than the old snapshot. The relevant code
is as follows:
{noformat}
/// The enum names correspond to the fields of /proc/stat.
enum PROC_STAT_CPU_VALUES {
CPU_USER = 0,
CPU_NICE,
CPU_SYSTEM,
CPU_IDLE,
CPU_IOWAIT,
NUM_CPU_VALUES
};
...
const CpuValues& cur = cpu_values_[cpu_val_idx_];
const CpuValues& old = cpu_values_[1 - cpu_val_idx_];
// Sum up all counters
int64_t cur_sum = accumulate(cur.begin(), cur.end(), 0L);
int64_t old_sum = accumulate(old.begin(), old.end(), 0L);
int64_t total_tics = cur_sum - old_sum;
{noformat}
The /proc/stat output (truncated) is as follows:
{noformat}
$ cat /proc/stat
cpu 2267302 4091 754132 4274554 2607 0 156861 0 0 0
...
{noformat}
The 'accumulate()' call above adds up the first 5 columns (user, nice, system,
idle, iowait). Since these counters are supposedly increasing with time (they
measure the ticks since the last reboot), each snapshot is supposed to be
greater than the last one. As an aside, I am not sure why the idle and
iowait time are added to the actual cpu usage timers.
There are 2 possibilities: (1) on this machine for whatever reason the
/proc/stat's cpu counters are not getting updated correctly, or (2) within our
code somehow the cpu_val_idx_ got swapped between the current and old. This
index is supposed to rotate from 0 to 1 and back to 0 when
SystemStateInfo::ReadCurrentProcStat() is called. I will have to add some
more debug messages to narrow down.
> Intermittent crash seen during ComputeCpuRatios
> -----------------------------------------------
>
> Key: IMPALA-10063
> URL: https://issues.apache.org/jira/browse/IMPALA-10063
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 3.4.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Priority: Major
>
> On my desktop running Ubuntu 18.04 I sometimes (once in a few days) see
> impalad hit a DCHECK and crash in ComputeCpuRatios with the following stack:
> {noformat}
> system-state-info.cc:146] Check failed: total_tics > 0 (-9802454 vs. 0)
> *** Check failure stack trace: ***
> @ 0x5140a1c google::LogMessage::Fail()
> @ 0x514230c google::LogMessage::SendToLog()
> @ 0x514037a google::LogMessage::Flush()
> @ 0x5143f78 google::LogMessageFatal::~LogMessageFatal()
> @ 0x266a4f7 impala::SystemStateInfo::ComputeCpuRatios()
> @ 0x2669c89
> impala::SystemStateInfo::CaptureSystemStateSnapshot()
> @ 0x2193dae
> _ZZN6impala7ExecEnv19InitSystemStateInfoEvENKUlvE_clEv
> @ 0x2194ab0
> _ZNSt17_Function_handlerIFvvEZN6impala7ExecEnv19InitSystemStateInfoEvEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x26244af std::function<>::operator()()
> @ 0x2622a97 impala::PeriodicCounterUpdater::UpdateLoop()
> @ 0x262dc10 boost::_mfi::mf0<>::operator()()
> @ 0x262db72 boost::_bi::list1<>::operator()<>()
> @ 0x262db1a boost::_bi::bind_t<>::operator()()
> @ 0x262dadb boost::detail::thread_data<>::run()
> @ 0x3e47771 thread_proxy
> @ 0x7f0a9bc6d6da start_thread
> @ 0x7f0a986a5a3e clone
> {noformat}
> Since the total_tics calculation is dependent on the system clock, on further
> digging using the adjtimex utility, I found some odd values for the offset
> and frequency on the machine .. they are negative instead of positive:
> [UPDATE: actually based on adjtimex man page, the negative value is
> acceptable, so it is not clear if this could be a contributing factor or not]
> {noformat}
> $ adjtimex -p
> mode: 0
> offset: -1232794
> frequency: -109214
> maxerror: 236000
> esterror: 0
> status: 8193
> time_constant: 5
> precision: 1
> tolerance: 32768000
> tick: 10000
> raw time: 1596910192s 365781168us = 1596910192.365781168
> {noformat}
> Regardless of this, we should prevent it from crashing impala.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]