[ 
https://issues.apache.org/jira/browse/IMPALA-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173940#comment-17173940
 ] 

Aman Sinha commented on IMPALA-10063:
-------------------------------------

I added the counters to be dumped to the log when DCHECK is encountered..here 
is the output:
{noformat}
system-state-info.cc:146] Check failed: total_tics > 0 (-5824961 vs. 0) 
cur_sum: 5370721, old_sum: 11195682
{noformat}

I looked at the code and the /proc/stat output to understand why the current 
snapshot of the cpu counter is lower than the old snapshot.  The relevant code 
is as follows:
{noformat}
  /// The enum names correspond to the fields of /proc/stat.
  enum PROC_STAT_CPU_VALUES {
    CPU_USER = 0,
    CPU_NICE,
    CPU_SYSTEM,
    CPU_IDLE,
    CPU_IOWAIT,
    NUM_CPU_VALUES
  };
...

  const CpuValues& cur = cpu_values_[cpu_val_idx_];
  const CpuValues& old = cpu_values_[1 - cpu_val_idx_];

  // Sum up all counters
  int64_t cur_sum = accumulate(cur.begin(), cur.end(), 0L);
  int64_t old_sum = accumulate(old.begin(), old.end(), 0L);

  int64_t total_tics = cur_sum - old_sum;
{noformat}

The /proc/stat output (truncated) is as follows:
{noformat}
$ cat /proc/stat
cpu  2267302 4091 754132 4274554 2607 0 156861 0 0 0
...
{noformat}

The 'accumulate()' call above adds up the first 5 columns (user, nice, system, 
idle, iowait).  Since these counters are supposedly increasing with time (they 
measure the ticks since the last reboot), each snapshot is supposed to be 
greater than the last one.   As an aside,  I am not sure why the idle and 
iowait time are added to the actual cpu usage timers. 

There are 2 possibilities: (1) on this machine for whatever reason the 
/proc/stat's  cpu counters are not getting updated correctly, or (2) within our 
code somehow the cpu_val_idx_ got swapped between the current and old.  This 
index is supposed to rotate from 0 to 1 and back to 0 when 
SystemStateInfo::ReadCurrentProcStat() is called.    I will have to add some 
more debug messages to narrow down.

> Intermittent crash seen during ComputeCpuRatios
> -----------------------------------------------
>
>                 Key: IMPALA-10063
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10063
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.4.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>
> On my desktop running Ubuntu 18.04 I sometimes  (once in a few days) see 
> impalad hit a DCHECK and crash in ComputeCpuRatios with the following stack:
> {noformat}
> system-state-info.cc:146] Check failed: total_tics > 0 (-9802454 vs. 0) 
> *** Check failure stack trace: ***
>     @          0x5140a1c  google::LogMessage::Fail()
>     @          0x514230c  google::LogMessage::SendToLog()
>     @          0x514037a  google::LogMessage::Flush()
>     @          0x5143f78  google::LogMessageFatal::~LogMessageFatal()
>     @          0x266a4f7  impala::SystemStateInfo::ComputeCpuRatios()
>     @          0x2669c89  
> impala::SystemStateInfo::CaptureSystemStateSnapshot()
>     @          0x2193dae  
> _ZZN6impala7ExecEnv19InitSystemStateInfoEvENKUlvE_clEv
>     @          0x2194ab0  
> _ZNSt17_Function_handlerIFvvEZN6impala7ExecEnv19InitSystemStateInfoEvEUlvE_E9_M_invokeERKSt9_Any_data
>     @          0x26244af  std::function<>::operator()()
>     @          0x2622a97  impala::PeriodicCounterUpdater::UpdateLoop()
>     @          0x262dc10  boost::_mfi::mf0<>::operator()()
>     @          0x262db72  boost::_bi::list1<>::operator()<>()
>     @          0x262db1a  boost::_bi::bind_t<>::operator()()
>     @          0x262dadb  boost::detail::thread_data<>::run()
>     @          0x3e47771  thread_proxy
>     @     0x7f0a9bc6d6da  start_thread
>     @     0x7f0a986a5a3e  clone
> {noformat}
> Since the total_tics calculation is dependent on the system clock, on further 
> digging using the adjtimex utility, I found some odd  values for the offset 
> and frequency on the machine .. they are negative instead of positive:   
> [UPDATE: actually based on adjtimex man page, the negative value is 
> acceptable, so it is not clear if this could be a contributing factor or not]
> {noformat}
> $ adjtimex -p
>          mode: 0
>        offset: -1232794                       
>     frequency: -109214                      
>      maxerror: 236000
>      esterror: 0
>        status: 8193
> time_constant: 5
>     precision: 1
>     tolerance: 32768000
>          tick: 10000
>      raw time:  1596910192s 365781168us = 1596910192.365781168
> {noformat}
> Regardless of this, we should prevent it from crashing impala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to