> >> We are iterating over a sorted list of pmd pointers from cmap
> >> dp->poll_threads under the protection of the global dp_netdev_mutex lock.
> >> Each pmd struct in this cmap (except for the NON_PMD_CORE_ID one) has a
> >> running pmd thread, because the thread is created immediately after the
> >> pmd is inserted into dp->poll_threads in reconfigure_pmd_threads() and
> >> terminated immediately before the pmd is removed from that cmap in
> >> dp_netdev_del_pmd(). Can either of these happen while dp_netdev_mutex is
> >> held? If not, then there is not even a potential race condition.
> >>
> >> So I think we can be sure that the pmd thread is running when calling
> >> pmd_perf_stats_clear() here. Do you agree?
>
> The main issue here is that the PMD thread could sleep and never call
> pmd_perf_start_iteration(). The PMD thread could wait for a reload inside the
> following loop if it has no rx queues to poll:
>
>     if (!poll_cnt) {
>         while (seq_read(pmd->reload_seq) == pmd->last_reload_seq) {
>             seq_wait(pmd->reload_seq, pmd->last_reload_seq);
>             poll_block();
>         }
>         lc = UINT_MAX;
>     }
>
> And the main thread will wait forever.
>
Right, I missed that possibility.
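Just to spell the failure mode out for myself: with a plain "set a flag and
wait" handshake (a standalone sketch with invented names, not the actual patch
code), the main thread would spin forever while the PMD sits in poll_block():

    /* Standalone sketch only -- not the OVS code.  It illustrates why a
     * "set a flag and wait for the PMD" scheme hangs when the PMD thread is
     * parked in poll_block() waiting for a reload and therefore never reaches
     * the point where it would notice the request. */
    #include <stdbool.h>
    #include <unistd.h>

    struct sketch_perf_stats {
        volatile bool clear;    /* Clear request set by the main thread. */
    };

    static void
    sketch_request_clear_and_wait(struct sketch_perf_stats *s)
    {
        s->clear = true;        /* Ask the PMD to clear at its next iteration. */
        while (s->clear) {      /* If the PMD never iterates, this never ends. */
            usleep(1000);
        }
    }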
>
> About volatiles and atomics:
> 'volatile' only forces the compiler to emit actual loads/stores. It does not
> prevent the compiler from reordering them relative to non-volatile accesses,
> and it definitely does not prevent memory operations from being reordered by
> the CPU.
>
> Let's look at the pmd_perf_stats_clear__() function:
>
> void
> pmd_perf_stats_clear__(struct pmd_perf_stats *s)
> {
>     for (int i = 0; i < PMD_N_STATS; i++) {
>         atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
>     }
>     memset(&s->current, 0, sizeof(struct iter_stats));
>     memset(&s->totals, 0, sizeof(struct iter_stats));
>     histogram_clear(&s->cycles);
>     histogram_clear(&s->pkts);
>     histogram_clear(&s->cycles_per_pkt);
>     histogram_clear(&s->upcalls);
>     histogram_clear(&s->cycles_per_upcall);
>     histogram_clear(&s->pkts_per_batch);
>     histogram_clear(&s->max_vhost_qfill);
>     history_init(&s->iterations);
>     history_init(&s->milliseconds);
>     s->start_ms = time_msec();
>     s->milliseconds.sample[0].timestamp = s->start_ms;
>     s->log_begin_it = UINT64_MAX;
>     s->log_end_it = UINT64_MAX;
>     /* Clearing finished. */
>     s->clear = false;
> }
>
> Let's assume that 'clear' is volatile.
> 's->clear = false;' is the last statement in the function, but the compiler
> is still free to reorder it with the non-volatile stores above it, which
> would break the logic.
> To avoid this reordering a compiler barrier is required, but that still does
> not help with CPU reordering; on non-x86 architectures a real SMP write
> barrier would be required as well.
> A mutex would be a much better solution.
>
> Let's assume that 'clear' is atomic.
> A relaxed atomic, just like volatile, only guarantees that real loads/stores
> are emitted. To prevent reordering on the CPU we would have to use
> release-acquire memory ordering for the atomic operations. Isn't that most of
> what a mutex does already?
>
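(To make that point concrete for myself with plain C11 atomics rather than the
OVS wrappers -- a standalone sketch with invented names -- the clearing thread
would have to publish the flag with a release store and the PMD would have to
read it with an acquire load:)

    /* Standalone C11 sketch (invented names): release/acquire pairing so the
     * PMD never observes 'clear == false' before the cleared data is visible. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    struct ra_sketch_stats {
        unsigned long long totals[8];
        atomic_bool clear;
    };

    /* Clearing side: write the data first, then publish with a release store. */
    static void
    ra_sketch_clear(struct ra_sketch_stats *s)
    {
        memset(s->totals, 0, sizeof s->totals);
        atomic_store_explicit(&s->clear, false, memory_order_release);
    }

    /* PMD side: the acquire load pairs with the release store above, so all
     * writes made before that store are visible once this returns false. */
    static bool
    ra_sketch_clear_pending(struct ra_sketch_stats *s)
    {
        return atomic_load_explicit(&s->clear, memory_order_acquire);
    }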
> Conclusion:
> Since you want to protect your shared data from access by other threads you
> have to use a mutex or not protect it at all (atomics and volatiles are
> effectively useless for this use case unless you combine them with memory
> barriers or heavyweight memory ordering qualifiers).
> Given that you read the stats directly (non-atomically and without volatile
> accesses) you will never have fully accurate data anyway. From that point of
> view, trying to protect data clearing just to get slightly more accurate
> stats-show output looks strange.
>
> Some reading:
> https://www.kernel.org/doc/Documentation/process/volatile-considered-harmful.rst
Thanks, I see it more clearly now.
We don't need 100% accurate statistics; we just need to make sure the
statistics are not completely corrupted.
Billy correctly pointed out that clearing the new PMD metrics (which do not
have zero offsets) from outside the PMD thread risks corruption, because a PMD
concurrently updating a histogram bin counter may overwrite a just-cleared bin
with the incremented value.
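(Concretely, it is the classic lost-update race on a non-atomic counter;
standalone sketch with invented names:)

    /* Standalone sketch (invented names) of the lost-update race Billy
     * described: the PMD's read-modify-write on a histogram bin can overwrite
     * a concurrent clear performed by the main thread. */
    #include <stdint.h>

    struct race_sketch_histogram {
        uint64_t bin[32];
    };

    static void
    race_sketch_hist_inc(struct race_sketch_histogram *h, int idx)
    {
        uint64_t old = h->bin[idx];   /* (1) PMD reads bin value N.              */
                                      /* (2) Main thread clears the bin to 0.    */
        h->bin[idx] = old + 1;        /* (3) PMD writes back N + 1, undoing (2). */
    }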
So we need to clear the new metrics from within the PMD if the PMD is actually
running. Perhaps we don't have to block the pmd-stats-clear thread until the
data is actually cleared, but we must make sure the data does get cleared in
all situations.
What if we protected the pmd_perf_stats with a mutex for clearing and treated
the entire "for (;;) {...}" loop of the pmd thread as a critical section, i.e.
let the pmd hold the mutex while polling?
The main thread would try to grab that mutex in pmd_perf_stats_clear(). If it
succeeds, it can clear the stats directly before releasing the mutex, so the
pmd cannot start polling in the meantime.
If not, it would just set the pmd->clear flag and return after a short delay,
without further synchronization. Or, if we want to block the pmd-stats-clear
thread, we could use another pthread_cond to let the pmd thread signal the
completion of the clearing safely (as done for reload).
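Roughly what I have in mind for the try-lock variant (sketch only, not a final
patch; it assumes a new 'stats_mutex' member in struct pmd_perf_stats, an
atomic 'clear' request flag, and the usual OVS helpers from ovs-thread.h and
ovs-atomic.h):

    /* Sketch only.  Assumes:
     *  - a new 'struct ovs_mutex stats_mutex' member in struct pmd_perf_stats,
     *  - an atomic 'clear' request flag in the same struct,
     *  - the PMD thread holds stats_mutex for its whole polling loop. */

    /* Main thread (pmd-stats-clear): clear directly if the PMD is not polling,
     * otherwise leave a request for the PMD to clear at its next iteration. */
    void
    pmd_perf_stats_clear(struct pmd_perf_stats *s)
    {
        if (!ovs_mutex_trylock(&s->stats_mutex)) {
            pmd_perf_stats_clear__(s);              /* PMD not polling: safe. */
            ovs_mutex_unlock(&s->stats_mutex);
        } else {
            atomic_store_relaxed(&s->clear, true);  /* PMD will clear itself. */
        }
    }

    /* PMD thread, at the start of each iteration while holding stats_mutex:
     * honor a deferred clear request before updating any counters. */
    void
    pmd_perf_start_iteration(struct pmd_perf_stats *s)
        OVS_REQUIRES(s->stats_mutex)
    {
        bool clear;

        atomic_read_relaxed(&s->clear, &clear);
        if (OVS_UNLIKELY(clear)) {
            pmd_perf_stats_clear__(s);
            atomic_store_relaxed(&s->clear, false);
        }
        /* ... existing start-of-iteration code ... */
    }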
>
>
> About the performance impact of relaxed atomic operations:
> On most modern architectures, relaxed load/store operations on aligned 32-bit
> integers simply become ordinary loads/stores without any memory fences.
> This means that an rcu_get-style operation like get_vid() costs nothing.
> IMHO, it's clear that we should use atomic_bool for 'pmd_perf_metrics' to
> avoid any issues on exotic architectures. The compiler will do all the dirty
> work for us without any performance impact in the general case.
OK, will keep that.
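Something along these lines then, I suppose (sketch: a file-scope atomic_bool
knob read with the existing relaxed helper in the fast path; the helper name is
just for illustration):

    /* Sketch (assumes OVS's ovs-atomic.h): the 'pmd_perf_metrics' knob as a
     * file-scope atomic_bool, read with a relaxed load in the fast path.  On
     * common architectures this compiles down to a plain load. */
    static atomic_bool pmd_perf_metrics = ATOMIC_VAR_INIT(false);

    static inline bool
    pmd_perf_metrics_enabled(void)
    {
        bool enabled;

        atomic_read_relaxed(&pmd_perf_metrics, &enabled);
        return enabled;
    }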
BR, Jan