On Fri, Jul 13, 2012 at 12:23 AM, Jakub Jermar <[email protected]> wrote:
> I noticed we now have conflicting implementations of
> ipi_wait_for_idle(). Moreover you call yours before sending the IPI
> while the mainline implementation calls it afterwards:
>
> +int l_apic_send_custom_ipi(uint8_t apicid, uint8_t vector)
> +{
> +       icr_t icr;
> +
> +       /* Wait for a destination cpu to accept our previous ipi. */
> +       ipi_wait_for_idle();
>
> I guess doing it before can eliminate some waiting time by simply doing
> some useful work before sending the next IPI, but on the other hand, you
> proceed with uncertainty that the IPI has indeed been delivered when you
> needed it.

If I understood the documentation [1] correctly, the
delivery status is "pending" until the destination cpu
accepts the interrupt. That just means the destination
cpu's APIC noted the IPI. Therefore, it does not indicate
the IPI was dispatched to software and it definitely
does not mean that software handled the IPI.

As far as I can see, it is only used to avoid sending
another IPI while sending the previous one is still
in progress. The documentation does not specify what
happens if we try to send another IPI and the previous
has not yet been accepted but I am guessing bad things
might happen, e.g. the previous IPI may be lost or the
current one discarded.
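
For illustration, waiting for idle amounts to polling a
single bit. A minimal sketch, assuming a memory-mapped low
ICR dword; bit 12 is the delivery status bit per [1], but
the variable and function names here are illustrative, not
HelenOS's actual l_apic definitions:

#include <stdint.h>

#define ICR_DELIVERY_STATUS  (1 << 12)  /* 0 = idle, 1 = send pending */

/* Illustrative mapping of the low dword of the ICR. */
static volatile uint32_t *icr_lo;

static void ipi_wait_for_idle_sketch(void)
{
        /* Spin until the local APIC reports the previous IPI
         * as accepted by the destination APIC (which says
         * nothing about software having handled it). */
        while (*icr_lo & ICR_DELIVERY_STATUS)
                ;
}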

The relevant parts of the docs follow. Read only if
interested.

In subsection 10.6.1 of "ISSUING INTERPROCESSOR INTERRUPTS":
"0 (Idle) Indicates that this local APIC has completed
sending any previous IPIs.

1 (Send Pending) Indicates that this local APIC has not
completed sending the last IPI."

I think they mean "accepted" when stating "completed sending"
because in 10.5.1 "Local Vector Table" they explain:
"0 (Idle)
There is currently no activity for this inter-
rupt source, or the previous interrupt from
this source was delivered to the processor
core and accepted.
1 (Send Pending)
Indicates that an interrupt from this source
has been delivered to the processor core but
has not yet been accepted"

So what is accepting an interrupt?
"10.8.4 Interrupt Acceptance for Fixed Interrupts

The IRR contains the active interrupt requests that have been
accepted, but not yet
dispatched to the processor for servicing. When the local APIC accepts
an interrupt,
it sets the bit in the IRR that corresponds the vector of the accepted
interrupt. When
the processor core is ready to handle the next interrupt, the local
APIC clears the
highest priority IRR bit that is set and sets the corresponding ISR
bit. The vector for
the highest priority bit set in the ISR is then dispatched to the
processor core for
servicing.
[..]
If more than one interrupt is generated with the same vector number,
the local APIC
can set the bit for the vector both in the IRR and the ISR. This means
that for the
Pentium 4 and Intel Xeon processors, the IRR and ISR can queue two
interrupts for
each interrupt vector: one in the IRR and one in the ISR. Any
additional interrupts
issued for the same interrupt vector are collapsed into the single bit
in the IRR.

For the P6 family and Pentium processors, the IRR and ISR registers can queue no
more than two interrupts per interrupt vector and will reject other
interrupts that
are received within the same vector
"
Rejecting an IPI leads to sending a retry message to
the source APIC. I assume the source APIC will retry
automatically without needing software intervention.
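
To make the IRR/ISR queueing concrete, here is a toy model
of the P6-style behavior quoted above; it is purely
illustrative C, not APIC hardware or HelenOS code:

#include <stdbool.h>
#include <stdint.h>

/* One slot per vector in the IRR and one in the ISR. */
static uint32_t irr[8], isr[8];  /* 256 bits each */

static bool accept_fixed_interrupt(uint8_t vec)
{
        uint32_t mask = 1U << (vec % 32);

        if (irr[vec / 32] & mask)
                return false;  /* IRR slot taken: reject, source APIC retries */

        irr[vec / 32] |= mask;  /* accepted: request noted in the IRR */
        return true;
}

static void dispatch_highest(void)
{
        /* Move the highest priority pending vector from the IRR
         * to the ISR when the core is ready for the next interrupt. */
        for (int vec = 255; vec >= 0; vec--) {
                uint32_t mask = 1U << (vec % 32);
                if (irr[vec / 32] & mask) {
                        irr[vec / 32] &= ~mask;
                        isr[vec / 32] |= mask;  /* now in service */
                        return;
                }
        }
}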

>
> In l_apic_broadcast_custom_ipi(), you have:
>
> +       if (CPU->arch.id != l_apic_id()) {
> +#ifdef CONFIG_DEBUG
> +               printf("lapic error: LAPIC ID (%" PRIu8 ") and hw ID assigned by BSP"
> +                       " (%u) differ. Correcting to LAPIC ID.\n", l_apic_id(),
> +                       CPU->arch.id);
> +#endif
> +               CPU->arch.id = l_apic_id();
> +       }
>
> Can this really happen? If not, why bother with cpu_arch_id_init()?

You are right. I will remove the check.

I added it at first because arch.id is populated by MADT
or MPS (?) and not by asking the local APIC for its id
directly. I am not familiar with the other sources of the
id so I did not trust them fully :-).

> I noticed two (IMHO) antagonistic changes when it comes to dealing with
> latency. First, you started to prefer the CPU on which a thread last ran
> when readying the thread again after it was sleeping. This may
> questionably improve cache utilization (but the thread may have been
> sleeping for quite some time),

It also helps maintain a steady load. If a thread wakes up
many threads at once (e.g. a condvar broadcast), they would
all be migrated by the wakeup to the local cpu, and the
distribution of threads among the cpus would then be
governed by thread_ready and not by the load balancing
threads.
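
In code, the idea is roughly this (a simplified sketch,
not the actual change; thread->cpu holds the cpu the
thread last ran on):

/* Prefer the cpu the thread last ran on over the cpu that
 * happens to perform the wakeup. */
void thread_ready_sketch(thread_t *thread)
{
        cpu_t *cpu = thread->cpu;
        if (cpu == NULL)
                cpu = CPU;  /* never ran before: use the local cpu */

        /*
         * ... enqueue the thread on cpu's run queue instead of
         * CPU's, so a condvar broadcast does not pile all the
         * woken threads on the waking cpu ...
         */
}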

> but worsens the wakeup latency for most
> threads.

Hmm, I am not sure I see why. Would not the thread just
woken up still have to wait for the current thread's time
slice to elapse -- be it on the local cpu or on another
cpu?

> Second, you started to compensate for preemptions that could
> not be realized when a thread temporarily disabled preemption.
> Can you elaborate on what was your motivation for these changes?

I assume you are referring to my failed attempt (see
r1551 [2]) at preempting a thread in preemption_enable()
if its time slice ran out while it had preemption disabled.

Firstly, I thought it was the right thing to do. If a
thread's time slice runs out, it should be rescheduled
as soon as possible. Moreover, if a thread runs mostly
with preemption disabled (e.g. like my test threads ;-))
it may noticeably prolong its time slice. We are also
missing preemption opportunities every time the kernel
is holding a spinlock.

Secondly, I would like to be able to preempt the current
thread if a higher priority thread (i.e. lower
thread->priority) becomes ready. If disabled preemption
were all it took to ignore such a request, priority-based
preemption would not make much sense in the first place.
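
For reference, the gist of that attempt looks roughly like
this (a sketch only; the need_resched flag is hypothetical
and the rest of the names follow HelenOS conventions, so
treat it as pseudocode of r1551 rather than the actual
diff):

void preemption_enable_sketch(void)
{
        ASSERT(PREEMPTION_DISABLED);
        THE->preemption_disabled--;

        /* If the time slice ran out (or a higher priority thread
         * became ready) while preemption was disabled, give up
         * the cpu as soon as it is re-enabled. */
        if ((THE->preemption_disabled == 0) && THREAD && THREAD->need_resched) {
                THREAD->need_resched = false;  /* hypothetical flag */
                scheduler();
        }
}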

So why am I concerned with latency? The RCU algorithm
relies on the scheduler to actually run it. If RCU
callbacks arrive at a low rate it does not really matter
what the scheduler does. However, what should we do if
there is a sudden burst of RCU callbacks or, even worse,
if the callbacks arrive at a high and steady rate?
RCU callbacks typically free memory, so if the scheduler
does not run the RCU threads, the memory is not freed
in a timely fashion and the system may run out of memory.

Now, RCU is not supposed to be receiving callbacks at
a high rate, and if it is, RCU is being used incorrectly.
However, I would still like to handle this situation
gracefully.

One option is not to do anything about it and let the
system exhaust its memory. Eventually, callbacks will
stop arriving at such a high rate because there simply
won't be any more memory to free (all memory will be pending
a free() in the callbacks). As a result, even if the
scheduler does not run the callback processing threads
(aka reclaimers) very often or predictably, they will
eventually process the callbacks (until enough memory
is freed for the rate to increase again).

The other option is to start processing the callbacks
aggressively and try to keep up with the high arrival
rate even at the expense of slowing down the system
(instead of grinding to a halt as in the previous
option). In order to do this, we need:
1) the detector thread to actually run and to detect
and announce the end of a grace period.
2) the reclaimer threads to be woken up quickly after
the grace period ends/is announced.
3) reclaimer threads to be given cpu time proportionate
to the number of callbacks they must process.
4) threads not to be allowed to queue too many callbacks
in one time slice.

(1) can be tackled by letting higher priority threads
(lower thread->priority) immediately preempt lower
priority running threads and by fixing the detector's
priority so that it runs exclusively if it has work to do
(that is ok since it sleeps almost exclusively).

(2) is another example where a short wakeup latency comes
in handy.

(3) could perhaps be solved by fixing the reclaimers'
priority at 0 (a sketch follows below). It might not be
enough if there are many threads producing callbacks on
the cpu, or if a thread produces callbacks that are more
work to process than to produce.

As for (4), solving (2) together with (3) may prove it
irrelevant.
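
Fixing a thread's priority, as suggested for (1) and (3),
might look something like this (a sketch; the
fixed_priority field is hypothetical, HelenOS's scheduler
normally recomputes thread->priority every time slice):

static void thread_fix_priority_sketch(thread_t *thread)
{
        irq_spinlock_lock(&thread->lock, true);
        thread->priority = 0;           /* 0 = highest priority */
        thread->fixed_priority = true;  /* hypothetical: tells the
                                         * scheduler to skip priority
                                         * aging for this thread */
        irq_spinlock_unlock(&thread->lock, true);
}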

I did a bit of testing. (4) is not an issue (we queue
at most ~10k callbacks per time slice). I have not
tested for (1). (2) is difficult to gauge since it
depends on (1).

(3) is hurting the most. If I fix the reclaimers' priority
at 0, they are still overwhelmed by arriving callbacks
if it takes less time to post the callbacks than to
process them. If I execute callbacks with preemption
disabled, overall throughput (subjectively) decreases
a bit, but there are at most 30k callbacks processed
in one go at any time. That is a major improvement
considering that with preemption enabled some reclaimers
are unlucky enough to gather 500k callbacks in a single
batch, i.e. half of all callbacks in the test! 500k
callbacks translates to a lower estimate of 16 B * 5 * 10^5
== 8 MB of outstanding memory per CPU and growing.

Hmm, I did not initially want to have preemption disabled
while executing callbacks, but it may be an easy way
to protect against exhausting memory without having
to tinker with the scheduler.
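
Roughly like this (a sketch; the rcu_item_t layout with
func and next is an assumption about the callback
representation, the point is only that the whole batch
runs with preemption disabled):

static void exec_batch_sketch(rcu_item_t *batch)
{
        preemption_disable();

        /* The reclaimer cannot be descheduled mid-batch, so a
         * batch is bounded by what arrives during one pass. */
        while (batch != NULL) {
                rcu_item_t *next = batch->next;
                batch->func(batch);  /* typically frees memory */
                batch = next;
        }

        preemption_enable();
}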

>
> Shouldn't waitq_complete_wakeup() be an integral part of waitq_sleep()?
> Using it separately means that the caller needs to know how the
> waitq was allocated. I think that we've had cases when the waitq was part of
> a structure which itself was sometimes allocated on the stack and
> sometimes dynamically. Good catch, btw.

That is a good idea.
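
Something along these lines, I suppose (a sketch; it
assumes waitq_complete_wakeup() waits until a concurrent
waitq_wakeup() has stopped touching the waitq):

int waitq_sleep_sketch(waitq_t *wq)
{
        int rc = waitq_sleep_timeout(wq, SYNCH_NO_TIMEOUT, SYNCH_FLAGS_NONE);

        /* Make it safe to deallocate wq regardless of whether it
         * lives on the stack or on the heap. */
        waitq_complete_wakeup(wq);

        return rc;
}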

Adam

[1] Intel's Vol. 3: System Programming Guide, Chapter 10:
        Advanced Programmable Interrupt Controller
[2] http://bazaar.launchpad.net/~adam-hraska+lp/helenos/rcu/revision/1551
