we need to get rid of that null_legacy_pic. I think we're ready to see how
much that breaks. I turned it off because I was so confused, but we're
working our way to a patch free kernel and this is a good next step.

On Wed, Jun 22, 2016 at 11:20 AM Barret Rhoden <[email protected]> wrote:

> hi -
>
> so i looked into the mysteries around why our IO-APIC and virtio-cons
> works the way it does.  i did this mostly through printks and looking
> through the codebase.  with ssh and scp, the turn-around for a
> recompiled and rebooted linux vm was about 20 seconds.
>
> here goes:
>
> ------------------------------
>
> without the ISORs, linux thinks the rest of the IOAPIC pins are
> disconnected:
>
> (ISORs set for 1-4, for bus 0)
> source -> gsi
>      0 -> 2
>      1 -> 1
>      3 -> 3
>      4 -> 4
>
> sorted by gsi
>      1 -> 1
>      0 -> 2
>      3 -> 3
>      4 -> 4
>
>
> setup_IO_APIC_irqs():
>
> %ENABLING IO-APIC IRQs
> %init IO_APIC IRQs
> % apic 0 pin 0 not connected
>
> 1-4 here!  these must be IOAPIC output, not source irqs
>
> % apic 0 pin 5 not connected
> % apic 0 pin 6 not connected
> % apic 0 pin 7 not connected
> % apic 0 pin 8 not connected
> % apic 0 pin 9 not connected
> % apic 0 pin 10 not connected
> % apic 0 pin 11 not connected
> % apic 0 pin 12 not connected
> % apic 0 pin 13 not connected
> % apic 0 pin 14 not connected
> % apic 0 pin 15 not connected
> % apic 0 pin 16 not connected
> % apic 0 pin 17 not connected
> % apic 0 pin 18 not connected
> % apic 0 pin 19 not connected
> % apic 0 pin 20 not connected
> % apic 0 pin 21 not connected
> % apic 0 pin 22 not connected
> % apic 0 pin 23 not connected
>
> questions:
> -----------------------
> what specifically did linux see to think that?
>
>         find_irq_entry()
>
> is it something we're doing wrong?
>
> we didn't tell it about any PCI devices, or anything that might be
> connected there.
>
> we didn't set the MADT PC_COMPAT flag (tried it, had no effect)
>
> find_irq_entry in this case looks for type mp_INT.  that gets set in a
> few places.  we don't do mp tables, right?  so let's look at the ACPI
> places.  one of them:
>
>         mp_override_legacy_irq()
>
> happens 4 times, once for each ISOR.  sets mp_INT, ISA_BUS, and a few
> other things.
>
> the other place is
>
>         mp_config_acpi_legacy_irqs()
>
> which we call, but nr_legacy_irqs == 0, so we do nothing!
>
> why is it 0?  because our hacks hard-coded it that way:
>
> struct legacy_pic *legacy_pic = &null_legacy_pic; //&default_legacy_pic;
>
> k, so no legacy pic means no default routes for any of the ioapic pins.
>
> another place mp_INT is set is mpparse.c, but we don't call that
> location.  i didn't see any other places that set mp_INT.
>
> so it looks like the IOAPIC isn't connected to anything.  no legacy
> (thanks to us), thus no default 1:1 mappings.  then the only pins that
> are connected are what we explicitly told it via the ISORs.
>
> k, that answers that question.  we need to wire the IOAPIC somehow.
> now we told it that 4 pins are connected.  in our current
> setup, that is just:
>
> source -> gsi
>      0 -> 2
>      1 -> 1
>      3 -> 3
>      4 -> 4
>
> also, the variable 'pin' refers to the IOAPICs pins since we had pins
> 1-4 connected.  if pin referred to source_irq, then pin 2 would not be
> connected and pin 0 would be.  that wouldn't really make sense anyways,
> since the IOAPIC doesn't know about overrides.  it just knows its input
> pins and output vectors.  ACPI tells us how it is all wired.
>
>
> -------------------------
> why is it that IRQ 26 gets mapped to slot 3 (the fourth slot)?  given
> what we know now, the IOAPIC only has its pins 1-4 set.
>
> each of them gets pin_2_irq() (ret = the new IRQ):
>
> %XME pin_2_irq idx 1, pin 1, gsi 1, ret 24
> %XME pin_2_irq idx 0, pin 2, gsi 2, ret 25
> %XME pin_2_irq idx 2, pin 3, gsi 3, ret 26
> %XME pin_2_irq idx 3, pin 4, gsi 4, ret 27
>
> at first, i thought idx was the bus's source_irq from the ISORs.  but
> that's 0-3, not {0, 1, 3, 4}.  what the heck?
>
> idx comes from find_irq_entry().  let's look closer at the setup:
>
> %ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> %Int: type 0, pol 3, trig 1, bus 00, IRQ 00, APIC ID 0, APIC INT 02
>                   ^ this might be due to special casing of the timer
>
> %ACPI: INT_SRC_OVR (bus 0 bus_irq 1 global_irq 1 dfl dfl)
> %Int: type 0, pol 0, trig 0, bus 00, IRQ 01, APIC ID 0, APIC INT 01
>
> %ACPI: INT_SRC_OVR (bus 0 bus_irq 3 global_irq 3 dfl dfl)
> %Int: type 0, pol 0, trig 0, bus 00, IRQ 03, APIC ID 0, APIC INT 03
>
> %ACPI: INT_SRC_OVR (bus 0 bus_irq 4 global_irq 4 dfl dfl)
> %Int: type 0, pol 0, trig 0, bus 00, IRQ 04, APIC ID 0, APIC INT 04
>
> 'idx' is just the index in the mp_irqs[] array.  nothing to do with
> the actual overrides, which are fine.
>
> oh, and pin == GSI in pin_2_irq() because this IOAPIC's gsi base = 0.
>
> anyway, ret is the desc we get back, e.g. when virtio-cons does a
> request_irq.
>
> ret came from mp_map_pin_to_irq()
>         alloc_irq_from_domain()
>                 __irq_domain_alloc_irqs()
>                         virq = irq_domain_alloc_descs()
>                                 irq_alloc_descs_from, various macros:
>                                 __irq_alloc_descs()
>
> in there, we have this block:
>
>         /*
>          * For interrupts which are freely allocated the
>          * architecture can force a lower bound to the @from
>          * argument. x86 uses this to exclude the GSI space.
>          */
>         from = arch_dynirq_lower_bound(from);
>
> that ends up setting the lowest IRQ to 24.  from there, we just
> allocate whichever irq is free, starting at 24.  24, then 25, then 26,
> etc.
>
> that 24 came from:
>         return ioapic_initialized ? ioapic_dynirq_base : gsi_top;
>
> and it was gsi_top, which is set in mp_register_ioapic(), where it
> looks like it is set to the IOAPIC's pins + GSI base + 1, so basically
> the next IRQ available.
>
> also in mp_map_pin_to_irq(), it looks like there is a case for
> legacy/non-PCI IRQs, where we call alloc_isa_irq_from_domain() instead
> of alloc_irq_from_domain().  (this is based on the nr_legacy_irqs,
> which we set to 0 by our hack).
>
>
> so anyway, here's what happens (probably):
> -----------------------
> 1) we tell linux it has an ioapic, but it was hacked to not think it
> was a legacy ioapic, thus it has no default pins.
>
> 2) we give it ISORs, so it sets up *those* pins on the IOAPIC.  that's
> all the devices that linux knows about.
>
> 3) when the IOAPIC inits, it needs Linux IRQs for each of the IOAPIC
> pins.  it calls mp_map_pin_to_irq.
>
> 4) since we hacked out the legacy support, those IOAPIC pins aren't
> given their "usual" IRQ.  they are now given IRQs starting at 24.
> linux thinks those legacy GSIs are busy, and won't hand them out to
> non-legacy interrupts
>
> 5) when linux tries to set the vector for these IRQs, it points back to
> its original IOAPIC pin.
>
> for example, IRQ 26 (virtio-cons) is IO-APIC 0, index 2
>
> %XME pin_2_irq idx 2, pin 3, gsi 3, ret 26
>
> recall that this was the third override (number 2), not the
> source_irq.  so this override was source 3 -> pin 3.  not that source
> matters much, you want to mess with the IOAPIC pin 3 regardless.  the
> ISOR just said that "what you thought was at source is wired to IOAPIC
> pin destination".
>
> so when linux tries to set the vector, it goes to the IOREDTBL entry
> for pin3.  AKA, 0x16 in the IOAPIC space.
>
> the only reason virtio-cons got 26 was that we told it to use it.  it
> wasn't a global system irq.  it happened to be the number that linux
> assigned internally for the ISOR IRQs we gave it.  basically, we tell
> linux it has an IOAPIC pin set by having an override with global_irq =
> the pin for that IOAPIC.  (e.g., global_irq = 0 means pin 0 is set for
> IOAPIC 0).  then we lined up what we told virtio-cons so that it would
> be tricked into using the IOAPIC for pin 0.  for example, this patch
> works for me: (just one override for 0, and subtract 24, not 23).
>
> diff --git a/tests/vmm/vmrunkernel.c b/tests/vmm/vmrunkernel.c
> index d00c441a7ae6..c043e8defdd4 100644
> --- a/tests/vmm/vmrunkernel.c
> +++ b/tests/vmm/vmrunkernel.c
> @@ -116,16 +116,7 @@ struct acpi_madt_interrupt_override isor[] = {
>          */
>         {.header = {.type = ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
>                     .length = sizeof(struct acpi_madt_interrupt_override)},
> -        .bus = 0, .source_irq = 0, .global_irq = 2, .inti_flags = 0},
> -       {.header = {.type = ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
> -                   .length = sizeof(struct acpi_madt_interrupt_override)},
> -        .bus = 0, .source_irq = 1, .global_irq = 1, .inti_flags = 0},
> -       {.header = {.type = ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
> -                   .length = sizeof(struct acpi_madt_interrupt_override)},
> -        .bus = 0, .source_irq = 3, .global_irq = 3, .inti_flags = 0},
> -       {.header = {.type = ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
> -                   .length = sizeof(struct acpi_madt_interrupt_override)},
> -        .bus = 0, .source_irq = 4, .global_irq = 4, .inti_flags = 0},
> +        .bus = 0, .source_irq = 0, .global_irq = 0, .inti_flags = 0},
>  };
>
>
> @@ -195,7 +186,7 @@ static struct virtio_mmio_dev cons_mmio_dev = {
>         .poke_guest = virtio_poke_guest,
>         /* At the moment irq numbers cannot be below 24; this is a problem
> with
>          * the IOAPIC and Interrupt Source Override Structure. */
> -       .irq = 26,
> +       .irq = 24,
>  };
>
>  static struct virtio_console_config cons_cfg;
> diff --git a/user/vmm/ioapic.c b/user/vmm/ioapic.c
> index 3f6d58caad6a..c515d067bf7f 100644
> --- a/user/vmm/ioapic.c
> +++ b/user/vmm/ioapic.c
> @@ -110,7 +110,7 @@ static void ioapic_write(struct guest_thread
> *vm_thread, int ix,
>                 /* The first IRQ register starts at 0x10, and there are
> two 32-bit
>                  * registers for each IRQ. The first 8 bits of the value
> assigned to
>                  * 'reg' is the interrupt vector. */
> -               irqreg = (vm->virtio_mmio_devices[i]->irq -
> IOAPIC_LAST_IRQ) * 2 + 0x10;
> +               irqreg = (vm->virtio_mmio_devices[i]->irq - 24) * 2 + 0x10;
>                 if (reg == irqreg && (value & 0xff) != 0) {
>                         vm->virtio_mmio_devices[i]->vec = value & 0xff;
>                         DPRINTF("irq vector for irq number %d is: %lx\n",
>
>
> this seems super screwy, and will break if / when we change that
> null_legacy_pic business.
>
> say for instance that we actually supported the legacy pic.  what is
> the appropriate way to express IRQ vectors for virtio?  my guess is
> virtio-pci would use the usual PCI irq range (16-23), and since we are
> using legacy irqs, that would get GSI == IRQ (for GSI < 24).  unlike
> what we have now, where GSI == 0 and IRQ == 26.
>
> what does virtio-mmio do for interrupt/vectors?
>
> http://lxr.free-electrons.com/source/Documentation/virtual/virtio-spec.txt?v=3.8
> (grep mmio) has some stuff, but seems to default to PCI for things not
> listed in Appendix X.  perhaps that includes IRQ/vector setup.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Akaros" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.

Reply via email to