On Monday 02 June 2008 19:27:31 Anthony Liguori wrote:
> Amit Shah wrote:

> > We can assign a device from the host machine to a guest. The original
> > code comes from Neocleus.
> >
> > A new command-line option, -pcidevice is added.
> > For example, to invoke it for an Ethernet device sitting at
> > PCI bus:dev.fn 04:08.0 with host IRQ 18, use this:
> >
> > -pcidevice Ethernet/04:08.0-18
>
> Why is it necessary to specify the device type and the IRQ?  Shouldn't
> that all be discoverable?

The first string isn't the device type, just some string passed to identify it 
in the debug logs.

The IRQ is needed for the irqhook module. Currently, we use it when using the 
in-kernel irqchip as well, but that should be replaced soon.

> > The host ethernet driver is to be removed before doing the passthrough.
> >
> > If kvm uses the in-kernel irqchip, interrupts are routed to
> > the guest via the kvm module (accompanied kernel changes are necessary).
> > If -no-kvm-irqchip is used, the 'irqhook' module, also included here,
> > is to be used.
>
> We should drop the irqhook module entirely.  It seems unlikely that it
> will be received well on LKML.

It's not planned for submission. It's needed by a few for some cases and it's 
going to be around only as a fallback.

> > +#ifdef KVM_CAP_PCI_PASSTHROUGH
> > +/*!
> > + * \brief Notifies host kernel about assigning a PCI device to the guest
> > + *
> > + * Used for PCI passthrough, this function notifies the host kernel
> > + * about the assigning of the physical PCI device and the guest PCI
> > + * parameters. Also called when the guest updates the irq number for a
> > + * passthrough device
> > + *
> > + * \param kvm Pointer to the current kvm_context
> > + * \param pci_pt_dev Parameters like irq, PCI bus, devfn number, etc
> > + */
> > +int kvm_assign_pci_pt_device(kvm_context_t kvm,
> > +                        struct kvm_pci_passthrough_dev *pci_pt_dev);
>
> Is such a high level interface really required?  We need to forward one
> (or potentially more) IRQs to the guest and we need to setup VT-d.
> Shouldn't these be separate operations?

Essentially, we're just notifying the host kernel about a guest device being 
updated. For the first time, it's also about telling the host of a new device 
being registered. Subsequent times (for the same device), it'll just be 
updates about guest IRQs being changed.

Logically, these are two separate things as you mention, but in effect, it's 
the same thing, passing around the same structure.

> >  static void apic_get_delivery_bitmask(uint32_t *deliver_bitmask,
> > @@ -1144,6 +1147,7 @@ static void ioapic_mem_writel(void *opaque,
> > target_phys_addr_t addr, uint32_t va } else {
> >                          s->ioredtbl[index] &= ~0xffffffffULL;
> >                          s->ioredtbl[index] |= val;
> > +                        pt_set_vector(index, (val << 24) >> 24);
>
> Is this an attempt to sign extend a 24-bit value?  It's very odd looking.

You're right; most of this code is from the original developers and I've not 
touched it yet. I've just put the code that needs kvm into kvm_enabled() 
blocks and that's only limited to the part that makes this code work with 
in-kernel irqchip enabled.

> > @@ -997,6 +998,9 @@ static void pc_init1(ram_addr_t ram_size, int
> > vga_ram_size, }
> >      }
> >
> > +    /* Initialize pass-through */
> > +    pt_init(pci_bus);
> > +
>
> Each pass-through device should be treated as a special PCI device.
> This implies that we should be looping in pc_init1() and creating each
> pass through device based on user-provided parameters.  A side effect of
> this is that it will make pci_add much easier to get working which is
> something that's missing from this series.

I've not seen the pci_add codepath; it will be nice if we get that supported.

> > +typedef __u64 resource_size_t;
> > +#define __deprecated
> > +#include <linux/ioport.h>
>
> Neither of these look very good.  Why are we including ioport.h?

ioport is needed for some definitions -- IORESOURCE_MEM, IORESOURCE_PREFETCH, 
etc. The __deprecated is there to make ioport.h compile succeed.

However, there should be a qemu equivalent to the #defines we need.

> > +#include "pci-passthrough.h"
> > +#include "irq.h"
> > +
> > +#include "qemu-kvm.h"
> > +#include <linux/kvm_para.h>
> > +extern kvm_context_t kvm_context;
> > +extern FILE *logfile;
>
> Please avoid open usage of kvm_context.

We already include qemu-kvm.h and this extern isn't needed at all.

> > +#define pt_ioport_read(suffix)                                             
> > \
> > +uint32_t pt_ioport_read##suffix(void *opaque, uint32_t addr)               
> > \
> > +{                                                                  \
> > +   pt_region_t *r_access = (pt_region_t *)opaque;                  \
> > +   uint32_t r_pio = (addr - r_access->e_physbase)                  \
> > +                   + (unsigned long)r_access->r_virtbase;          \
> > +   uint32_t value = in##suffix(r_pio);                             \
>
> FYI, you can get inl et al from sys/io.h


> > +static void pt_iomem_map(PCIDevice * d, int region_num,
> > +                    uint32_t e_phys, uint32_t e_size, int type)
> > +{
> > +   pt_dev_t *r_dev = (pt_dev_t *) d;
>
> Please try to make this look more QEMU like by having
> PassthroughDeviceState or something like that.

This perhaps is going to be more difficult than the rest.

> > +   pt_dev_t *r_dev = (pt_dev_t *) pci_dev;
> > +   int i;
> > +   uint32_t ((*rf[])(void *, uint32_t)) =  { pt_ioport_readb,
> > +                                             pt_ioport_readw,
> > +                                             pt_ioport_readl
> > +                                           };
> > +   void ((*wf[])(void *, uint32_t, uint32_t)) = { pt_ioport_writeb,
> > +                                                  pt_ioport_writew,
> > +                                                  pt_ioport_writel
> > +                                                };
>
> Please make this a static array like the memory operations or at least
> introduce a proper typedef.

> > +   DEBUG("NON BAR (%x.%x): address=%04x val=0x%08x len=%d\n",
> > +         ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), (uint16_t) address,
> > +         val, len);
> > +   fd = ((pt_dev_t *)d)->real_device.config_fd;
> > +   lseek(fd, address, SEEK_SET);
> > +   write(fd, &val, len);
>
> Is it possible to mmap() the config space?  This would look better.
> Right now, you don't deal with EINTR which is a Bad Thing.
>
> > +   /* vga specific, remove later */
> > +   if (address == 0xFC)
> > +           goto do_log;
>
> Can you explain this?

No; maybe some artifact left over when they tried to get a graphics device 
assigned.

> > +           pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
> > +           pci_dev->v_addrs[i].r_virtbase = (void 
> > *)(long)cur_region->base_addr;
> > +           pci_dev->v_addrs[i].memory_index = 0;   // not relevant for 
> > port io
>
> I see no reason why we couldn't allow PIO pass-through too.  Marcelo's
> already added a userspace interface for it.

> > +   sprintf(dir, "/sys/bus/pci/devices/0000:%02x:%02x.%x/",
> > +           r_bus, r_dev, r_func);
> > +   strcpy(name, dir);
> > +   strcat(name, "config");
>
> snprintf please, also remove all uses of strcpy or stract.
>
> > +   if ((fd = open(name, O_RDWR)) == -1) {
> > +           fprintf(logfile, "%s: %m\n", name);
> > +           return 1;
> > +   }
> > +   dev->config_fd = fd;
> > +   read(fd, pci_dev->dev.config, sizeof pci_dev->dev.config);
>
> Can't assume read() will always succeed.  Especially in QEMU.  It's very
> likely to fail with EINTR.
>
> > +   strcpy(name, dir);
> > +   strcat(name, "resource");
> > +   if ((f = fopen(name, "r")) == NULL) {
> > +           fprintf(logfile, "%s: %m\n", name);
> > +           return 1;
> > +   }
>
> Please log to stderr for this sort of error.


> > +#ifdef KVM_CAP_PCI_PASSTHROUGH
> > +   if (kvm_enabled()) {
> > +           struct kvm_pci_passthrough_dev pci_pt_dev;
> > +
> > +           pci_pt_dev.guest.busnr  = pci_bus_num(e_bus);
> > +           pci_pt_dev.guest.devfn  = PCI_DEVFN(e_device, r_func);
> > +           pci_pt_dev.host.busnr   = r_bus;
> > +           pci_pt_dev.host.devfn   = PCI_DEVFN(r_dev, r_func);
> > +
> > +           if (qemu_kvm_irqchip_in_kernel()) {
> > +                   /* We'll set the value of the guest irq as and when
> > +                    * the piix config gets updated. See pci_pt_update_irq
> > +                    */
> > +                   pci_pt_dev.guest.irq = 0;
> > +                   pci_pt_dev.host.irq = machine_irq;
> > +           }
> > +           rc = kvm_assign_pci_pt_device(kvm_context, &pci_pt_dev);
> > +           if (rc < 0) {
> > +                   fprintf(stderr, "Could not notify kernel about "
> > +                           "passthrough device\n");
> > +                   perror("pt-ioctl");
> > +                   return NULL;
> > +           }
> > +   }
> > +#endif
>
> This whole file is useless without KVM_CAP_PCI_PASSTHROUGH so you should
> make it only get compiled into QEMU if that capability is present.  Take
> a look at the in-kernel PIT for an example of this in QEMU.

We don't need this functionality of communicating with the host in case of 1-1 
mapping of the guest in the host with the irqhook module, for example. So 
this will work without the CAP.

> > +int nptdevs;
> > +extern int piix_get_irq(int);
> > +int pt_init(PCIBus * bus)
> > +{
> > +   pt_dev_t *dev = NULL;
> > +   int i, ret = 0;
> > +
> > +   iopl(3);
>
> An advantage of using PIO passthrough is we could avoid doing iopl(3).
> This gives us more privileges than we actually need.  In fact, doing PIO
> means we can't even write to the PIO ports in userspace which seems like
> a nice thing to me from a security perspective.


> > --- /dev/null
> > +++ b/qemu/hw/pci-passthrough.h

> > +typedef __u8 u8;
> > +typedef __u16 u16;
> > +typedef __u32 u32;
> > +typedef __u64 u64;
>
> This should go away.


> > +/* Initialization functions */
> > +int pt_init(PCIBus * bus);
> > +void add_pci_passthrough_device(const char *arg);
> > +void pt_set_vector(int irq, int vector);
> > +void pt_ack_mirq(int vector);
>
> Is there a reason these aren't static?

Yes; these are called from either vl.c or apic.c

> > +#define logfile stderr
>
> Oh this is a very bad thing :-)

At least the previous comment got answered :-)

Yes, everything needs cleanups.

> > diff --git a/tools/pci_barsize.c b/tools/pci_barsize.c
>
> While these may be useful for your debugging, I don't know if we want to
> include them in KVM by default.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to