On Thu, Jan 07, 2010 at 04:26:24PM -0800, Davide Libenzi wrote:
> On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
>
> > Sure, I was trying to be as brief as possible, here's a detailed summary.
> >
> > Description of the system (MSI emulation in KVM):
> >
> > KVM supports an ioctl to assign/deassign an eventfd file to an interrupt
> > message in the guest OS. When this eventfd is signalled, the interrupt
> > message is sent. This assignment is done from the qemu system emulator.
> >
> > The eventfd is signalled from device emulation in another thread in
> > userspace or from the kernel, which talks with the guest OS through
> > another eventfd and shared memory (an out-of-process option was
> > discussed but has not been implemented yet).
> >
> > Note: it's okay to delay messages from a correctness point of view, but
> > generally this is a latency-sensitive path. If multiple identical messages
> > are requested, it's okay to send a single last message, but missing a
> > message altogether causes deadlocks. Sending a message when none were
> > requested might in theory cause crashes; in practice it causes
> > performance degradation.
> >
> > Another KVM feature is interrupt masking: the guest OS requests that we
> > stop sending some interrupt message, possibly modifies the mapping,
> > and then re-enables this message. This needs to be done without
> > involving the device, which might keep requesting events:
> > while masked, the message is marked "pending", and the guest might test
> > the pending status.
> >
> > We can implement masking in the system emulator in userspace, by using
> > the assign/deassign ioctls: when a message is masked, we simply deassign
> > all its eventfds, and when it is unmasked, we assign them back.
> >
> > Here's some code to illustrate how this all works: assign/deassign code
> > in kernel looks like the following:
> >
> >
> > this is called to unmask interrupt
> >
> > static int
> > kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> > {
> > 	struct _irqfd *irqfd, *tmp;
> > 	struct file *file = NULL;
> > 	struct eventfd_ctx *eventfd = NULL;
> > 	int ret;
> > 	unsigned int events;
> >
> > 	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> >
> > 	...
> >
> > 	file = eventfd_fget(fd);
> > 	if (IS_ERR(file)) {
> > 		ret = PTR_ERR(file);
> > 		goto fail;
> > 	}
> >
> > 	eventfd = eventfd_ctx_fileget(file);
> > 	if (IS_ERR(eventfd)) {
> > 		ret = PTR_ERR(eventfd);
> > 		goto fail;
> > 	}
> >
> > 	irqfd->eventfd = eventfd;
> >
> > 	/*
> > 	 * Install our own custom wake-up handling so we are notified via
> > 	 * a callback whenever someone signals the underlying eventfd
> > 	 */
> > 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> > 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> >
> > 	spin_lock_irq(&kvm->irqfds.lock);
> >
> > 	events = file->f_op->poll(file, &irqfd->pt);
> >
> > 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> > 	spin_unlock_irq(&kvm->irqfds.lock);
> >
> > A.
> > 	/*
> > 	 * Check if there was an event already pending on the eventfd
> > 	 * before we registered, and trigger it as if we didn't miss it.
> > 	 */
> > 	if (events & POLLIN)
> > 		schedule_work(&irqfd->inject);
> >
> > 	/*
> > 	 * do not drop the file until the irqfd is fully initialized, otherwise
> > 	 * we might race against the POLLHUP
> > 	 */
> > 	fput(file);
> >
> > 	return 0;
> >
> > fail:
> > 	...
> > }
>
> What if you do (under proper irqfd locking) something like:
>
> eventfd_ctx_read(ctx, 1, &cnt);
> if (irqfd->cnt != cnt) {
> 	irqfd->cnt = cnt;
> 	schedule_work(&irqfd->inject);
> }
>
>
>
>
> > And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
> > so this is not under the spinlock):
> >
> > /*
> >  * Synchronize with the wait-queue and unhook ourselves to prevent
> >  * further events.
> >  */
> > B.
> > remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >
> > ....
> >
> > /*
> > * It is now safe to release the object's resources
> > */
> > eventfd_ctx_put(irqfd->eventfd);
> > kfree(irqfd);
>
> And:
>
> eventfd_ctx_read(ctx, 1, &irqfd->cnt);
->
> remove_wait_queue(irqfd->wqh, &irqfd->wait);
>
>
>
>
> - Davide
Yes, this is exactly what I wanted to do. So, here's the issue: if an
event is signalled at point ->, after eventfd_ctx_read but before
remove_wait_queue, then we inject an interrupt but the counter is left
non-zero, and when we unmask, we inject another, spurious interrupt.
This is why I wanted eventfd_ctx_read not to take the wait queue head
lock, so that I could do:
spin_lock_irqsave(&ctx->wqh.lock, flags);
eventfd_ctx_read(ctx, 1, &irqfd->cnt);
__remove_wait_queue(irqfd->wqh, &irqfd->wait);
spin_unlock_irqrestore(&ctx->wqh.lock, flags);
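
To make the window concrete, here is a rough userspace model of the two
orderings. It is only a sketch, not kernel code: struct model_ctx and every
model_* function are made up for illustration, a plain flag stands in for
ctx->wqh.lock, and the signaller is run by hand at the point marked ->
instead of from a second thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an eventfd context with one irqfd waiter. */
struct model_ctx {
	bool locked;        /* stands in for ctx->wqh.lock */
	bool hooked;        /* waiter is on the wait queue */
	uint64_t count;     /* eventfd counter */
	unsigned injected;  /* interrupts injected so far */
};

static void model_lock(struct model_ctx *c)
{
	assert(!c->locked);
	c->locked = true;
}

static void model_unlock(struct model_ctx *c)
{
	assert(c->locked);
	c->locked = false;
}

/*
 * Signal the eventfd: bump the counter and, if the waiter is still
 * hooked, inject. Returns false when the lock is held, meaning the
 * signaller would have to wait.
 */
static bool model_try_signal(struct model_ctx *c)
{
	if (c->locked)
		return false;
	model_lock(c);
	c->count++;
	if (c->hooked)
		c->injected++;
	model_unlock(c);
	return true;
}

/*
 * Mask, let exactly one signal arrive at "->", then unmask.
 * Returns how many interrupts were injected for that one signal.
 */
unsigned model_one_signal(bool atomic)
{
	struct model_ctx c = { .hooked = true };
	bool delivered;

	model_lock(&c);
	c.count = 0;                        /* like eventfd_ctx_read() */
	if (!atomic)
		model_unlock(&c);           /* lock dropped at "->" */
	delivered = model_try_signal(&c);   /* signaller shows up at "->" */
	if (!atomic)
		model_lock(&c);
	c.hooked = false;                   /* like __remove_wait_queue() */
	model_unlock(&c);

	if (!delivered)                     /* blocked signaller runs now */
		model_try_signal(&c);

	/* Unmask: re-hook and replay anything pending (POLLIN check at A). */
	model_lock(&c);
	c.hooked = true;
	if (c.count)
		c.injected++;
	model_unlock(&c);

	return c.injected;
}
```

Under these assumptions model_one_signal(false) counts two injections for
a single signal (the second is the spurious one at unmask), while
model_one_signal(true) counts one, because the signal can only land
before the drain or after the unhook.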
--
MST