On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
> Sure, I was trying to be as brief as possible, here's a detailed summary.
>
> Description of the system (MSI emulation in KVM):
>
> KVM supports an ioctl to assign/deassign an eventfd file to an interrupt
> message in the guest OS. When this eventfd is signalled, the interrupt
> message is sent. This assignment is done from the qemu system emulator.
>
> The eventfd is signalled from device emulation, either in another
> userspace thread or in the kernel, which talks to the guest OS through
> another eventfd and shared memory (an out-of-process variant was
> discussed but has not been implemented yet).
>
> Note: delaying messages is okay from a correctness point of view, but
> this is generally a latency-sensitive path. If multiple identical
> messages are requested, it is okay to send only the last one, but
> missing a message altogether causes deadlocks. Sending a message when
> none was requested might in theory cause crashes; in practice it causes
> performance degradation.
>
> Another KVM feature is interrupt masking: the guest OS requests that we
> stop sending some interrupt message, possibly modifies the mapping, and
> then re-enables the message. This needs to be done without involving
> the device, which might keep requesting events: while masked, the
> message is marked "pending", and the guest might test the pending
> status.
>
> We can implement masking in the system emulator in userspace, using the
> assign/deassign ioctls: when a message is masked, we simply deassign
> all eventfds, and when it is unmasked, we assign them back.
>
> Here's some code to illustrate how this all works: the assign/deassign
> code in the kernel looks like the following.
>
>
> This is called to unmask an interrupt:
>
> static int
> kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
> {
> 	struct _irqfd *irqfd, *tmp;
> 	struct file *file = NULL;
> 	struct eventfd_ctx *eventfd = NULL;
> 	int ret;
> 	unsigned int events;
>
> 	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
>
> 	...
>
> 	file = eventfd_fget(fd);
> 	if (IS_ERR(file)) {
> 		ret = PTR_ERR(file);
> 		goto fail;
> 	}
>
> 	eventfd = eventfd_ctx_fileget(file);
> 	if (IS_ERR(eventfd)) {
> 		ret = PTR_ERR(eventfd);
> 		goto fail;
> 	}
>
> 	irqfd->eventfd = eventfd;
>
> 	/*
> 	 * Install our own custom wake-up handling so we are notified via
> 	 * a callback whenever someone signals the underlying eventfd
> 	 */
> 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
>
> 	spin_lock_irq(&kvm->irqfds.lock);
>
> 	events = file->f_op->poll(file, &irqfd->pt);
>
> 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> 	spin_unlock_irq(&kvm->irqfds.lock);
>
> A.
> 	/*
> 	 * Check if there was an event already pending on the eventfd
> 	 * before we registered, and trigger it as if we didn't miss it.
> 	 */
> 	if (events & POLLIN)
> 		schedule_work(&irqfd->inject);
>
> 	/*
> 	 * do not drop the file until the irqfd is fully initialized, otherwise
> 	 * we might race against the POLLHUP
> 	 */
> 	fput(file);
>
> 	return 0;
>
> fail:
> 	...
> }
>
> This is called to mask an interrupt:
>
> /*
>  * shutdown any irqfd's that match fd+gsi
>  */
> static int
> kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> {
> 	struct _irqfd *irqfd, *tmp;
> 	struct eventfd_ctx *eventfd;
>
> 	eventfd = eventfd_ctx_fdget(fd);
> 	if (IS_ERR(eventfd))
> 		return PTR_ERR(eventfd);
>
> 	spin_lock_irq(&kvm->irqfds.lock);
>
> 	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
> 		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi)
> 			irqfd_deactivate(irqfd);
> 	}
>
> 	spin_unlock_irq(&kvm->irqfds.lock);
> 	eventfd_ctx_put(eventfd);
>
> 	/*
> 	 * Block until we know all outstanding shutdown jobs have completed
> 	 * so that we guarantee there will not be any more interrupts on this
> 	 * gsi once this deassign function returns.
> 	 */
> 	flush_workqueue(irqfd_cleanup_wq);
>
> 	return 0;
> }
>
>
> And deactivation, deep down, does this (from the irqfd_cleanup_wq
> workqueue, so this is not under the spinlock):
>
> 	/*
> 	 * Synchronize with the wait-queue and unhook ourselves to prevent
> 	 * further events.
> 	 */
> B.
> 	remove_wait_queue(irqfd->wqh, &irqfd->wait);
>
> 	....
>
> 	/*
> 	 * It is now safe to release the object's resources
> 	 */
> 	eventfd_ctx_put(irqfd->eventfd);
> 	kfree(irqfd);
>
>
> The problems (really the same bug) in KVM that I am trying to fix:
> 1. Because of A above, if an event was requested while the message was
> masked, we will not miss a message. However, because we never clear
> the counter, we currently get a spurious message each time we unmask.
>
> We should clear the counter either each time we deliver a message,
> or at least when the message is masked.
>
> 2. We currently always return pending status 0 to guests, but for
> strict spec compliance we must do this correctly and report a pending
> message if one was requested while it was masked.
>
> An obvious way to do this is a nonblocking read on the eventfd. Since
> this clears the counter, we would have to write the value back. But
> again, this requires that all messages sent before the mask are
> cleared from the counter, otherwise we will always report a pending
> message.
>
>
> So, how should I clear the counter?
>
> I started with the idea of clearing it in irqfd_wakeup. However, as you
> point out, this would cause another list scan, so it's messy. irqfd
> currently uses a workqueue to work around a KVM locking issue, but that
> issue is fixed, so we will stop doing this soon and cannot rely on it.
>
> Thus I came up with the idea of doing this on unmask: this is a slow
> path and can always be done from workqueue or ioctl context. So I would
> add eventfd_read_ctx in the code at point B:
>
>
> eventfd_read_ctx(irqfd->eventfd, &ucnt);
> ->
> remove_wait_queue(irqfd->wqh, &irqfd->wait);
>
> Now, if the device signals the eventfd at the point marked by ->, the
> message will be sent but the counter will not be cleared, so we will
> get a spurious message when we unmask.
>
> Or if I do it the other way:
> remove_wait_queue(irqfd->wqh, &irqfd->wait);
> ->
> eventfd_read_ctx(irqfd->eventfd, &ucnt);
>
> Now, if the device signals the eventfd at the point marked by ->, the
> message will not be sent but the counter will be cleared, so we will
> lose a message.
>
> There does not appear to be a way to avoid this race unless we both
> remove from the wait queue and clear the counter under the wait queue
> lock, just like eventfd_read currently does.
>
> I hope this clarifies.
Yes, it sure does, thanks.
When you unmask, shouldn't the fd be in a "clear" state anyway?
So, can't you just do eventfd_ctx_read(ctx, 1, &ucnt) on unmask, and
ignore the value (and the POLLIN flags)?
Another solution is to keep track of the counter (using
eventfd_ctx_read() and eventfd_ctx_peek_value()) inside your code, and
decide the status based on its changes.
- Davide
---
fs/eventfd.c | 74 ++++++++++++++++++++++++++++++++++++++----------
include/linux/eventfd.h | 13 ++++++++
2 files changed, 72 insertions(+), 15 deletions(-)
Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c 2010-01-06 12:37:28.000000000 -0800
+++ linux-2.6.mod/fs/eventfd.c 2010-01-06 17:43:09.000000000 -0800
@@ -135,26 +135,56 @@ static unsigned int eventfd_poll(struct
 	return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-			    loff_t *ppos)
+/**
+ * eventfd_ctx_peek_value - Reads the eventfd counter.
+ * @ctx: [in] Pointer to eventfd context.
+ *
+ * NOTE: Must be called with the wait queue lock held. Locking the
+ * eventfd_ctx interface is not actually exposed outside of the
+ * eventfd implementation, but when a poll callback registers with the
+ * eventfd poll, such callbacks will be invoked with the wait queue
+ * lock held. IOW this means that this function can only be called from
+ * inside a registered poll callback at the moment.
+ * The function returns the current counter value, without consuming it.
+ */
+__u64 eventfd_ctx_peek_value(struct eventfd_ctx *ctx)
+{
+	assert_spin_locked(&ctx->wqh.lock);
+
+	return ctx->count;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_peek_value);
+
+
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit counter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ *
+ * If @no_wait is zero, the function might sleep until the eventfd internal
+ * counter becomes greater than zero.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-	struct eventfd_ctx *ctx = file->private_data;
 	ssize_t res;
-	__u64 ucnt = 0;
 	DECLARE_WAITQUEUE(wait, current);
 
-	if (count < sizeof(ucnt))
-		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
+	*cnt = 0;
 	res = -EAGAIN;
 	if (ctx->count > 0)
-		res = sizeof(ucnt);
-	else if (!(file->f_flags & O_NONBLOCK)) {
+		res = 0;
+	else if (!no_wait) {
 		__add_wait_queue(&ctx->wqh, &wait);
-		for (res = 0;;) {
+		for (;;) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (ctx->count > 0) {
-				res = sizeof(ucnt);
+				res = 0;
 				break;
 			}
 			if (signal_pending(current)) {
@@ -168,18 +198,32 @@ static ssize_t eventfd_read(struct file
 		__remove_wait_queue(&ctx->wqh, &wait);
 		__set_current_state(TASK_RUNNING);
 	}
-	if (likely(res > 0)) {
-		ucnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
-		ctx->count -= ucnt;
+	if (likely(res == 0)) {
+		*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
+		ctx->count -= *cnt;
 		if (waitqueue_active(&ctx->wqh))
 			wake_up_locked_poll(&ctx->wqh, POLLOUT);
 	}
 	spin_unlock_irq(&ctx->wqh.lock);
-	if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
-		return -EFAULT;
 
 	return res;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_read);
+
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+			    loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 cnt;
+
+	if (count < sizeof(cnt))
+		return -EINVAL;
+	if ((res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt)) < 0)
+		return res;
+
+	return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
+}
 
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
 			     loff_t *ppos)
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h	2010-01-06 12:37:28.000000000 -0800
+++ linux-2.6.mod/include/linux/eventfd.h	2010-01-06 14:54:36.000000000 -0800
@@ -34,6 +34,8 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
+__u64 eventfd_ctx_peek_value(struct eventfd_ctx *ctx);
 
 #else /* CONFIG_EVENTFD */
@@ -61,6 +63,17 @@ static inline void eventfd_ctx_put(struc
 }
 
+static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait,
+				       __u64 *cnt)
+{
+	return -ENOSYS;
+}
+
+static inline __u64 eventfd_ctx_peek_value(struct eventfd_ctx *ctx)
+{
+	return 0;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */