Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Gleb Natapov
On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
  I'd like to also support EOI handling. When the guest clears the
  interrupt condtion, the EOI callback would be called. This could occur
  much later than the IRQ delivery time. I'm not sure if we need the
  result code in that case.
  
  If any intermediate device (IOAPIC?) needs to be informed about either
  delivery or EOI also, it could create a proxy message with its
  callbacks in place. But we need then a separate opaque field (in
  addition to payload) to store the original message.
  
  struct IRQMsg {
   DeviceState *src;
   void (*delivery_cb)(IRQMsg *msg, int result);
   void (*eoi_cb)(IRQMsg *msg, int result);
   void *src_opaque;
   void *payload;
  };
 
 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.
 
I dislike use of eoi for reinfecting missing interrupts since
it eliminates use of internal PIC/APIC queue of not yet delivered
interrupts. PIC and APIC has internal queue that can handle two elements:
one is delivered, but not yet acked interrupt in isr and another is
pending interrupt in irr. Using eoi callback (or ack notifier as it's
called inside kernel) interrupt will be considered coalesced even if irr
is cleared, but no ack was received for previously delivered interrupt.
But ack notifiers actually has another use: device assignment. There is
a plan to move device assignment from kernel to userspace and for that
ack notifiers will have to be extended to userspace too. If so we can
use them to do irq decoalescing as well. I doubt they should be part
of IRQMsg though. Why not do what kernel does: have globally registered
notifier based on irqchip/pin.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Jan Kiszka
Gleb Natapov wrote:
 On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.

 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.

 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };
 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.

 I dislike use of eoi for reinfecting missing interrupts since
 it eliminates use of internal PIC/APIC queue of not yet delivered
 interrupts. PIC and APIC has internal queue that can handle two elements:
 one is delivered, but not yet acked interrupt in isr and another is
 pending interrupt in irr. Using eoi callback (or ack notifier as it's
 called inside kernel) interrupt will be considered coalesced even if irr
 is cleared, but no ack was received for previously delivered interrupt.
 But ack notifiers actually has another use: device assignment. There is
 a plan to move device assignment from kernel to userspace and for that
 ack notifiers will have to be extended to userspace too. If so we can
 use them to do irq decoalescing as well. I doubt they should be part
 of IRQMsg though. Why not do what kernel does: have globally registered
 notifier based on irqchip/pin.

I read this twice but I still don't get your plan. Do you like or
dislike using EIO for de-coalescing? And how should these notifiers work?

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Blue Swirl
On Sun, Jun 6, 2010 at 7:15 AM, Gleb Natapov g...@redhat.com wrote:
 On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
  I'd like to also support EOI handling. When the guest clears the
  interrupt condtion, the EOI callback would be called. This could occur
  much later than the IRQ delivery time. I'm not sure if we need the
  result code in that case.
 
  If any intermediate device (IOAPIC?) needs to be informed about either
  delivery or EOI also, it could create a proxy message with its
  callbacks in place. But we need then a separate opaque field (in
  addition to payload) to store the original message.
 
  struct IRQMsg {
   DeviceState *src;
   void (*delivery_cb)(IRQMsg *msg, int result);
   void (*eoi_cb)(IRQMsg *msg, int result);
   void *src_opaque;
   void *payload;
  };

 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.

 I dislike use of eoi for reinfecting missing interrupts since
 it eliminates use of internal PIC/APIC queue of not yet delivered
 interrupts. PIC and APIC has internal queue that can handle two elements:
 one is delivered, but not yet acked interrupt in isr and another is
 pending interrupt in irr. Using eoi callback (or ack notifier as it's
 called inside kernel) interrupt will be considered coalesced even if irr
 is cleared, but no ack was received for previously delivered interrupt.
 But ack notifiers actually has another use: device assignment. There is
 a plan to move device assignment from kernel to userspace and for that
 ack notifiers will have to be extended to userspace too. If so we can
 use them to do irq decoalescing as well. I doubt they should be part
 of IRQMsg though. Why not do what kernel does: have globally registered
 notifier based on irqchip/pin.

Because translation at IOAPIC may be lossy, IRQs from many devices
pointing to the same vector? With IRQMsg you know where a specific
message came from. The situation is different inside the kernel: it
manages both translation and registration, whereas in QEMU we could
only control registration.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Gleb Natapov
On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
  On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
  I'd like to also support EOI handling. When the guest clears the
  interrupt condtion, the EOI callback would be called. This could occur
  much later than the IRQ delivery time. I'm not sure if we need the
  result code in that case.
 
  If any intermediate device (IOAPIC?) needs to be informed about either
  delivery or EOI also, it could create a proxy message with its
  callbacks in place. But we need then a separate opaque field (in
  addition to payload) to store the original message.
 
  struct IRQMsg {
   DeviceState *src;
   void (*delivery_cb)(IRQMsg *msg, int result);
   void (*eoi_cb)(IRQMsg *msg, int result);
   void *src_opaque;
   void *payload;
  };
  Extending the lifetime of IRQMsg objects beyond the delivery call stack
  means qemu_malloc/free for every delivery. I think it takes a _very_
  appealing reason to justify this. But so far I do not see any use case
  for eio_cb at all.
 
  I dislike use of eoi for reinfecting missing interrupts since
  it eliminates use of internal PIC/APIC queue of not yet delivered
  interrupts. PIC and APIC has internal queue that can handle two elements:
  one is delivered, but not yet acked interrupt in isr and another is
  pending interrupt in irr. Using eoi callback (or ack notifier as it's
  called inside kernel) interrupt will be considered coalesced even if irr
  is cleared, but no ack was received for previously delivered interrupt.
  But ack notifiers actually has another use: device assignment. There is
  a plan to move device assignment from kernel to userspace and for that
  ack notifiers will have to be extended to userspace too. If so we can
  use them to do irq decoalescing as well. I doubt they should be part
  of IRQMsg though. Why not do what kernel does: have globally registered
  notifier based on irqchip/pin.
 
 I read this twice but I still don't get your plan. Do you like or
 dislike using EIO for de-coalescing? And how should these notifiers work?
 
That's because I confused myself :) I _dislike_ them to be used, but
since device assignment requires ack notifiers anyway may be it is better
to introduce one mechanism for device assignmen + de-coalescing instead
of introducing two different mechanism. Using ack notifiers should be
easy: RTC registers ack notifier and keep track of delivered interrupts.
If timer triggers after previews irq was set, but before it was acked
coalesced counter is incremented. In ack notifier callback coalesced
counter is checked and if it is not zero new irq is set.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Gleb Natapov
On Sun, Jun 06, 2010 at 07:39:49AM +, Blue Swirl wrote:
 On Sun, Jun 6, 2010 at 7:15 AM, Gleb Natapov g...@redhat.com wrote:
  On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
   I'd like to also support EOI handling. When the guest clears the
   interrupt condtion, the EOI callback would be called. This could occur
   much later than the IRQ delivery time. I'm not sure if we need the
   result code in that case.
  
   If any intermediate device (IOAPIC?) needs to be informed about either
   delivery or EOI also, it could create a proxy message with its
   callbacks in place. But we need then a separate opaque field (in
   addition to payload) to store the original message.
  
   struct IRQMsg {
    DeviceState *src;
    void (*delivery_cb)(IRQMsg *msg, int result);
    void (*eoi_cb)(IRQMsg *msg, int result);
    void *src_opaque;
    void *payload;
   };
 
  Extending the lifetime of IRQMsg objects beyond the delivery call stack
  means qemu_malloc/free for every delivery. I think it takes a _very_
  appealing reason to justify this. But so far I do not see any use case
  for eio_cb at all.
 
  I dislike use of eoi for reinfecting missing interrupts since
  it eliminates use of internal PIC/APIC queue of not yet delivered
  interrupts. PIC and APIC has internal queue that can handle two elements:
  one is delivered, but not yet acked interrupt in isr and another is
  pending interrupt in irr. Using eoi callback (or ack notifier as it's
  called inside kernel) interrupt will be considered coalesced even if irr
  is cleared, but no ack was received for previously delivered interrupt.
  But ack notifiers actually has another use: device assignment. There is
  a plan to move device assignment from kernel to userspace and for that
  ack notifiers will have to be extended to userspace too. If so we can
  use them to do irq decoalescing as well. I doubt they should be part
  of IRQMsg though. Why not do what kernel does: have globally registered
  notifier based on irqchip/pin.
 
 Because translation at IOAPIC may be lossy, IRQs from many devices
 pointing to the same vector? With IRQMsg you know where a specific
 message came from. The situation is different inside the kernel: it
 manages both translation and registration, whereas in QEMU we could
 only control registration.
Configuring IOAPIC like that is against x86 architecture. OS will not be
able to map from interrupt vector back to device.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Jan Kiszka
Gleb Natapov wrote:
 On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.

 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.

 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };
 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.

 I dislike use of eoi for reinfecting missing interrupts since
 it eliminates use of internal PIC/APIC queue of not yet delivered
 interrupts. PIC and APIC has internal queue that can handle two elements:
 one is delivered, but not yet acked interrupt in isr and another is
 pending interrupt in irr. Using eoi callback (or ack notifier as it's
 called inside kernel) interrupt will be considered coalesced even if irr
 is cleared, but no ack was received for previously delivered interrupt.
 But ack notifiers actually has another use: device assignment. There is
 a plan to move device assignment from kernel to userspace and for that
 ack notifiers will have to be extended to userspace too. If so we can
 use them to do irq decoalescing as well. I doubt they should be part
 of IRQMsg though. Why not do what kernel does: have globally registered
 notifier based on irqchip/pin.
 I read this twice but I still don't get your plan. Do you like or
 dislike using EIO for de-coalescing? And how should these notifiers work?

 That's because I confused myself :) I _dislike_ them to be used, but
 since device assignment requires ack notifiers anyway may be it is better
 to introduce one mechanism for device assignmen + de-coalescing instead
 of introducing two different mechanism. Using ack notifiers should be
 easy: RTC registers ack notifier and keep track of delivered interrupts.
 If timer triggers after previews irq was set, but before it was acked
 coalesced counter is incremented. In ack notifier callback coalesced
 counter is checked and if it is not zero new irq is set.

Ack notifier registrations and event deliveries still need to be routed.
Piggy-backing this on IRQ messages may be unavoidable for that reason.

Anyway, I'm going to post my HPET updates with the infrastructure for
IRQMsg now. Maybe it's helpful to see the other option in reality.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Gleb Natapov
On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
  On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
  Gleb Natapov wrote:
  On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
  I'd like to also support EOI handling. When the guest clears the
  interrupt condtion, the EOI callback would be called. This could occur
  much later than the IRQ delivery time. I'm not sure if we need the
  result code in that case.
 
  If any intermediate device (IOAPIC?) needs to be informed about either
  delivery or EOI also, it could create a proxy message with its
  callbacks in place. But we need then a separate opaque field (in
  addition to payload) to store the original message.
 
  struct IRQMsg {
   DeviceState *src;
   void (*delivery_cb)(IRQMsg *msg, int result);
   void (*eoi_cb)(IRQMsg *msg, int result);
   void *src_opaque;
   void *payload;
  };
  Extending the lifetime of IRQMsg objects beyond the delivery call stack
  means qemu_malloc/free for every delivery. I think it takes a _very_
  appealing reason to justify this. But so far I do not see any use case
  for eio_cb at all.
 
  I dislike use of eoi for reinfecting missing interrupts since
  it eliminates use of internal PIC/APIC queue of not yet delivered
  interrupts. PIC and APIC has internal queue that can handle two elements:
  one is delivered, but not yet acked interrupt in isr and another is
  pending interrupt in irr. Using eoi callback (or ack notifier as it's
  called inside kernel) interrupt will be considered coalesced even if irr
  is cleared, but no ack was received for previously delivered interrupt.
  But ack notifiers actually has another use: device assignment. There is
  a plan to move device assignment from kernel to userspace and for that
  ack notifiers will have to be extended to userspace too. If so we can
  use them to do irq decoalescing as well. I doubt they should be part
  of IRQMsg though. Why not do what kernel does: have globally registered
  notifier based on irqchip/pin.
  I read this twice but I still don't get your plan. Do you like or
  dislike using EIO for de-coalescing? And how should these notifiers work?
 
  That's because I confused myself :) I _dislike_ them to be used, but
  since device assignment requires ack notifiers anyway may be it is better
  to introduce one mechanism for device assignmen + de-coalescing instead
  of introducing two different mechanism. Using ack notifiers should be
  easy: RTC registers ack notifier and keep track of delivered interrupts.
  If timer triggers after previews irq was set, but before it was acked
  coalesced counter is incremented. In ack notifier callback coalesced
  counter is checked and if it is not zero new irq is set.
 
 Ack notifier registrations and event deliveries still need to be routed.
 Piggy-backing this on IRQ messages may be unavoidable for that reason.
It is done in the kernel without piggy-backing.

 
 Anyway, I'm going to post my HPET updates with the infrastructure for
 IRQMsg now. Maybe it's helpful to see the other option in reality.
 
One other think to consider current approach does not always work.
Win2K3-64bit-smp and Win2k8-64bit-smp configure RTC interrupt to be
broadcasted to all cpus, but only boot cpu does time calculation. With
current approach if interrupt is delivered to at least one vcpu
it will not be considered coalesced, but if cpu it was delivered to is
not cpu that does time accounting then clock will drift.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Jan Kiszka
Gleb Natapov wrote:
 On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.

 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.

 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };
 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.

 I dislike use of eoi for reinfecting missing interrupts since
 it eliminates use of internal PIC/APIC queue of not yet delivered
 interrupts. PIC and APIC has internal queue that can handle two elements:
 one is delivered, but not yet acked interrupt in isr and another is
 pending interrupt in irr. Using eoi callback (or ack notifier as it's
 called inside kernel) interrupt will be considered coalesced even if irr
 is cleared, but no ack was received for previously delivered interrupt.
 But ack notifiers actually has another use: device assignment. There is
 a plan to move device assignment from kernel to userspace and for that
 ack notifiers will have to be extended to userspace too. If so we can
 use them to do irq decoalescing as well. I doubt they should be part
 of IRQMsg though. Why not do what kernel does: have globally registered
 notifier based on irqchip/pin.
 I read this twice but I still don't get your plan. Do you like or
 dislike using EIO for de-coalescing? And how should these notifiers work?

 That's because I confused myself :) I _dislike_ them to be used, but
 since device assignment requires ack notifiers anyway may be it is better
 to introduce one mechanism for device assignmen + de-coalescing instead
 of introducing two different mechanism. Using ack notifiers should be
 easy: RTC registers ack notifier and keep track of delivered interrupts.
 If timer triggers after previews irq was set, but before it was acked
 coalesced counter is incremented. In ack notifier callback coalesced
 counter is checked and if it is not zero new irq is set.
 Ack notifier registrations and event deliveries still need to be routed.
 Piggy-backing this on IRQ messages may be unavoidable for that reason.
 It is done in the kernel without piggy-backing.

As it does not include any IRQ routers in front of the interrupt
controller. Maybe it works for x86, but it is no generic solution.

Also, periodic timer sources get no information about the fact that
their interrupt is masked somewhere along the path to the VCPUs and will
possibly replay countless IRQs when the masking ends, no?

 
 Anyway, I'm going to post my HPET updates with the infrastructure for
 IRQMsg now. Maybe it's helpful to see the other option in reality.

 One other think to consider current approach does not always work.
 Win2K3-64bit-smp and Win2k8-64bit-smp configure RTC interrupt to be
 broadcasted to all cpus, but only boot cpu does time calculation. With
 current approach if interrupt is delivered to at least one vcpu
 it will not be considered coalesced, but if cpu it was delivered to is
 not cpu that does time accounting then clock will drift.

That means we would have to fire callbacks per receiving CPU and report
its number back. Is there a way to find out if we are running such a
guest without an '-enable-win2k[38]-64bit-smp-rtc-drift-fix'?

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-06 Thread Gleb Natapov
On Sun, Jun 06, 2010 at 12:10:07PM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
  On Sun, Jun 06, 2010 at 10:07:48AM +0200, Jan Kiszka wrote:
  Gleb Natapov wrote:
  On Sun, Jun 06, 2010 at 09:39:04AM +0200, Jan Kiszka wrote:
  Gleb Natapov wrote:
  On Sat, Jun 05, 2010 at 02:04:01AM +0200, Jan Kiszka wrote:
  I'd like to also support EOI handling. When the guest clears the
  interrupt condtion, the EOI callback would be called. This could occur
  much later than the IRQ delivery time. I'm not sure if we need the
  result code in that case.
 
  If any intermediate device (IOAPIC?) needs to be informed about either
  delivery or EOI also, it could create a proxy message with its
  callbacks in place. But we need then a separate opaque field (in
  addition to payload) to store the original message.
 
  struct IRQMsg {
   DeviceState *src;
   void (*delivery_cb)(IRQMsg *msg, int result);
   void (*eoi_cb)(IRQMsg *msg, int result);
   void *src_opaque;
   void *payload;
  };
  Extending the lifetime of IRQMsg objects beyond the delivery call stack
  means qemu_malloc/free for every delivery. I think it takes a _very_
  appealing reason to justify this. But so far I do not see any use case
  for eio_cb at all.
 
  I dislike use of eoi for reinfecting missing interrupts since
  it eliminates use of internal PIC/APIC queue of not yet delivered
  interrupts. PIC and APIC has internal queue that can handle two 
  elements:
  one is delivered, but not yet acked interrupt in isr and another is
  pending interrupt in irr. Using eoi callback (or ack notifier as it's
  called inside kernel) interrupt will be considered coalesced even if irr
  is cleared, but no ack was received for previously delivered interrupt.
  But ack notifiers actually has another use: device assignment. There is
  a plan to move device assignment from kernel to userspace and for that
  ack notifiers will have to be extended to userspace too. If so we can
  use them to do irq decoalescing as well. I doubt they should be part
  of IRQMsg though. Why not do what kernel does: have globally registered
  notifier based on irqchip/pin.
  I read this twice but I still don't get your plan. Do you like or
  dislike using EIO for de-coalescing? And how should these notifiers work?
 
  That's because I confused myself :) I _dislike_ them to be used, but
  since device assignment requires ack notifiers anyway may be it is better
  to introduce one mechanism for device assignmen + de-coalescing instead
  of introducing two different mechanism. Using ack notifiers should be
  easy: RTC registers ack notifier and keep track of delivered interrupts.
  If timer triggers after previews irq was set, but before it was acked
  coalesced counter is incremented. In ack notifier callback coalesced
  counter is checked and if it is not zero new irq is set.
  Ack notifier registrations and event deliveries still need to be routed.
  Piggy-backing this on IRQ messages may be unavoidable for that reason.
  It is done in the kernel without piggy-backing.
 
 As it does not include any IRQ routers in front of the interrupt
 controller. Maybe it works for x86, but it is no generic solution.
 
x86 has IRQ router in front of interrupt controller inside pci host
bridge.

 Also, periodic timer sources get no information about the fact that
 their interrupt is masked somewhere along the path to the VCPUs and will
 possibly replay countless IRQs when the masking ends, no?
 
Correct, for that we have mask notifiers in the kernel. Gets ugly be the
minute.

  
  Anyway, I'm going to post my HPET updates with the infrastructure for
  IRQMsg now. Maybe it's helpful to see the other option in reality.
 
  One other think to consider current approach does not always work.
  Win2K3-64bit-smp and Win2k8-64bit-smp configure RTC interrupt to be
  broadcasted to all cpus, but only boot cpu does time calculation. With
  current approach if interrupt is delivered to at least one vcpu
  it will not be considered coalesced, but if cpu it was delivered to is
  not cpu that does time accounting then clock will drift.
 
 That means we would have to fire callbacks per receiving CPU and report
 its number back. Is there a way to find out if we are running such a
 guest without an '-enable-win2k[38]-64bit-smp-rtc-drift-fix'?
 
Not that I know of.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-05 Thread Blue Swirl
On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
 But how about if we introduced instead a message based IRQ? Then the
 message could specify the originator device, maybe ACK/coalesce/NACK
 callbacks and a bigger payload than just 1 bit of level. I think that
 could achieve the same coalescing effect as what the bidirectional
 IRQ. The payload could be useful for other purposes, for example
 Sparc64 IRQ messages contain three 64 bit words.
 If there are more users than just IRQ de-coalescing, this indeed sounds
 superior. We could pass objects like this one around:

 struct qemu_irq_msg {
  void (*delivery_cb)(int result);
  void *payload;
 };

 They would be valid within the scope of the IRQ handlers. Whoever
 terminates or actually delivers the IRQ would invoke the callback. And
 platforms like sparc64 could evaluate the additional payload pointer in
 their irqchips or wherever they need it. IRQ routers on platforms that
 make use of these messages would have to replicate them when forwarding
 an event.

 OK?

 Let me see if I understand you correctly. qemu_set_irq() will get
 additional parameter qemu_irq_msg and if irq was not coalesced
 delivery_cb is called, so there is a guaranty that if delivery_cb is
 called it is done before qemu_set_irq() returns. Correct?
 If the side that triggers an IRQ passes a message object with a non-NULL
 callback, it is supposed to be called unconditionally, passing the
 result of the delivery (delivered, masked, coalesced). And yes, the
 callback will be invoked in the context of the irq handler, so before
 qemu_set_irq (or rather some new qemu_set_irq_msg) returns.

 Looks fine to me.

 Except that delivery_cb should probably get pointer to qemu_irq_msg as a
 parameter.

 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.

 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.

 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };

 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.

I think it's safer to use allocation model anyway because this will be
generic code. For example, an intermediate device may want to queue
the IRQs. Alternatively, the callbacks could use DeviceState and some
opaque which can be used as the callback context:
  void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);

EOI can be added later if needed, QEMU seems to work fine now without
it. But based on IOAPIC data sheet, I'd suppose it should be need to
pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
another opportunity to inject IRQs though I guess the guest will ack
the IRQ at the same time for both RTC and APIC.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-05 Thread Jan Kiszka
Blue Swirl wrote:
 On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
 But how about if we introduced instead a message based IRQ? Then the
 message could specify the originator device, maybe ACK/coalesce/NACK
 callbacks and a bigger payload than just 1 bit of level. I think that
 could achieve the same coalescing effect as what the bidirectional
 IRQ. The payload could be useful for other purposes, for example
 Sparc64 IRQ messages contain three 64 bit words.
 If there are more users than just IRQ de-coalescing, this indeed sounds
 superior. We could pass objects like this one around:

 struct qemu_irq_msg {
  void (*delivery_cb)(int result);
  void *payload;
 };

 They would be valid within the scope of the IRQ handlers. Whoever
 terminates or actually delivers the IRQ would invoke the callback. And
 platforms like sparc64 could evaluate the additional payload pointer in
 their irqchips or wherever they need it. IRQ routers on platforms that
 make use of these messages would have to replicate them when forwarding
 an event.

 OK?

 Let me see if I understand you correctly. qemu_set_irq() will get
 additional parameter qemu_irq_msg and if irq was not coalesced
 delivery_cb is called, so there is a guaranty that if delivery_cb is
 called it is done before qemu_set_irq() returns. Correct?
 If the side that triggers an IRQ passes a message object with a non-NULL
 callback, it is supposed to be called unconditionally, passing the
 result of the delivery (delivered, masked, coalesced). And yes, the
 callback will be invoked in the context of the irq handler, so before
 qemu_set_irq (or rather some new qemu_set_irq_msg) returns.

 Looks fine to me.

 Except that delivery_cb should probably get pointer to qemu_irq_msg as a
 parameter.
 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.

 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.

 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };
 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.
 
 I think it's safer to use allocation model anyway because this will be
 generic code. For example, an intermediate device may want to queue
 the IRQs. Alternatively, the callbacks could use DeviceState and some
 opaque which can be used as the callback context:
   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);
 
 EOI can be added later if needed, QEMU seems to work fine now without
 it. But based on IOAPIC data sheet, I'd suppose it should be need to
 pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
 another opportunity to inject IRQs though I guess the guest will ack
 the IRQ at the same time for both RTC and APIC.

Let's wait for a real use case for an extended IRQMsg lifetime. For now
we are fine with stack-allocated messages which are way simpler to
handle. I'm already drafting a first prototype based on this model.
Switching to dynamic allocation may still happen later on once the
urgent need shows up.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-05 Thread Blue Swirl
On Sat, Jun 5, 2010 at 8:27 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
 But how about if we introduced instead a message based IRQ? Then the
 message could specify the originator device, maybe ACK/coalesce/NACK
 callbacks and a bigger payload than just 1 bit of level. I think that
 could achieve the same coalescing effect as what the bidirectional
 IRQ. The payload could be useful for other purposes, for example
 Sparc64 IRQ messages contain three 64 bit words.
 If there are more users than just IRQ de-coalescing, this indeed 
 sounds
 superior. We could pass objects like this one around:

 struct qemu_irq_msg {
  void (*delivery_cb)(int result);
  void *payload;
 };

 They would be valid within the scope of the IRQ handlers. Whoever
 terminates or actually delivers the IRQ would invoke the callback. And
 platforms like sparc64 could evaluate the additional payload pointer 
 in
 their irqchips or wherever they need it. IRQ routers on platforms that
 make use of these messages would have to replicate them when 
 forwarding
 an event.

 OK?

 Let me see if I understand you correctly. qemu_set_irq() will get
 additional parameter qemu_irq_msg and if irq was not coalesced
 delivery_cb is called, so there is a guaranty that if delivery_cb is
 called it is done before qemu_set_irq() returns. Correct?
 If the side that triggers an IRQ passes a message object with a non-NULL
 callback, it is supposed to be called unconditionally, passing the
 result of the delivery (delivered, masked, coalesced). And yes, the
 callback will be invoked in the context of the irq handler, so before
 qemu_set_irq (or rather some new qemu_set_irq_msg) returns.

 Looks fine to me.

 Except that delivery_cb should probably get pointer to qemu_irq_msg as a
 parameter.
 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.

 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.

 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };
 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.

 I think it's safer to use allocation model anyway because this will be
 generic code. For example, an intermediate device may want to queue
 the IRQs. Alternatively, the callbacks could use DeviceState and some
 opaque which can be used as the callback context:
   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);

 EOI can be added later if needed, QEMU seems to work fine now without
 it. But based on IOAPIC data sheet, I'd suppose it should be need to
 pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
 another opportunity to inject IRQs though I guess the guest will ack
 the IRQ at the same time for both RTC and APIC.

 Let's wait for a real use case for an extended IRQMsg lifetime. For now
 we are fine with stack-allocated messages which are way simpler to
 handle. I'm already drafting a first prototype based on this model.
 Switching to dynamic allocation may still happen later on once the
 urgent need shows up.

Passing around stack allocated objects is asking for trouble. I'd much
rather use the DeviceState/opaque version then, so at least
destination should not need to use IRQMsg for anything.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-05 Thread Jan Kiszka
Blue Swirl wrote:
 On Sat, Jun 5, 2010 at 8:27 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Sat, Jun 5, 2010 at 12:04 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
 But how about if we introduced instead a message based IRQ? Then the
 message could specify the originator device, maybe ACK/coalesce/NACK
 callbacks and a bigger payload than just 1 bit of level. I think 
 that
 could achieve the same coalescing effect as what the bidirectional
 IRQ. The payload could be useful for other purposes, for example
 Sparc64 IRQ messages contain three 64 bit words.
 If there are more users than just IRQ de-coalescing, this indeed 
 sounds
 superior. We could pass objects like this one around:

 struct qemu_irq_msg {
  void (*delivery_cb)(int result);
  void *payload;
 };

 They would be valid within the scope of the IRQ handlers. Whoever
 terminates or actually delivers the IRQ would invoke the callback. 
 And
 platforms like sparc64 could evaluate the additional payload pointer 
 in
 their irqchips or wherever they need it. IRQ routers on platforms 
 that
 make use of these messages would have to replicate them when 
 forwarding
 an event.

 OK?

 Let me see if I understand you correctly. qemu_set_irq() will get
 additional parameter qemu_irq_msg and if irq was not coalesced
 delivery_cb is called, so there is a guaranty that if delivery_cb is
 called it is done before qemu_set_irq() returns. Correct?
 If the side that triggers an IRQ passes a message object with a 
 non-NULL
 callback, it is supposed to be called unconditionally, passing the
 result of the delivery (delivered, masked, coalesced). And yes, the
 callback will be invoked in the context of the irq handler, so before
 qemu_set_irq (or rather some new qemu_set_irq_msg) returns.

 Looks fine to me.

 Except that delivery_cb should probably get pointer to qemu_irq_msg as a
 parameter.
 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.

 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.

 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };
 Extending the lifetime of IRQMsg objects beyond the delivery call stack
 means qemu_malloc/free for every delivery. I think it takes a _very_
 appealing reason to justify this. But so far I do not see any use case
 for eio_cb at all.
 I think it's safer to use allocation model anyway because this will be
 generic code. For example, an intermediate device may want to queue
 the IRQs. Alternatively, the callbacks could use DeviceState and some
 opaque which can be used as the callback context:
   void (*delivery_cb)(DeviceState *src, void *src_opaque, int result);

 EOI can be added later if needed, QEMU seems to work fine now without
 it. But based on IOAPIC data sheet, I'd suppose it should be need to
 pass EOI from LAPIC to IOAPIC. It could be used by coalescing as
 another opportunity to inject IRQs though I guess the guest will ack
 the IRQ at the same time for both RTC and APIC.
 Let's wait for a real use case for an extended IRQMsg lifetime. For now
 we are fine with stack-allocated messages which are way simpler to
 handle. I'm already drafting a first prototype based on this model.
 Switching to dynamic allocation may still happen later on once the
 urgent need shows up.
 
 Passing around stack allocated objects is asking for trouble. I'd much
 rather use the DeviceState/opaque version then, so at least
 destination should not need to use IRQMsg for anything.

Right, I've hiden the IRQMsg object from the target handler, temporarily
storing it in qemu_irq instead. qemu_irq_handler had to be touched
anyway, so I'm now passing the IRQ object to it so that it can invoke
services to trigger the delivery callback or obtain the payload.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-04 Thread Blue Swirl
On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
  Gleb Natapov wrote:
   On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
   Blue Swirl wrote:
   But how about if we introduced instead a message based IRQ? Then the
   message could specify the originator device, maybe ACK/coalesce/NACK
   callbacks and a bigger payload than just 1 bit of level. I think that
   could achieve the same coalescing effect as what the bidirectional
   IRQ. The payload could be useful for other purposes, for example
   Sparc64 IRQ messages contain three 64 bit words.
   If there are more users than just IRQ de-coalescing, this indeed sounds
   superior. We could pass objects like this one around:
  
   struct qemu_irq_msg {
    void (*delivery_cb)(int result);
    void *payload;
   };
  
   They would be valid within the scope of the IRQ handlers. Whoever
   terminates or actually delivers the IRQ would invoke the callback. And
   platforms like sparc64 could evaluate the additional payload pointer in
   their irqchips or wherever they need it. IRQ routers on platforms that
   make use of these messages would have to replicate them when forwarding
   an event.
  
   OK?
  
   Let me see if I understand you correctly. qemu_set_irq() will get
   additional parameter qemu_irq_msg and if irq was not coalesced
   delivery_cb is called, so there is a guaranty that if delivery_cb is
   called it is done before qemu_set_irq() returns. Correct?
 
  If the side that triggers an IRQ passes a message object with a non-NULL
  callback, it is supposed to be called unconditionally, passing the
  result of the delivery (delivered, masked, coalesced). And yes, the
  callback will be invoked in the context of the irq handler, so before
  qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
 
 Looks fine to me.

 Except that delivery_cb should probably get pointer to qemu_irq_msg as a
 parameter.

I'd like to also support EOI handling. When the guest clears the
interrupt condtion, the EOI callback would be called. This could occur
much later than the IRQ delivery time. I'm not sure if we need the
result code in that case.

If any intermediate device (IOAPIC?) needs to be informed about either
delivery or EOI also, it could create a proxy message with its
callbacks in place. But we need then a separate opaque field (in
addition to payload) to store the original message.

struct IRQMsg {
 DeviceState *src;
 void (*delivery_cb)(IRQMsg *msg, int result);
 void (*eoi_cb)(IRQMsg *msg, int result);
 void *src_opaque;
 void *payload;
};



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-04 Thread Jan Kiszka
Blue Swirl wrote:
 On Thu, Jun 3, 2010 at 7:06 AM, Gleb Natapov g...@redhat.com wrote:
 On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
 But how about if we introduced instead a message based IRQ? Then the
 message could specify the originator device, maybe ACK/coalesce/NACK
 callbacks and a bigger payload than just 1 bit of level. I think that
 could achieve the same coalescing effect as what the bidirectional
 IRQ. The payload could be useful for other purposes, for example
 Sparc64 IRQ messages contain three 64 bit words.
 If there are more users than just IRQ de-coalescing, this indeed sounds
 superior. We could pass objects like this one around:

 struct qemu_irq_msg {
  void (*delivery_cb)(int result);
  void *payload;
 };

 They would be valid within the scope of the IRQ handlers. Whoever
 terminates or actually delivers the IRQ would invoke the callback. And
 platforms like sparc64 could evaluate the additional payload pointer in
 their irqchips or wherever they need it. IRQ routers on platforms that
 make use of these messages would have to replicate them when forwarding
 an event.

 OK?

 Let me see if I understand you correctly. qemu_set_irq() will get
 additional parameter qemu_irq_msg and if irq was not coalesced
 delivery_cb is called, so there is a guaranty that if delivery_cb is
 called it is done before qemu_set_irq() returns. Correct?
 If the side that triggers an IRQ passes a message object with a non-NULL
 callback, it is supposed to be called unconditionally, passing the
 result of the delivery (delivered, masked, coalesced). And yes, the
 callback will be invoked in the context of the irq handler, so before
 qemu_set_irq (or rather some new qemu_set_irq_msg) returns.

 Looks fine to me.

 Except that delivery_cb should probably get pointer to qemu_irq_msg as a
 parameter.
 
 I'd like to also support EOI handling. When the guest clears the
 interrupt condtion, the EOI callback would be called. This could occur
 much later than the IRQ delivery time. I'm not sure if we need the
 result code in that case.
 
 If any intermediate device (IOAPIC?) needs to be informed about either
 delivery or EOI also, it could create a proxy message with its
 callbacks in place. But we need then a separate opaque field (in
 addition to payload) to store the original message.
 
 struct IRQMsg {
  DeviceState *src;
  void (*delivery_cb)(IRQMsg *msg, int result);
  void (*eoi_cb)(IRQMsg *msg, int result);
  void *src_opaque;
  void *payload;
 };

Extending the lifetime of IRQMsg objects beyond the delivery call stack
means qemu_malloc/free for every delivery. I think it takes a _very_
appealing reason to justify this. But so far I do not see any use case
for eio_cb at all.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-03 Thread Jan Kiszka
Blue Swirl wrote:
 But how about if we introduced instead a message based IRQ? Then the
 message could specify the originator device, maybe ACK/coalesce/NACK
 callbacks and a bigger payload than just 1 bit of level. I think that
 could achieve the same coalescing effect as what the bidirectional
 IRQ. The payload could be useful for other purposes, for example
 Sparc64 IRQ messages contain three 64 bit words.

If there are more users than just IRQ de-coalescing, this indeed sounds
superior. We could pass objects like this one around:

struct qemu_irq_msg {
void (*delivery_cb)(int result);
void *payload;
};

They would be valid within the scope of the IRQ handlers. Whoever
terminates or actually delivers the IRQ would invoke the callback. And
platforms like sparc64 could evaluate the additional payload pointer in
their irqchips or wherever they need it. IRQ routers on platforms that
make use of these messages would have to replicate them when forwarding
an event.

OK?

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-03 Thread Gleb Natapov
On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
  But how about if we introduced instead a message based IRQ? Then the
  message could specify the originator device, maybe ACK/coalesce/NACK
  callbacks and a bigger payload than just 1 bit of level. I think that
  could achieve the same coalescing effect as what the bidirectional
  IRQ. The payload could be useful for other purposes, for example
  Sparc64 IRQ messages contain three 64 bit words.
 
 If there are more users than just IRQ de-coalescing, this indeed sounds
 superior. We could pass objects like this one around:
 
 struct qemu_irq_msg {
   void (*delivery_cb)(int result);
   void *payload;
 };
 
 They would be valid within the scope of the IRQ handlers. Whoever
 terminates or actually delivers the IRQ would invoke the callback. And
 platforms like sparc64 could evaluate the additional payload pointer in
 their irqchips or wherever they need it. IRQ routers on platforms that
 make use of these messages would have to replicate them when forwarding
 an event.
 
 OK?
 
Let me see if I understand you correctly. qemu_set_irq() will get
additional parameter qemu_irq_msg and if irq was not coalesced
delivery_cb is called, so there is a guaranty that if delivery_cb is
called it is done before qemu_set_irq() returns. Correct?

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-03 Thread Jan Kiszka
Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
 But how about if we introduced instead a message based IRQ? Then the
 message could specify the originator device, maybe ACK/coalesce/NACK
 callbacks and a bigger payload than just 1 bit of level. I think that
 could achieve the same coalescing effect as what the bidirectional
 IRQ. The payload could be useful for other purposes, for example
 Sparc64 IRQ messages contain three 64 bit words.
 If there are more users than just IRQ de-coalescing, this indeed sounds
 superior. We could pass objects like this one around:

 struct qemu_irq_msg {
  void (*delivery_cb)(int result);
  void *payload;
 };

 They would be valid within the scope of the IRQ handlers. Whoever
 terminates or actually delivers the IRQ would invoke the callback. And
 platforms like sparc64 could evaluate the additional payload pointer in
 their irqchips or wherever they need it. IRQ routers on platforms that
 make use of these messages would have to replicate them when forwarding
 an event.

 OK?

 Let me see if I understand you correctly. qemu_set_irq() will get
 additional parameter qemu_irq_msg and if irq was not coalesced
 delivery_cb is called, so there is a guaranty that if delivery_cb is
 called it is done before qemu_set_irq() returns. Correct?

If the side that triggers an IRQ passes a message object with a non-NULL
callback, it is supposed to be called unconditionally, passing the
result of the delivery (delivered, masked, coalesced). And yes, the
callback will be invoked in the context of the irq handler, so before
qemu_set_irq (or rather some new qemu_set_irq_msg) returns.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-03 Thread Gleb Natapov
On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
 Gleb Natapov wrote:
  On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
  Blue Swirl wrote:
  But how about if we introduced instead a message based IRQ? Then the
  message could specify the originator device, maybe ACK/coalesce/NACK
  callbacks and a bigger payload than just 1 bit of level. I think that
  could achieve the same coalescing effect as what the bidirectional
  IRQ. The payload could be useful for other purposes, for example
  Sparc64 IRQ messages contain three 64 bit words.
  If there are more users than just IRQ de-coalescing, this indeed sounds
  superior. We could pass objects like this one around:
 
  struct qemu_irq_msg {
 void (*delivery_cb)(int result);
 void *payload;
  };
 
  They would be valid within the scope of the IRQ handlers. Whoever
  terminates or actually delivers the IRQ would invoke the callback. And
  platforms like sparc64 could evaluate the additional payload pointer in
  their irqchips or wherever they need it. IRQ routers on platforms that
  make use of these messages would have to replicate them when forwarding
  an event.
 
  OK?
 
  Let me see if I understand you correctly. qemu_set_irq() will get
  additional parameter qemu_irq_msg and if irq was not coalesced
  delivery_cb is called, so there is a guaranty that if delivery_cb is
  called it is done before qemu_set_irq() returns. Correct?
 
 If the side that triggers an IRQ passes a message object with a non-NULL
 callback, it is supposed to be called unconditionally, passing the
 result of the delivery (delivered, masked, coalesced). And yes, the
 callback will be invoked in the context of the irq handler, so before
 qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
 
Looks fine to me.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-03 Thread Gleb Natapov
On Thu, Jun 03, 2010 at 10:03:00AM +0300, Gleb Natapov wrote:
 On Thu, Jun 03, 2010 at 08:59:23AM +0200, Jan Kiszka wrote:
  Gleb Natapov wrote:
   On Thu, Jun 03, 2010 at 08:23:46AM +0200, Jan Kiszka wrote:
   Blue Swirl wrote:
   But how about if we introduced instead a message based IRQ? Then the
   message could specify the originator device, maybe ACK/coalesce/NACK
   callbacks and a bigger payload than just 1 bit of level. I think that
   could achieve the same coalescing effect as what the bidirectional
   IRQ. The payload could be useful for other purposes, for example
   Sparc64 IRQ messages contain three 64 bit words.
   If there are more users than just IRQ de-coalescing, this indeed sounds
   superior. We could pass objects like this one around:
  
   struct qemu_irq_msg {
void (*delivery_cb)(int result);
void *payload;
   };
  
   They would be valid within the scope of the IRQ handlers. Whoever
   terminates or actually delivers the IRQ would invoke the callback. And
   platforms like sparc64 could evaluate the additional payload pointer in
   their irqchips or wherever they need it. IRQ routers on platforms that
   make use of these messages would have to replicate them when forwarding
   an event.
  
   OK?
  
   Let me see if I understand you correctly. qemu_set_irq() will get
   additional parameter qemu_irq_msg and if irq was not coalesced
   delivery_cb is called, so there is a guaranty that if delivery_cb is
   called it is done before qemu_set_irq() returns. Correct?
  
  If the side that triggers an IRQ passes a message object with a non-NULL
  callback, it is supposed to be called unconditionally, passing the
  result of the delivery (delivered, masked, coalesced). And yes, the
  callback will be invoked in the context of the irq handler, so before
  qemu_set_irq (or rather some new qemu_set_irq_msg) returns.
  
 Looks fine to me.
 
Except that delivery_cb should probably get pointer to qemu_irq_msg as a
parameter.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-02 Thread Blue Swirl
On Tue, Jun 1, 2010 at 6:30 PM, Gleb Natapov g...@redhat.com wrote:
 On Tue, Jun 01, 2010 at 06:00:20PM +, Blue Swirl wrote:
   Looks like irr in apic is never cleared. Probably bug in userspace apic
   emulation. I'll look into it. Try to route interrupt via APIC (not 
   ExtInt),
   or use qemu-kvm with in kernel irq chip.
 
  With APIC you mean Fixed? Then the IRQ is not delivered at all.
  You need to deliver it through IOAPIC.

 In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
 and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
 left to reset state and PIC is used.

 With the PIC version, there are bogus injections. With IOAPIC and APIC
 dots come at 2Hz. -enable-kvm has no effect.

 I looked into this briefly. Virtual wire case is not handled correctly,
 and pic lacks coalescing detection at all (the reason BTW because proper
 solution was rejected, so only hack required by Windows was added).
 Windows routes RTC interrupts through IOAPIC and this case works as you
 can see. -enable-kvm on qemu should not make any difference since it
 uses the same userspace pic/ioapic/apic code as TCG. It would be
 interesting to try with qemu-kvm which uses in-kernel irq chips by
 default.

 BTW I managed to compile you test program with gcc 4.2. 4.4 and 4.3
 produced non-working binaries for me.

This exercise showed me that the IRQ connectivity between APIC and RTC
can vary a lot because of guest activities. So it may be impossible to
know at APIC if some IRQ came from RTC and so the coalescing must be
accounted at RTC (or close to RTC). This makes a generic solution less
useful, I think the net result would be to refactor
rtc_coalesced_timer*() to a new file. Also my generic approach of time
base adjustment would need to be steered by RTC because APIC can't. So
I think I'll have to abandon this idea for now. Sorry for the delay,
thanks for the interesting discussion though.

Coming back to $SUBJECT: I don't think the bidirectional IRQ is a good
solution either, but it may be slightly better than current hack (I
didn't realize the full ugliness of the hack before, namely that APIC
and RTC are really not connected at all). The same effect could be
achieved with two signals going to opposite directions, like
APICREQ/APICACK on 82093 IOAPIC but again, if any IRQs point to same
vector at IOAPIC, there is no way to get back to originator. With
bidirectional IRQs, there's a one-shot chance at the delivery time,
but not later at ACK/EOI time.

But how about if we introduced instead a message based IRQ? Then the
message could specify the originator device, maybe ACK/coalesce/NACK
callbacks and a bigger payload than just 1 bit of level. I think that
could achieve the same coalescing effect as what the bidirectional
IRQ. The payload could be useful for other purposes, for example
Sparc64 IRQ messages contain three 64 bit words.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-01 Thread Blue Swirl
On Mon, May 31, 2010 at 5:19 AM, Gleb Natapov g...@redhat.com wrote:
 On Sun, May 30, 2010 at 08:21:30PM +, Blue Swirl wrote:
 On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov g...@redhat.com wrote:
  On Sun, May 30, 2010 at 07:37:59PM +, Blue Swirl wrote:
  On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov g...@redhat.com wrote:
   On Sun, May 30, 2010 at 12:56:26PM +, Blue Swirl wrote:
Well, I'd like to get the test program also trigger it. Now I'm 
getting:
apic: write: 0350 = 
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_reset_irq_delivered: old coalescing 1
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0
apic: apic_get_irq_delivered: returning coalescing 0
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0
   
   So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
   0 it means the interrupt was not delivered.
 
  That seems strange. I changed the program so that the handler gets
  executed, also output a dot to serial from the handler. I changed the
  frequency to 2Hz.
 
  Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
  With -rtc-td-hack, the dots come out much faster. I added
  DEBUG_COALESCING also to RTC, with that enabled I get:
  qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack 
  -serial stdio
  cmos: coalesced irqs scaled to 0
  cmos: coalesced irqs increased to 1
  cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  ..cmos: injecting from timer
  .cmos: coalesced irqs increased to 2
  cmos: injecting on ack
 
  So, there are bogus injections.
 
  Looks like irr in apic is never cleared. Probably bug in userspace apic
  emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
  or use qemu-kvm with in kernel irq chip.

 With APIC you mean Fixed? Then the IRQ is not delivered at all.
 You need to deliver it through IOAPIC.

In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
left to reset state and PIC is used.

With the PIC version, there are bogus injections. With IOAPIC and APIC
dots come at 2Hz. -enable-kvm has no effect.


coalescing.S
Description: Binary data


coalescing_apic.bin.bz2
Description: BZip2 compressed data


coalescing_pic.bin.bz2
Description: BZip2 compressed data


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-06-01 Thread Gleb Natapov
On Tue, Jun 01, 2010 at 06:00:20PM +, Blue Swirl wrote:
   Looks like irr in apic is never cleared. Probably bug in userspace apic
   emulation. I'll look into it. Try to route interrupt via APIC (not 
   ExtInt),
   or use qemu-kvm with in kernel irq chip.
 
  With APIC you mean Fixed? Then the IRQ is not delivered at all.
  You need to deliver it through IOAPIC.
 
 In this version, when USE_APIC is defined, IRQs are routed via IOAPIC
 and APIC, PIC interrupts are disabled. Otherwise, IOAPIC and APIC are
 left to reset state and PIC is used.
 
 With the PIC version, there are bogus injections. With IOAPIC and APIC
 dots come at 2Hz. -enable-kvm has no effect.

I looked into this briefly. Virtual wire case is not handled correctly,
and pic lacks coalescing detection at all (the reason BTW because proper
solution was rejected, so only hack required by Windows was added).
Windows routes RTC interrupts through IOAPIC and this case works as you
can see. -enable-kvm on qemu should not make any difference since it
uses the same userspace pic/ioapic/apic code as TCG. It would be
interesting to try with qemu-kvm which uses in-kernel irq chips by
default.

BTW I managed to compile you test program with gcc 4.2. 4.4 and 4.3
produced non-working binaries for me.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-31 Thread Jan Kiszka
Blue Swirl wrote:
 On Sun, May 30, 2010 at 12:24 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 Linux don't use RTC as time source and I don't know about BSD, so no
 Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
 heavy load on the host. You can run video inside the gust to trigger
 coalescing more easily.
 I don't have Windows XP, sorry.
 ReactOS [1], at least its 32-bit version, appears to use the RTC as well.
 
 I tried LiveCD and QEMU versions, both seem to hang at boot. Is that expected?

Live-CD works for me (in contrast to the pre-installed image in KVM
mode), but it doesn't show the desired RTC usage.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Gleb Natapov
On Sat, May 29, 2010 at 09:21:14PM +, Blue Swirl wrote:
 On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov g...@redhat.com wrote:
  On Sat, May 29, 2010 at 04:13:22PM +, Blue Swirl wrote:
  On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov g...@redhat.com wrote:
   On Sat, May 29, 2010 at 09:35:30AM +, Blue Swirl wrote:
I still don't see how the alternative is supposed to simplify our life
or improve the efficiency of the de-coalescing workaround. It's rather
problematic like Gleb pointed out: The de-coalescing logic needs to be
informed about periodicity changes that can only be delivered along
IRQs. So what to do with the backlog when the timer is stopped?
  
   What happens with the current design? Gleb only mentioned the
   frequency change, I thought that was not so big problem. But I don't
   think this case should be allowed happen at all, it can't exist on
   real HW.
  
   Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
   inside QEMU, connect with gdb to QEMU process and check what frequency
   RTC configured with (hint: it will be 64Hz), now run video inside the
   guest and check frequency again (hint: it will be 1Khz).
 
  You missed the key word 'stopped'. If the timer is really stopped, no
  IRQs should ever come out afterwards, just like on real HW. For the
  emulation, this means loss of ticks which should have been delivered
  before the change.
 
  I haven't missed it. I describe to you reality of the situation. You want
  to change reality to be more close to what you want it to be by adding
  words to my description.
 
 Quoting Jan: 'So what to do with the backlog when the timer is
 stopped?' I didn't add any words to your description, please be more
 careful with your attributions. Why do you think I want to change the
 reality?
Please refer to my words when you answer to my quote. You quoted my
answer to you statement:
 Gleb only mentioned the frequency change, I thought that was not so big
 problem. But I don't think this case should be allowed happen at all,
 it can't exist on real HW.
No 'stopped' was under discussion nowhere. FWIW 'stopped' is just a case
of frequency change.

 
 XP frequency change isn't the same case as timer being stopped.
 
And what is the big difference exactly?

  Please just go write code, experiment, debug
  and _then_ come here with design.
 
 I added some debugging to RTC, PIC and APIC. I also built a small
 guest in x86 assembly to test the coalescing. However, in the tests
 with this guest and others I noticed that the coalescing only happens
 in some obscure conditions.
So try with real guest and with real load.

 
 By default the APIC's delivery method for IRQs is ExtInt and
 coalescing counting happens only with Fixed. This means that the guest
 needs to reprogram APIC. It also looks like RTC interrupts need to be
 triggered. But I didn't see both of these to happen simultaneously in
 my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
 must be used and I also disabled HPET to be sure that RTC would be
 used.
 
 With DEBUG_COALESCING enabled, I just get increasing numbers for
 apic_irq_delivered:
 apic: apic_set_irq: coalescing 67123
 apic: apic_set_irq: coalescing 67124
 apic: apic_set_irq: coalescing 67125
So have you actually used -rtc-td-hack option? I compiled head of
qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
and I get:
apic: apic_reset_irq_delivered: old coalescing 3
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_set_irq: coalescing 2
apic: apic_set_irq: coalescing 3
apic: apic_set_irq: coalescing 4
apic: apic_set_irq: coalescing 5
apic: apic_set_irq: coalescing 6
apic: apic_reset_irq_delivered: old coalescing 6
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1

 
 If the hack were active, the numbers would be close to zero (or at
 least some point) because apic_reset_irq_delivered would be called,
 but this does not happen. Could you specify a clear test case with
 which the coalescing action could be tested? Linux or BSD based,
 please.
Linux don't use RTC as time source and I don't know about BSD, so no
Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
heavy load on the host. You can run video inside the gust to trigger
coalescing more easily.

 
  But what if the guest changed the frequency very often, and between
  changes used zero value, like 64Hz - 0Hz - 128Hz - 0Hz - 64Hz?
  Too bad, the world is not perfect.
 
  --
                         Gleb.
 

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
2010/5/30 Gleb Natapov g...@redhat.com:
 On Sat, May 29, 2010 at 08:52:34PM +, Blue Swirl wrote:
 On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov g...@redhat.com wrote:
  On Sat, May 29, 2010 at 04:03:22PM +, Blue Swirl wrote:
  2010/5/29 Gleb Natapov g...@redhat.com:
   On Sat, May 29, 2010 at 09:15:11AM +, Blue Swirl wrote:
There is no code, because we're still at architecture design stage.
   
Try to write test code to understand the problem better.
  
   I will.
  
   Please do ASAP. This discussion shows that you don't understand what is 
   the
   problem that we are dialing with.
 
  Which part of the problem you think I don't understand?
 
  It seams to me you don't understand how Windows uses RTC for time
  keeping and how the QEMU solves the problem today.

 RTC causes periodic interrupts and Windows interrupt handler
 increments jiffies, like Linux?

 Linux does much more complicated things than that to keep time, so the
 only way to fix time drift in Linux was to introduce pvclock. For Window
 it is not so accurate too, since Windows can change clock frequency any
 time it can't calculate time from jiffies, it needs to update clock at
 each time tick.

  guests could also be assisted with special handling (like 
  win2k
  install hack), for example guest instructions could be counted
  (approximately, for example using TB size or TSC) and only 
  inject
  after at least N instructions have passed.
  Guest instructions cannot be easily counted in KVM (it can be 
  done more
  or less reliably using perf counters, may be).

 Aren't there any debug registers or perf counters, which can 
 generate
 an interrupt after some number of instructions have been 
 executed?
 Don't think debug registers have something like that and they are
 available for guest use anyway. Perf counters differs greatly 
 from CPU
 to CPU (even between two CPUs of the same manufacturer), and we 
 want to
 keep using them for profiling guests. And I don't see what 
 problem it
 will solve anyway that can be solved by simple delay between irq
 reinjection.
   
This would allow counting the executed instructions and limit it. 
Thus
we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
   
Why would you want to limit number of instruction executed by guest 
if
CPU has nothing else to do anyway? The problem occurs not when we 
have
spare cycles so give to a guest, but in opposite case.
  
   I think one problem is that the guest has executed too much compared
   to what would happen with real HW with a lesser CPU. That explains the
   RTC frequency reprogramming case.
   You think wrong. The problem is exactly opposite: the guest haven't
   had enough execution time between two time interrupts. I don't know what
   RTC frequency reprogramming case you are talking about here.
 
  The case you told me where N pending tick IRQs exist but the guest
  wants to change the RTC frequency from 64Hz to 1024Hz.
 
  Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
  10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
  the frequency to 1000Hz.
 
  The problem for emulation is that for the same 3 ticks, there has been
  so little execution power that the ticks have been coalesced. But
  isn't the guest cycle count then much lower than 30Mcyc?
 
  Isn't it so that the guest must be above 30Mcyc to be able to want the
  change? But if we reach that point,  the problem must have not been
  too little execution time, but too much.
 
  Sorry I tried hard to understand what have you said above but failed.
  What do you mean to be able to want the change? Guest sometimes wants
  to get 64 timer interrupts per second and sometimes it wants to get 1024
  timer interrupt per second. It wants it not as a result of time drift or
  anything. It's just how guest behaves. You seams to be to fixated on
  guest frequency change. It's just something you have to take into
  account when you reinject interrupts.

 I meant that in the scenario, the guest won't change the RTC before
 30Mcyc because of some built in determinism in the guest. At that
 point, because of some reason, the change would happen.

 I still don't understand what are you trying to say here. Guest changes
 frequency because of some even in the guest. It is totally independent
 of what happens in QEMUs RTC emulation.

I'm trying to understand the order of events. In the scenario, the
order of events on real HW would be:
10Mcyc: tick IRQ 1
20Mcyc: tick IRQ 2
30Mcyc: tick IRQ 3
30Mcyc: reprogram timer
31Mcyc: tick IRQ 4
32Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

With QEMU, the order could become:
30Mcyc: reprogram timer
30.5Mcyc: tick IRQ 1
31Mcyc: tick IRQ 2
31.5Mcyc: tick IRQ 3
32Mcyc: tick IRQ 4
32.5Mcyc: tick IRQ 5
33Mcyc: tick IRQ 6
34Mcyc: tick IRQ 7

Correct?


 
  
  
   

 
   

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Avi Kivity

On 05/28/2010 01:19 AM, Jan Kiszka wrote:


Still, this does not answer:

- How do you want to detect lost timer ticks?

   


Your (and Gleb's) approach: during injection
An alternative: during guest ack

The normal pattern is inj/ack/inj/ack; if we see inj/inj/inj we know the 
guest isn't keeping up.



- What subsystem(s) keeps track of the backlog?
   


Clearly, the clock chip (hpet/rtc/pit).

With the alternative approach, the clock emulation requests that acks be 
reported via a qemu_irq interface.



- And depending on the above: How to detect at all that a specific IRQ
   is a timer tick?
   


Clearly, a blocker unless the clock is responsible for timekeeping.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Avi Kivity

On 05/29/2010 12:15 PM, Blue Swirl wrote:



This would allow counting the executed instructions and limit it. Thus
we could emulate a 500MHz CPU on a 2GHz CPU more accurately.

   

Why would you want to limit number of instruction executed by guest if
CPU has nothing else to do anyway? The problem occurs not when we have
spare cycles so give to a guest, but in opposite case.
 

I think one problem is that the guest has executed too much compared
to what would happen with real HW with a lesser CPU. That explains the
RTC frequency reprogramming case.
   


The root cause is that while qemu is scheduled out, the real time clock 
keeps ticking.  Since we can't stop real time, we must compensate in 
other ways.



So write the code and show us. You haven't show any evidence that RTC is
the wrong place. RTC knows when interrupt was acknowledge to RTC, it
know when clock frequency changes, it know when device reset happened.
APIC knows only that interrupt was coalesced. It doesn't even know that
it may be masked by a guest in IOAPIC (interrupts delivered while they
are masked not considered coalesced).
 

Oh, I thought interrupt masking was the reason for coalescing! What
exactly is the reason then?
   


The above.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
2010/5/30 Gleb Natapov g...@redhat.com:
 On Sat, May 29, 2010 at 09:21:14PM +, Blue Swirl wrote:
 On Sat, May 29, 2010 at 4:37 PM, Gleb Natapov g...@redhat.com wrote:
  On Sat, May 29, 2010 at 04:13:22PM +, Blue Swirl wrote:
  On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov g...@redhat.com wrote:
   On Sat, May 29, 2010 at 09:35:30AM +, Blue Swirl wrote:
I still don't see how the alternative is supposed to simplify our 
life
or improve the efficiency of the de-coalescing workaround. It's 
rather
problematic like Gleb pointed out: The de-coalescing logic needs to 
be
informed about periodicity changes that can only be delivered along
IRQs. So what to do with the backlog when the timer is stopped?
  
   What happens with the current design? Gleb only mentioned the
   frequency change, I thought that was not so big problem. But I don't
   think this case should be allowed happen at all, it can't exist on
   real HW.
  
   Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
   inside QEMU, connect with gdb to QEMU process and check what frequency
   RTC configured with (hint: it will be 64Hz), now run video inside the
   guest and check frequency again (hint: it will be 1Khz).
 
  You missed the key word 'stopped'. If the timer is really stopped, no
  IRQs should ever come out afterwards, just like on real HW. For the
  emulation, this means loss of ticks which should have been delivered
  before the change.
 
  I haven't missed it. I describe to you reality of the situation. You want
  to change reality to be more close to what you want it to be by adding
  words to my description.

 Quoting Jan: 'So what to do with the backlog when the timer is
 stopped?' I didn't add any words to your description, please be more
 careful with your attributions. Why do you think I want to change the
 reality?
 Please refer to my words when you answer to my quote. You quoted my
 answer to you statement:
  Gleb only mentioned the frequency change, I thought that was not so big
  problem. But I don't think this case should be allowed happen at all,
  it can't exist on real HW.

With 'this case' I was referring to 'case with timer stopped', not
'case which Gleb mentioned'.

 No 'stopped' was under discussion nowhere.

It's clearly written there in the sentence Jan wrote.

 FWIW 'stopped' is just a case
 of frequency change.

True.



 XP frequency change isn't the same case as timer being stopped.

 And what is the big difference exactly?

Because after the timer is stopped, its extremely unrealistic to send
any IRQs. Whereas if the frequency is changed to some other nonzero
value, we can cheat and inject some more queued IRQs.

Anyway, if this case is not interesting because it doesn't happen in
real life emulation scenarios, we can forget it no matter how buggy
the current QEMU implementation is.

  Please just go write code, experiment, debug
  and _then_ come here with design.

 I added some debugging to RTC, PIC and APIC. I also built a small
 guest in x86 assembly to test the coalescing. However, in the tests
 with this guest and others I noticed that the coalescing only happens
 in some obscure conditions.
 So try with real guest and with real load.

Well, I'd like to get the test program also trigger it. Now I'm getting:
apic: write: 0350 = 
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_reset_irq_delivered: old coalescing 1
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0
apic: apic_get_irq_delivered: returning coalescing 0
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0

It looks like some other IRQs cause the coalescing, because also
looking at RTC code, it seems it's not possible for RTC to raise the
IRQ (except update IRQ, alarm etc.) without calling
apic_reset_irq_delivered().

I've attached my test program. Compile:
gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
coalescing.ld -g  objcopy -Obinary coalescing coalescing.bin

Run:
qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack


 By default the APIC's delivery method for IRQs is ExtInt and
 coalescing counting happens only with Fixed. This means that the guest
 needs to reprogram APIC. It also looks like RTC interrupts need to be
 triggered. But I didn't see both of these to happen simultaneously in
 my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
 must be used and I also disabled HPET to be sure that RTC would be
 used.

 With DEBUG_COALESCING enabled, I just get increasing numbers for
 apic_irq_delivered:
 apic: apic_set_irq: coalescing 67123
 apic: apic_set_irq: coalescing 67124
 apic: apic_set_irq: coalescing 67125
 So have you actually used -rtc-td-hack option? I compiled head of

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Jan Kiszka
Blue Swirl wrote:
 Linux don't use RTC as time source and I don't know about BSD, so no
 Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
 heavy load on the host. You can run video inside the gust to trigger
 coalescing more easily.
 
 I don't have Windows XP, sorry.

ReactOS [1], at least its 32-bit version, appears to use the RTC as well.

Jan

[1] http://www.reactos.org/de/download.html




signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Gleb Natapov
On Sun, May 30, 2010 at 12:10:16PM +, Blue Swirl wrote:
   You missed the key word 'stopped'. If the timer is really stopped, no
   IRQs should ever come out afterwards, just like on real HW. For the
   emulation, this means loss of ticks which should have been delivered
   before the change.
  
   I haven't missed it. I describe to you reality of the situation. You want
   to change reality to be more close to what you want it to be by adding
   words to my description.
 
  Quoting Jan: 'So what to do with the backlog when the timer is
  stopped?' I didn't add any words to your description, please be more
  careful with your attributions. Why do you think I want to change the
  reality?
  Please refer to my words when you answer to my quote. You quoted my
  answer to you statement:
   Gleb only mentioned the frequency change, I thought that was not so big
   problem. But I don't think this case should be allowed happen at all,
   it can't exist on real HW.
 
 With 'this case' I was referring to 'case with timer stopped', not
 'case which Gleb mentioned'.
 
  No 'stopped' was under discussion nowhere.
 
 It's clearly written there in the sentence Jan wrote.
 
Jan, not me, but lets leave this topic alone since you agree that
stopped is just a case of frequency change anyway.

  FWIW 'stopped' is just a case
  of frequency change.
 
 True.
 
 
 
  XP frequency change isn't the same case as timer being stopped.
 
  And what is the big difference exactly?
 
 Because after the timer is stopped, its extremely unrealistic to send
 any IRQs. Whereas if the frequency is changed to some other nonzero
 value, we can cheat and inject some more queued IRQs.
 
Correct, when gets disables clock source (by reset or any other means)
coalesced backlog should be forgotten.

 Anyway, if this case is not interesting because it doesn't happen in
 real life emulation scenarios, we can forget it no matter how buggy
 the current QEMU implementation is.
 
   Please just go write code, experiment, debug
   and _then_ come here with design.
 
  I added some debugging to RTC, PIC and APIC. I also built a small
  guest in x86 assembly to test the coalescing. However, in the tests
  with this guest and others I noticed that the coalescing only happens
  in some obscure conditions.
  So try with real guest and with real load.
 
 Well, I'd like to get the test program also trigger it. Now I'm getting:
 apic: write: 0350 = 
 apic: apic_reset_irq_delivered: old coalescing 0
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 1
 apic: apic_get_irq_delivered: returning coalescing 1
 apic: apic_reset_irq_delivered: old coalescing 1
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 0
 apic: apic_get_irq_delivered: returning coalescing 0
 apic: apic_reset_irq_delivered: old coalescing 0
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 0
 
 It looks like some other IRQs cause the coalescing, because also
 looking at RTC code, it seems it's not possible for RTC to raise the
 IRQ (except update IRQ, alarm etc.) without calling
 apic_reset_irq_delivered().
 
 I've attached my test program. Compile:
 gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
 coalescing.ld -g  objcopy -Obinary coalescing coalescing.bin
 
 Run:
 qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
 
The application does not work for me. Looks like it fails to enter
protected mode. $pc jumps from 0xfff0 to 0x000f003e
and back.

 
  By default the APIC's delivery method for IRQs is ExtInt and
  coalescing counting happens only with Fixed. This means that the guest
  needs to reprogram APIC. It also looks like RTC interrupts need to be
  triggered. But I didn't see both of these to happen simultaneously in
  my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
  must be used and I also disabled HPET to be sure that RTC would be
  used.
 
  With DEBUG_COALESCING enabled, I just get increasing numbers for
  apic_irq_delivered:
  apic: apic_set_irq: coalescing 67123
  apic: apic_set_irq: coalescing 67124
  apic: apic_set_irq: coalescing 67125
  So have you actually used -rtc-td-hack option? I compiled head of
  qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
  and I get:
  apic: apic_reset_irq_delivered: old coalescing 3
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
  apic: apic_set_irq: coalescing 2
  apic: apic_set_irq: coalescing 3
  apic: apic_set_irq: coalescing 4
  apic: apic_set_irq: coalescing 5
  apic: apic_set_irq: coalescing 6
  apic: apic_reset_irq_delivered: old coalescing 6
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
 
 
  If the hack were active, the numbers would be close to zero (or at
  least some point) because apic_reset_irq_delivered would be called,
  but this does not happen. 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
2010/5/30 Gleb Natapov g...@redhat.com:
 On Sun, May 30, 2010 at 12:10:16PM +, Blue Swirl wrote:
   You missed the key word 'stopped'. If the timer is really stopped, no
   IRQs should ever come out afterwards, just like on real HW. For the
   emulation, this means loss of ticks which should have been delivered
   before the change.
  
   I haven't missed it. I describe to you reality of the situation. You 
   want
   to change reality to be more close to what you want it to be by adding
   words to my description.
 
  Quoting Jan: 'So what to do with the backlog when the timer is
  stopped?' I didn't add any words to your description, please be more
  careful with your attributions. Why do you think I want to change the
  reality?
  Please refer to my words when you answer to my quote. You quoted my
  answer to you statement:
   Gleb only mentioned the frequency change, I thought that was not so big
   problem. But I don't think this case should be allowed happen at all,
   it can't exist on real HW.

 With 'this case' I was referring to 'case with timer stopped', not
 'case which Gleb mentioned'.

  No 'stopped' was under discussion nowhere.

 It's clearly written there in the sentence Jan wrote.

 Jan, not me, but lets leave this topic alone since you agree that
 stopped is just a case of frequency change anyway.

  FWIW 'stopped' is just a case
  of frequency change.

 True.

 
 
  XP frequency change isn't the same case as timer being stopped.
 
  And what is the big difference exactly?

 Because after the timer is stopped, its extremely unrealistic to send
 any IRQs. Whereas if the frequency is changed to some other nonzero
 value, we can cheat and inject some more queued IRQs.

 Correct, when gets disables clock source (by reset or any other means)
 coalesced backlog should be forgotten.

 Anyway, if this case is not interesting because it doesn't happen in
 real life emulation scenarios, we can forget it no matter how buggy
 the current QEMU implementation is.

   Please just go write code, experiment, debug
   and _then_ come here with design.
 
  I added some debugging to RTC, PIC and APIC. I also built a small
  guest in x86 assembly to test the coalescing. However, in the tests
  with this guest and others I noticed that the coalescing only happens
  in some obscure conditions.
  So try with real guest and with real load.

 Well, I'd like to get the test program also trigger it. Now I'm getting:
 apic: write: 0350 = 
 apic: apic_reset_irq_delivered: old coalescing 0
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 1
 apic: apic_get_irq_delivered: returning coalescing 1
 apic: apic_reset_irq_delivered: old coalescing 1
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 0
 apic: apic_get_irq_delivered: returning coalescing 0
 apic: apic_reset_irq_delivered: old coalescing 0
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 0

 It looks like some other IRQs cause the coalescing, because also
 looking at RTC code, it seems it's not possible for RTC to raise the
 IRQ (except update IRQ, alarm etc.) without calling
 apic_reset_irq_delivered().

 I've attached my test program. Compile:
 gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
 coalescing.ld -g  objcopy -Obinary coalescing coalescing.bin

 Run:
 qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack

 The application does not work for me. Looks like it fails to enter
 protected mode. $pc jumps from 0xfff0 to 0x000f003e
 and back.

Strange. Here's a working binary.


 
  By default the APIC's delivery method for IRQs is ExtInt and
  coalescing counting happens only with Fixed. This means that the guest
  needs to reprogram APIC. It also looks like RTC interrupts need to be
  triggered. But I didn't see both of these to happen simultaneously in
  my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
  must be used and I also disabled HPET to be sure that RTC would be
  used.
 
  With DEBUG_COALESCING enabled, I just get increasing numbers for
  apic_irq_delivered:
  apic: apic_set_irq: coalescing 67123
  apic: apic_set_irq: coalescing 67124
  apic: apic_set_irq: coalescing 67125
  So have you actually used -rtc-td-hack option? I compiled head of
  qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
  and I get:
  apic: apic_reset_irq_delivered: old coalescing 3
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
  apic: apic_set_irq: coalescing 2
  apic: apic_set_irq: coalescing 3
  apic: apic_set_irq: coalescing 4
  apic: apic_set_irq: coalescing 5
  apic: apic_set_irq: coalescing 6
  apic: apic_reset_irq_delivered: old coalescing 6
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
 
 
  If the hack were active, the numbers would be close to zero (or at
  least some point) 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
On Sun, May 30, 2010 at 12:24 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 Linux don't use RTC as time source and I don't know about BSD, so no
 Linux or BSD test case for you, sorry. Run WindowXP standard HAL and put
 heavy load on the host. You can run video inside the gust to trigger
 coalescing more easily.

 I don't have Windows XP, sorry.

 ReactOS [1], at least its 32-bit version, appears to use the RTC as well.

I tried LiveCD and QEMU versions, both seem to hang at boot. Is that expected?



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
2010/5/30 Gleb Natapov g...@redhat.com:
 On Sun, May 30, 2010 at 12:10:16PM +, Blue Swirl wrote:
   You missed the key word 'stopped'. If the timer is really stopped, no
   IRQs should ever come out afterwards, just like on real HW. For the
   emulation, this means loss of ticks which should have been delivered
   before the change.
  
   I haven't missed it. I describe to you reality of the situation. You 
   want
   to change reality to be more close to what you want it to be by adding
   words to my description.
 
  Quoting Jan: 'So what to do with the backlog when the timer is
  stopped?' I didn't add any words to your description, please be more
  careful with your attributions. Why do you think I want to change the
  reality?
  Please refer to my words when you answer to my quote. You quoted my
  answer to you statement:
   Gleb only mentioned the frequency change, I thought that was not so big
   problem. But I don't think this case should be allowed happen at all,
   it can't exist on real HW.

 With 'this case' I was referring to 'case with timer stopped', not
 'case which Gleb mentioned'.

  No 'stopped' was under discussion nowhere.

 It's clearly written there in the sentence Jan wrote.

 Jan, not me, but lets leave this topic alone since you agree that
 stopped is just a case of frequency change anyway.

  FWIW 'stopped' is just a case
  of frequency change.

 True.

 
 
  XP frequency change isn't the same case as timer being stopped.
 
  And what is the big difference exactly?

 Because after the timer is stopped, its extremely unrealistic to send
 any IRQs. Whereas if the frequency is changed to some other nonzero
 value, we can cheat and inject some more queued IRQs.

 Correct, when gets disables clock source (by reset or any other means)
 coalesced backlog should be forgotten.

 Anyway, if this case is not interesting because it doesn't happen in
 real life emulation scenarios, we can forget it no matter how buggy
 the current QEMU implementation is.

   Please just go write code, experiment, debug
   and _then_ come here with design.
 
  I added some debugging to RTC, PIC and APIC. I also built a small
  guest in x86 assembly to test the coalescing. However, in the tests
  with this guest and others I noticed that the coalescing only happens
  in some obscure conditions.
  So try with real guest and with real load.

 Well, I'd like to get the test program also trigger it. Now I'm getting:
 apic: write: 0350 = 
 apic: apic_reset_irq_delivered: old coalescing 0
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 1
 apic: apic_get_irq_delivered: returning coalescing 1
 apic: apic_reset_irq_delivered: old coalescing 1
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 0
 apic: apic_get_irq_delivered: returning coalescing 0
 apic: apic_reset_irq_delivered: old coalescing 0
 apic: apic_local_deliver: vector 3 delivery mode 0
 apic: apic_set_irq: coalescing 0

 It looks like some other IRQs cause the coalescing, because also
 looking at RTC code, it seems it's not possible for RTC to raise the
 IRQ (except update IRQ, alarm etc.) without calling
 apic_reset_irq_delivered().

 I've attached my test program. Compile:
 gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
 coalescing.ld -g  objcopy -Obinary coalescing coalescing.bin

 Run:
 qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack

 The application does not work for me. Looks like it fails to enter
 protected mode. $pc jumps from 0xfff0 to 0x000f003e
 and back.

 
  By default the APIC's delivery method for IRQs is ExtInt and
  coalescing counting happens only with Fixed. This means that the guest
  needs to reprogram APIC. It also looks like RTC interrupts need to be
  triggered. But I didn't see both of these to happen simultaneously in
  my tests with Linux and Windows guests. Of course, -rtc-td-hack flag
  must be used and I also disabled HPET to be sure that RTC would be
  used.
 
  With DEBUG_COALESCING enabled, I just get increasing numbers for
  apic_irq_delivered:
  apic: apic_set_irq: coalescing 67123
  apic: apic_set_irq: coalescing 67124
  apic: apic_set_irq: coalescing 67125
  So have you actually used -rtc-td-hack option? I compiled head of
  qemu.git with DEBUG_COALESCING and run WindowsXP guest with -rtc-td-hack
  and I get:
  apic: apic_reset_irq_delivered: old coalescing 3
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
  apic: apic_set_irq: coalescing 2
  apic: apic_set_irq: coalescing 3
  apic: apic_set_irq: coalescing 4
  apic: apic_set_irq: coalescing 5
  apic: apic_set_irq: coalescing 6
  apic: apic_reset_irq_delivered: old coalescing 6
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
 
 
  If the hack were active, the numbers would be close to zero (or at
  least some point) because apic_reset_irq_delivered would be 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Gleb Natapov
On Sun, May 30, 2010 at 12:56:26PM +, Blue Swirl wrote:
  Well, I'd like to get the test program also trigger it. Now I'm getting:
  apic: write: 0350 = 
  apic: apic_reset_irq_delivered: old coalescing 0
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
  apic: apic_reset_irq_delivered: old coalescing 1
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 0
  apic: apic_get_irq_delivered: returning coalescing 0
  apic: apic_reset_irq_delivered: old coalescing 0
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 0
 
So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
0 it means the interrupt was not delivered.

  It looks like some other IRQs cause the coalescing, because also
  looking at RTC code, it seems it's not possible for RTC to raise the
  IRQ (except update IRQ, alarm etc.) without calling
  apic_reset_irq_delivered().
 
  I've attached my test program. Compile:
  gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
  coalescing.ld -g  objcopy -Obinary coalescing coalescing.bin
 
  Run:
  qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
 
  The application does not work for me. Looks like it fails to enter
  protected mode. $pc jumps from 0xfff0 to 0x000f003e
  and back.
 
 Strange. Here's a working binary.
 
Your binary works here too. What compiler are you using?

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov g...@redhat.com wrote:
 On Sun, May 30, 2010 at 12:56:26PM +, Blue Swirl wrote:
  Well, I'd like to get the test program also trigger it. Now I'm getting:
  apic: write: 0350 = 
  apic: apic_reset_irq_delivered: old coalescing 0
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
  apic: apic_reset_irq_delivered: old coalescing 1
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 0
  apic: apic_get_irq_delivered: returning coalescing 0
  apic: apic_reset_irq_delivered: old coalescing 0
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 0
 
 So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
 0 it means the interrupt was not delivered.

  It looks like some other IRQs cause the coalescing, because also
  looking at RTC code, it seems it's not possible for RTC to raise the
  IRQ (except update IRQ, alarm etc.) without calling
  apic_reset_irq_delivered().
 
  I've attached my test program. Compile:
  gcc -m32 -o coalescing coalescing.S -ffreestanding -nostdlib -Wl,-T
  coalescing.ld -g  objcopy -Obinary coalescing coalescing.bin
 
  Run:
  qemu -L . -bios coalescing.bin -no-hpet -rtc-td-hack
 
  The application does not work for me. Looks like it fails to enter
  protected mode. $pc jumps from 0xfff0 to 0x000f003e
  and back.

 Strange. Here's a working binary.

 Your binary works here too. What compiler are you using?

Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian
4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-cld --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov g...@redhat.com wrote:
 On Sun, May 30, 2010 at 12:56:26PM +, Blue Swirl wrote:
  Well, I'd like to get the test program also trigger it. Now I'm getting:
  apic: write: 0350 = 
  apic: apic_reset_irq_delivered: old coalescing 0
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 1
  apic: apic_get_irq_delivered: returning coalescing 1
  apic: apic_reset_irq_delivered: old coalescing 1
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 0
  apic: apic_get_irq_delivered: returning coalescing 0
  apic: apic_reset_irq_delivered: old coalescing 0
  apic: apic_local_deliver: vector 3 delivery mode 0
  apic: apic_set_irq: coalescing 0
 
 So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
 0 it means the interrupt was not delivered.

That seems strange. I changed the program so that the handler gets
executed, also output a dot to serial from the handler. I changed the
frequency to 2Hz.

Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
With -rtc-td-hack, the dots come out much faster. I added
DEBUG_COALESCING also to RTC, with that enabled I get:
qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial stdio
cmos: coalesced irqs scaled to 0
cmos: coalesced irqs increased to 1
cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
.cmos: injecting on ack
..cmos: injecting from timer
.cmos: coalesced irqs increased to 2
cmos: injecting on ack

So, there are bogus injections.


coalescing.S
Description: Binary data


coalescing.bin.bz2
Description: BZip2 compressed data


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Gleb Natapov
On Sun, May 30, 2010 at 07:37:59PM +, Blue Swirl wrote:
 On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov g...@redhat.com wrote:
  On Sun, May 30, 2010 at 12:56:26PM +, Blue Swirl wrote:
   Well, I'd like to get the test program also trigger it. Now I'm getting:
   apic: write: 0350 = 
   apic: apic_reset_irq_delivered: old coalescing 0
   apic: apic_local_deliver: vector 3 delivery mode 0
   apic: apic_set_irq: coalescing 1
   apic: apic_get_irq_delivered: returning coalescing 1
   apic: apic_reset_irq_delivered: old coalescing 1
   apic: apic_local_deliver: vector 3 delivery mode 0
   apic: apic_set_irq: coalescing 0
   apic: apic_get_irq_delivered: returning coalescing 0
   apic: apic_reset_irq_delivered: old coalescing 0
   apic: apic_local_deliver: vector 3 delivery mode 0
   apic: apic_set_irq: coalescing 0
  
  So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
  0 it means the interrupt was not delivered.
 
 That seems strange. I changed the program so that the handler gets
 executed, also output a dot to serial from the handler. I changed the
 frequency to 2Hz.
 
 Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
 With -rtc-td-hack, the dots come out much faster. I added
 DEBUG_COALESCING also to RTC, with that enabled I get:
 qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial 
 stdio
 cmos: coalesced irqs scaled to 0
 cmos: coalesced irqs increased to 1
 cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 ..cmos: injecting from timer
 .cmos: coalesced irqs increased to 2
 cmos: injecting on ack
 
 So, there are bogus injections.

Looks like irr in apic is never cleared. Probably bug in userspace apic
emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
or use qemu-kvm with in kernel irq chip.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Blue Swirl
On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov g...@redhat.com wrote:
 On Sun, May 30, 2010 at 07:37:59PM +, Blue Swirl wrote:
 On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov g...@redhat.com wrote:
  On Sun, May 30, 2010 at 12:56:26PM +, Blue Swirl wrote:
   Well, I'd like to get the test program also trigger it. Now I'm 
   getting:
   apic: write: 0350 = 
   apic: apic_reset_irq_delivered: old coalescing 0
   apic: apic_local_deliver: vector 3 delivery mode 0
   apic: apic_set_irq: coalescing 1
   apic: apic_get_irq_delivered: returning coalescing 1
   apic: apic_reset_irq_delivered: old coalescing 1
   apic: apic_local_deliver: vector 3 delivery mode 0
   apic: apic_set_irq: coalescing 0
   apic: apic_get_irq_delivered: returning coalescing 0
   apic: apic_reset_irq_delivered: old coalescing 0
   apic: apic_local_deliver: vector 3 delivery mode 0
   apic: apic_set_irq: coalescing 0
  
  So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
  0 it means the interrupt was not delivered.

 That seems strange. I changed the program so that the handler gets
 executed, also output a dot to serial from the handler. I changed the
 frequency to 2Hz.

 Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
 With -rtc-td-hack, the dots come out much faster. I added
 DEBUG_COALESCING also to RTC, with that enabled I get:
 qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial 
 stdio
 cmos: coalesced irqs scaled to 0
 cmos: coalesced irqs increased to 1
 cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 .cmos: injecting on ack
 ..cmos: injecting from timer
 .cmos: coalesced irqs increased to 2
 cmos: injecting on ack

 So, there are bogus injections.

 Looks like irr in apic is never cleared. Probably bug in userspace apic
 emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
 or use qemu-kvm with in kernel irq chip.

With APIC you mean Fixed? Then the IRQ is not delivered at all.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-30 Thread Gleb Natapov
On Sun, May 30, 2010 at 08:21:30PM +, Blue Swirl wrote:
 On Sun, May 30, 2010 at 8:07 PM, Gleb Natapov g...@redhat.com wrote:
  On Sun, May 30, 2010 at 07:37:59PM +, Blue Swirl wrote:
  On Sun, May 30, 2010 at 1:49 PM, Gleb Natapov g...@redhat.com wrote:
   On Sun, May 30, 2010 at 12:56:26PM +, Blue Swirl wrote:
Well, I'd like to get the test program also trigger it. Now I'm 
getting:
apic: write: 0350 = 
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 1
apic: apic_get_irq_delivered: returning coalescing 1
apic: apic_reset_irq_delivered: old coalescing 1
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0
apic: apic_get_irq_delivered: returning coalescing 0
apic: apic_reset_irq_delivered: old coalescing 0
apic: apic_local_deliver: vector 3 delivery mode 0
apic: apic_set_irq: coalescing 0
   
   So interrupt is _alway_ coalesced. If apic_get_irq_delivered() returns
   0 it means the interrupt was not delivered.
 
  That seems strange. I changed the program so that the handler gets
  executed, also output a dot to serial from the handler. I changed the
  frequency to 2Hz.
 
  Now, if I leave out -rtc-td-hack, I get the dots at 2Hz as expected.
  With -rtc-td-hack, the dots come out much faster. I added
  DEBUG_COALESCING also to RTC, with that enabled I get:
  qemu -L . -bios coalescing.bin -d in_asm,int -no-hpet -rtc-td-hack -serial 
  stdio
  cmos: coalesced irqs scaled to 0
  cmos: coalesced irqs increased to 1
  cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  .cmos: injecting on ack
  ..cmos: injecting from timer
  .cmos: coalesced irqs increased to 2
  cmos: injecting on ack
 
  So, there are bogus injections.
 
  Looks like irr in apic is never cleared. Probably bug in userspace apic
  emulation. I'll look into it. Try to route interrupt via APIC (not ExtInt),
  or use qemu-kvm with in kernel irq chip.
 
 With APIC you mean Fixed? Then the IRQ is not delivered at all.
You need to deliver it through IOAPIC.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Gleb Natapov
On Sat, May 29, 2010 at 09:15:11AM +, Blue Swirl wrote:
  There is no code, because we're still at architecture design stage.
 
  Try to write test code to understand the problem better.
 
 I will.
 
Please do ASAP. This discussion shows that you don't understand what is the
problem that we are dialing with.

guests could also be assisted with special handling (like win2k
install hack), for example guest instructions could be counted
(approximately, for example using TB size or TSC) and only inject
after at least N instructions have passed.
Guest instructions cannot be easily counted in KVM (it can be done 
more
or less reliably using perf counters, may be).
  
   Aren't there any debug registers or perf counters, which can generate
   an interrupt after some number of instructions have been executed?
   Don't think debug registers have something like that and they are
   available for guest use anyway. Perf counters differs greatly from CPU
   to CPU (even between two CPUs of the same manufacturer), and we want to
   keep using them for profiling guests. And I don't see what problem it
   will solve anyway that can be solved by simple delay between irq
   reinjection.
 
  This would allow counting the executed instructions and limit it. Thus
  we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
 
  Why would you want to limit number of instruction executed by guest if
  CPU has nothing else to do anyway? The problem occurs not when we have
  spare cycles so give to a guest, but in opposite case.
 
 I think one problem is that the guest has executed too much compared
 to what would happen with real HW with a lesser CPU. That explains the
 RTC frequency reprogramming case.
You think wrong. The problem is exactly opposite: the guest haven't
had enough execution time between two time interrupts. I don't know what
RTC frequency reprogramming case you are talking about here.

 
 
  
   
 And even if the rate did not matter, the APIC woult still have to 
 now
 about the fact that an IRQ is really periodic and does not only 
 appear
 as such for a certain interval. This really does not sound like
 simplifying things or even make them cleaner.
   
It would, the voodoo would be contained only in APIC, RTC would be
just like any other device. With the bidirectional irqs, this voodoo
would probably eventually spread to many other devices. The logical
conclusion of that would be a system where all devices would be
careful not to disturb the guest at wrong moment because that would
trigger a bug.
   
This voodoo will be so complex and unreliable that it will make RTC 
hack
pale in comparison (and I still don't see how you are going to make it
actually work).
  
   Implement everything inside APIC: only coalescing and reinjection.
   APIC has zero info needed to implement reinjection correctly as was
   shown to you several time in this thread and you simply keep ignoring
   it.
 
  On the contrary, APIC is actually the only source of the IRQ ack
  information. RTC hack would not work without APIC (or the
  bidirectional IRQ) passing this info to RTC.
 
  What APIC doesn't have now is the timer frequency or period info. This
  is known by RTC and also higher levels managing the clocks.
 
  So APIC has one bit of information and RTC everything else.
 
 The information known by RTC (timer period) is also known by higher levels.
 
What do you mean by higher level here? vl.c or APIC.

  The current
  approach (and proposed patch) brings this one bit of information to RTC,
  you are arguing that RTC should be able to communicate all its info to
  APIC. Sorry I don't see that your way has any advantage. Just more
  complex interface and it is much easier to get it wrong for other time
  sources.
 
 I don't think anymore that APIC should be handling this but the
 generic stuff, like vl.c or exec.c. Then there would be only
 information passing from APIC to higher levels.
Handling reinjection by general timer code makes infinitely more sense
then handling it in APIC. One thing (from the top of my head) that can't
be implemented at that level is injection of interrupt back to back (i.e
injecting next interrupt immediately after guest acknowledge previous
one to RTC).

 
  I keep ignoring the idea that the current model, where both RTC and
  APIC must somehow work together to make coalescing work, is the only
  possible just because it is committed and it happens to work in some
  cases. It would be much better to concentrate this to one place, APIC
  or preferably higher level where it may benefit other timers too.
  Provided of course that the other models can be made to work.
 
  So write the code and show us. You haven't show any evidence that RTC is
  the wrong place. RTC knows when interrupt was acknowledge to RTC, it
  know when clock frequency changes, it know when device reset happened.
  APIC knows only that 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Gleb Natapov
On Sat, May 29, 2010 at 09:35:30AM +, Blue Swirl wrote:
  I still don't see how the alternative is supposed to simplify our life
  or improve the efficiency of the de-coalescing workaround. It's rather
  problematic like Gleb pointed out: The de-coalescing logic needs to be
  informed about periodicity changes that can only be delivered along
  IRQs. So what to do with the backlog when the timer is stopped?
 
 What happens with the current design? Gleb only mentioned the
 frequency change, I thought that was not so big problem. But I don't
 think this case should be allowed happen at all, it can't exist on
 real HW.
 
Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
inside QEMU, connect with gdb to QEMU process and check what frequency
RTC configured with (hint: it will be 64Hz), now run video inside the
guest and check frequency again (hint: it will be 1Khz).

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Blue Swirl
On Sat, May 29, 2010 at 10:16 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 This is - according to my current understanding - the proposed
 alternative architecture:

                                          .---.
                                          | de-coalescing |
                                          |     logic     |
                                          '---'
                                                ^   |
                                         period,|   |IRQ
                                       coalesced|   |(or tuned VM clock)
                                        (yes/no)|   v
 .---.              ..             .---.
 |  RTC  |-IRQ-| router |-IRQ| APIC  |
 '---'              ''             '---'
    |                    ^    |                   ^
    |                    |    |                   |
    '---period---'    '--period---'
 The period information is already known by the higher level clock
 management,
 The timer device models program the next event of some qemu-timer. There
 is no tag attached explaining the timer subsystem or anyone else the
 reason behind this programming.

 Yes, but why would we care? All timers in the system besides RTC
 should be affected likewise, this would in fact be the benefit from
 handling the problem at higher level.

 And how does your equation for calculating the clock slow-down look like?

How about icount_adjust()?


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Jan Kiszka
Blue Swirl wrote:
 This is - according to my current understanding - the proposed
 alternative architecture:

  .---.
  | de-coalescing |
  | logic |
  '---'
^   |
 period,|   |IRQ
   coalesced|   |(or tuned VM clock)
(yes/no)|   v
 .---.  .. .---.
 |  RTC  |-IRQ-| router |-IRQ| APIC  |
 '---'  '' '---'
|^|   ^
|||   |
'---period---''--period---'
 The period information is already known by the higher level clock
 management,
 The timer device models program the next event of some qemu-timer. There
 is no tag attached explaining the timer subsystem or anyone else the
 reason behind this programming.
 
 Yes, but why would we care? All timers in the system besides RTC
 should be affected likewise, this would in fact be the benefit from
 handling the problem at higher level.

And how does your equation for calculating the clock slow-down look like?

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Jan Kiszka
Gleb Natapov wrote:
 On Fri, May 28, 2010 at 08:06:45PM +, Blue Swirl wrote:
 2010/5/28 Gleb Natapov g...@redhat.com:
 On Thu, May 27, 2010 at 06:37:10PM +, Blue Swirl wrote:
 2010/5/27 Gleb Natapov g...@redhat.com:
 On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
 On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
 Anthony Liguori wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:
 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  
 wrote:

 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during 
 delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all 
 IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it 
 can
 apply its workaround. If multiple sinks exist, the source may only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

 No real devices are interested whether any of their output lines are
 even connected. This would introduce a new signal type, 
 bidirectional
 multi-level, which is not correct.

 I don't think it's really an issue of correct, but I wouldn't 
 disagree
 to a suggestion that we ought to introduce a new signal type for this
 type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and 
 has a
 similar interface as qemu_irq.
 A separate type would complicate the delivery of the feedback value
 across GPIO pins (as Paul requested for the RTC-HPET routing).

 I think the real solution to coalescing is put the logic inside one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an 
 internal
 timer is installed which injects the lost irqs.
 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a 
 generic
 interface - nothing really won.
 OK, let's simplify: just reinject at next possible chance. No need to
 monitor or tell anything.
 There are guests that won't like this (I know of one in-house, but
 others may even have more examples), specifically if you end up firing
 multiple IRQs in a row due to a longer backlog. For that reason, the RTC
 spreads the reinjection according to the current rate.
 Then reinject with a constant delay, or next CPU exit. Such buggy
 If guest's time frequency is the same as host time frequency you can't
 reinject with constant delay. That is why current code mixes two
 approaches: reinject M interrupts in a raw then delay.
 This approach can be also used by APIC-only version.

 I don't know what APIC-only version you are talking about. I haven't
 seen the code and I don't understand hand waving, sorry.
 There is no code, because we're still at architecture design stage.

 Try to write test code to understand the problem better.
 
 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.
 Guest instructions cannot be easily counted in KVM (it can be done more
 or less reliably using perf counters, may be).
 Aren't there any debug registers or perf counters, which can generate
 an interrupt after some number of instructions have been executed?
 Don't think debug registers have something like that and they are
 available for guest use anyway. Perf counters differs greatly from CPU
 to CPU (even between two CPUs of the same manufacturer), and we want to
 keep using them for profiling guests. And I don't see what problem it
 will solve anyway that can be solved by simple delay between irq
 reinjection.
 This would allow counting the executed instructions and limit it. Thus
 we could emulate a 500MHz CPU on a 2GHz CPU more accurately.

 Why would you want to limit number of instruction executed by guest if
 CPU has nothing else to do anyway? The problem occurs not when we have
 spare cycles so give to a guest, but in opposite case.
 
 And even if the rate did not matter, the APIC woult still have to now
 about the fact that an IRQ is really periodic and does not only appear
 as such for a certain interval. This really does not sound like
 simplifying things or even make them cleaner.
 It would, the voodoo would be contained only in APIC, RTC would be
 just like any other device. With the bidirectional irqs, this voodoo
 would probably eventually spread to many other devices. The logical
 conclusion of that would be a 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Jan Kiszka
Blue Swirl wrote:
 2010/5/29 Jan Kiszka jan.kis...@web.de:
 Gleb Natapov wrote:
 On Fri, May 28, 2010 at 08:06:45PM +, Blue Swirl wrote:
 2010/5/28 Gleb Natapov g...@redhat.com:
 On Thu, May 27, 2010 at 06:37:10PM +, Blue Swirl wrote:
 2010/5/27 Gleb Natapov g...@redhat.com:
 On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
 On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de 
 wrote:
 Anthony Liguori wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:
 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  
 wrote:

 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during 
 delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all 
 IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, 
 it can
 apply its workaround. If multiple sinks exist, the source may 
 only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

 No real devices are interested whether any of their output lines 
 are
 even connected. This would introduce a new signal type, 
 bidirectional
 multi-level, which is not correct.

 I don't think it's really an issue of correct, but I wouldn't 
 disagree
 to a suggestion that we ought to introduce a new signal type for 
 this
 type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and 
 has a
 similar interface as qemu_irq.
 A separate type would complicate the delivery of the feedback value
 across GPIO pins (as Paul requested for the RTC-HPET routing).

 I think the real solution to coalescing is put the logic inside 
 one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an 
 internal
 timer is installed which injects the lost irqs.
 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a 
 generic
 interface - nothing really won.
 OK, let's simplify: just reinject at next possible chance. No need to
 monitor or tell anything.
 There are guests that won't like this (I know of one in-house, but
 others may even have more examples), specifically if you end up firing
 multiple IRQs in a row due to a longer backlog. For that reason, the 
 RTC
 spreads the reinjection according to the current rate.
 Then reinject with a constant delay, or next CPU exit. Such buggy
 If guest's time frequency is the same as host time frequency you can't
 reinject with constant delay. That is why current code mixes two
 approaches: reinject M interrupts in a raw then delay.
 This approach can be also used by APIC-only version.

 I don't know what APIC-only version you are talking about. I haven't
 seen the code and I don't understand hand waving, sorry.
 There is no code, because we're still at architecture design stage.

 Try to write test code to understand the problem better.

 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.
 Guest instructions cannot be easily counted in KVM (it can be done more
 or less reliably using perf counters, may be).
 Aren't there any debug registers or perf counters, which can generate
 an interrupt after some number of instructions have been executed?
 Don't think debug registers have something like that and they are
 available for guest use anyway. Perf counters differs greatly from CPU
 to CPU (even between two CPUs of the same manufacturer), and we want to
 keep using them for profiling guests. And I don't see what problem it
 will solve anyway that can be solved by simple delay between irq
 reinjection.
 This would allow counting the executed instructions and limit it. Thus
 we could emulate a 500MHz CPU on a 2GHz CPU more accurately.

 Why would you want to limit number of instruction executed by guest if
 CPU has nothing else to do anyway? The problem occurs not when we have
 spare cycles so give to a guest, but in opposite case.

 And even if the rate did not matter, the APIC woult still have to now
 about the fact that an IRQ is really periodic and does not only appear
 as such for a certain interval. This really does not sound like
 simplifying things or even make them cleaner.
 It would, the voodoo would be contained only in APIC, RTC would be
 just like any other device. With the bidirectional irqs, this voodoo
 would probably eventually spread 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Blue Swirl
On Sat, May 29, 2010 at 9:45 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 2010/5/29 Jan Kiszka jan.kis...@web.de:
 Gleb Natapov wrote:
 On Fri, May 28, 2010 at 08:06:45PM +, Blue Swirl wrote:
 2010/5/28 Gleb Natapov g...@redhat.com:
 On Thu, May 27, 2010 at 06:37:10PM +, Blue Swirl wrote:
 2010/5/27 Gleb Natapov g...@redhat.com:
 On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
 On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de 
 wrote:
 Anthony Liguori wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:
 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  
 wrote:

 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during 
 delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the 
 appropriate
 QEMU_IRQ_* code, and they have to propergate the code across 
 all IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, 
 it can
 apply its workaround. If multiple sinks exist, the source may 
 only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

 No real devices are interested whether any of their output lines 
 are
 even connected. This would introduce a new signal type, 
 bidirectional
 multi-level, which is not correct.

 I don't think it's really an issue of correct, but I wouldn't 
 disagree
 to a suggestion that we ought to introduce a new signal type for 
 this
 type of bidirectional feedback.  Maybe it's qemu_coalesced_irq 
 and has a
 similar interface as qemu_irq.
 A separate type would complicate the delivery of the feedback value
 across GPIO pins (as Paul requested for the RTC-HPET routing).

 I think the real solution to coalescing is put the logic inside 
 one
 device, in this case APIC because it has the information about 
 irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an 
 internal
 timer is installed which injects the lost irqs.
 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a 
 generic
 interface - nothing really won.
 OK, let's simplify: just reinject at next possible chance. No need 
 to
 monitor or tell anything.
 There are guests that won't like this (I know of one in-house, but
 others may even have more examples), specifically if you end up 
 firing
 multiple IRQs in a row due to a longer backlog. For that reason, the 
 RTC
 spreads the reinjection according to the current rate.
 Then reinject with a constant delay, or next CPU exit. Such buggy
 If guest's time frequency is the same as host time frequency you can't
 reinject with constant delay. That is why current code mixes two
 approaches: reinject M interrupts in a raw then delay.
 This approach can be also used by APIC-only version.

 I don't know what APIC-only version you are talking about. I haven't
 seen the code and I don't understand hand waving, sorry.
 There is no code, because we're still at architecture design stage.

 Try to write test code to understand the problem better.

 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.
 Guest instructions cannot be easily counted in KVM (it can be done more
 or less reliably using perf counters, may be).
 Aren't there any debug registers or perf counters, which can generate
 an interrupt after some number of instructions have been executed?
 Don't think debug registers have something like that and they are
 available for guest use anyway. Perf counters differs greatly from CPU
 to CPU (even between two CPUs of the same manufacturer), and we want to
 keep using them for profiling guests. And I don't see what problem it
 will solve anyway that can be solved by simple delay between irq
 reinjection.
 This would allow counting the executed instructions and limit it. Thus
 we could emulate a 500MHz CPU on a 2GHz CPU more accurately.

 Why would you want to limit number of instruction executed by guest if
 CPU has nothing else to do anyway? The problem occurs not when we have
 spare cycles so give to a guest, but in opposite case.

 And even if the rate did not matter, the APIC woult still have to now
 about the fact that an IRQ is really periodic and does not only 
 appear
 as such for a certain interval. This really does not sound like
 simplifying things or even make them cleaner.
 It would, the voodoo would be contained only in APIC, RTC would be
 just like any other 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Jan Kiszka
Blue Swirl wrote:
 On the contrary, APIC is actually the only source of the IRQ ack
 information. RTC hack would not work without APIC (or the
 bidirectional IRQ) passing this info to RTC.

 What APIC doesn't have now is the timer frequency or period info. This
 is known by RTC and also higher levels managing the clocks.

 So APIC has one bit of information and RTC everything else.
 
 The information known by RTC (timer period) is also known by higher levels.

Curious to see where you'll find this.

 
 The current
 approach (and proposed patch) brings this one bit of information to RTC,
 you are arguing that RTC should be able to communicate all its info to
 APIC. Sorry I don't see that your way has any advantage. Just more
 complex interface and it is much easier to get it wrong for other time
 sources.
 
 I don't think anymore that APIC should be handling this but the
 generic stuff, like vl.c or exec.c. Then there would be only
 information passing from APIC to higher levels.

You neglect the the information required to associate a periodic source
(e.g. RTC) with an IRQ sink (e.g. APIC). Without that, you will have a
hard time figuring out if a reported IRQ coalescing requires any
activities or should simply be welcomed (for I/O IRQs).

 
 I keep ignoring the idea that the current model, where both RTC and
 APIC must somehow work together to make coalescing work, is the only
 possible just because it is committed and it happens to work in some
 cases. It would be much better to concentrate this to one place, APIC
 or preferably higher level where it may benefit other timers too.
 Provided of course that the other models can be made to work.

 So write the code and show us. You haven't show any evidence that RTC is
 the wrong place. RTC knows when interrupt was acknowledge to RTC, it
 know when clock frequency changes, it know when device reset happened.
 APIC knows only that interrupt was coalesced. It doesn't even know that
 it may be masked by a guest in IOAPIC (interrupts delivered while they
 are masked not considered coalesced).
 
 Oh, I thought interrupt masking was the reason for coalescing! What
 exactly is the reason then?

Missing acks, ie. the IRQ is still pending when the next one arrives.
You want to filter out masked/suppressed IRQs to avoid running the
de-coalescing logic on sources that are actually cut off (like the RTC
IRQ when the HPET took over).

 
 Time source knows only when
 frequency changes and may be when device reset happens if timer is
 stopped by device on reset. So RTC is actually a sweet spot if you want
 to minimize amount of info you need to pass between various layers.

 Maybe that version would not bend backwards as much as the current to
 cater for buggy hosts.

 You mean buggy guests?
 Yes, sorry.

 What guests are not buggy in your opinion?
 Linux tries hard to be smart and as a result the only way to have stable
 clock with it is to go paravirt.
 I'm not an OS designer, but I think an OS should never crash, even if
 a burst of IRQs is received. Reprogramming the timer should consider
 the pending IRQ situation (0 or 1 with real HW). Those bugs are one
 cause of the problem.
 OS should never crash in the absence of HW bugs? I doubt you can design
 an OS that can run in a face of any HW failure. Anyway here we are
 trying to solve guests time keeping problem not crashes. Do you think
 you can design OS that can keep time accurately no matter how crazy all
 HW clock behaves?
 
 I think my OS design skills are not relevant in this discussion, but
 IIRC there are fault tolerant operating systems for extreme conditions
 so it can be done.

No one can influence the design of released OS versions anymore.

 
 The fact is that timer device is not just like any
 other device in virtual world. Any other device is easy: you just
 implement spec as close as possible and everything works. For time
 source device this is not enough. You can implement RTC+HPET to the
 letter and your guest will drift like crazy.
 It's doable: a cycle accurate emulator will not cause any drift,
 without any voodoo. The interrupts would come after executing the same
 instruction as the real HW. For emulating any sufficiently buggy
 guests in any sufficiently desperate low resource conditions, this may
 be the only option that will always work.

 Yes, but qemu and kvm are not cycle accurate emulators and don't strive
 to be one. On the contrary KVM runs at native host CPU speed most of the
 time, so any emulation done between two instruction is theoretically
 noticeable for a guest. TSC is bypassed directly to a guest too, so
 keeping all time source in perfect sync is also impossible.
 That is actually another cause of the problem. KVM gives the guest an
 illusion that the VCPU speed is equal to host speed. When they don't
 match, especially in critical code, there can be problems. It would be
 better to tell the guest a lower speed, which also can be guaranteed.

 Not possible. It's that simple. 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Blue Swirl
2010/5/28 Gleb Natapov g...@redhat.com:
 On Fri, May 28, 2010 at 08:06:45PM +, Blue Swirl wrote:
 2010/5/28 Gleb Natapov g...@redhat.com:
  On Thu, May 27, 2010 at 06:37:10PM +, Blue Swirl wrote:
  2010/5/27 Gleb Natapov g...@redhat.com:
   On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
   On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
Blue Swirl wrote:
On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de 
wrote:
Anthony Liguori wrote:
On 05/25/2010 02:09 PM, Blue Swirl wrote:
On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  
wrote:
   
From: Jan Kiszkajan.kis...@siemens.com
   
This allows to communicate potential IRQ coalescing during 
delivery from
the sink back to the source. Targets that support IRQ coalescing
workarounds need to register handlers that return the 
appropriate
QEMU_IRQ_* code, and they have to propergate the code across 
all IRQ
redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, 
it can
apply its workaround. If multiple sinks exist, the source may 
only
consider an IRQ coalesced if all other sinks either report
QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
   
No real devices are interested whether any of their output lines 
are
even connected. This would introduce a new signal type, 
bidirectional
multi-level, which is not correct.
   
I don't think it's really an issue of correct, but I wouldn't 
disagree
to a suggestion that we ought to introduce a new signal type for 
this
type of bidirectional feedback.  Maybe it's qemu_coalesced_irq 
and has a
similar interface as qemu_irq.
A separate type would complicate the delivery of the feedback value
across GPIO pins (as Paul requested for the RTC-HPET routing).
   
I think the real solution to coalescing is put the logic inside 
one
device, in this case APIC because it has the information about 
irq
delivery. APIC could monitor incoming RTC irqs for frequency
information and whether they get delivered or not. If not, an 
internal
timer is installed which injects the lost irqs.
That won't fly as the IRQs will already arrive at the APIC with a
sufficiently high jitter. At the bare minimum, you need to tell the
interrupt controller about the fact that a particular IRQ should be
delivered at a specific regular rate. For this, you also need a 
generic
interface - nothing really won.
   
OK, let's simplify: just reinject at next possible chance. No need 
to
monitor or tell anything.
   
There are guests that won't like this (I know of one in-house, but
others may even have more examples), specifically if you end up 
firing
multiple IRQs in a row due to a longer backlog. For that reason, the 
RTC
spreads the reinjection according to the current rate.
  
   Then reinject with a constant delay, or next CPU exit. Such buggy
   If guest's time frequency is the same as host time frequency you can't
   reinject with constant delay. That is why current code mixes two
   approaches: reinject M interrupts in a raw then delay.
 
  This approach can be also used by APIC-only version.
 
  I don't know what APIC-only version you are talking about. I haven't
  seen the code and I don't understand hand waving, sorry.

 There is no code, because we're still at architecture design stage.

 Try to write test code to understand the problem better.

I will.

   guests could also be assisted with special handling (like win2k
   install hack), for example guest instructions could be counted
   (approximately, for example using TB size or TSC) and only inject
   after at least N instructions have passed.
   Guest instructions cannot be easily counted in KVM (it can be done more
   or less reliably using perf counters, may be).
 
  Aren't there any debug registers or perf counters, which can generate
  an interrupt after some number of instructions have been executed?
  Don't think debug registers have something like that and they are
  available for guest use anyway. Perf counters differs greatly from CPU
  to CPU (even between two CPUs of the same manufacturer), and we want to
  keep using them for profiling guests. And I don't see what problem it
  will solve anyway that can be solved by simple delay between irq
  reinjection.

 This would allow counting the executed instructions and limit it. Thus
 we could emulate a 500MHz CPU on a 2GHz CPU more accurately.

 Why would you want to limit number of instruction executed by guest if
 CPU has nothing else to do anyway? The problem occurs not when we have
 spare cycles so give to a guest, but in opposite case.

I think one problem is that the guest has executed too much compared
to what would happen with real HW with a lesser CPU. That explains the
RTC frequency reprogramming case.


 
  
And even if the rate 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Gleb Natapov
On Sat, May 29, 2010 at 04:03:22PM +, Blue Swirl wrote:
 2010/5/29 Gleb Natapov g...@redhat.com:
  On Sat, May 29, 2010 at 09:15:11AM +, Blue Swirl wrote:
   There is no code, because we're still at architecture design stage.
  
   Try to write test code to understand the problem better.
 
  I will.
 
  Please do ASAP. This discussion shows that you don't understand what is the
  problem that we are dialing with.
 
 Which part of the problem you think I don't understand?
 
It seams to me you don't understand how Windows uses RTC for time
keeping and how the QEMU solves the problem today.

 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.
 Guest instructions cannot be easily counted in KVM (it can be done 
 more
 or less reliably using perf counters, may be).
   
Aren't there any debug registers or perf counters, which can generate
an interrupt after some number of instructions have been executed?
Don't think debug registers have something like that and they are
available for guest use anyway. Perf counters differs greatly from CPU
to CPU (even between two CPUs of the same manufacturer), and we want 
to
keep using them for profiling guests. And I don't see what problem it
will solve anyway that can be solved by simple delay between irq
reinjection.
  
   This would allow counting the executed instructions and limit it. Thus
   we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
  
   Why would you want to limit number of instruction executed by guest if
   CPU has nothing else to do anyway? The problem occurs not when we have
   spare cycles so give to a guest, but in opposite case.
 
  I think one problem is that the guest has executed too much compared
  to what would happen with real HW with a lesser CPU. That explains the
  RTC frequency reprogramming case.
  You think wrong. The problem is exactly opposite: the guest haven't
  had enough execution time between two time interrupts. I don't know what
  RTC frequency reprogramming case you are talking about here.
 
 The case you told me where N pending tick IRQs exist but the guest
 wants to change the RTC frequency from 64Hz to 1024Hz.
 
 Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
 the frequency to 1000Hz.
 
 The problem for emulation is that for the same 3 ticks, there has been
 so little execution power that the ticks have been coalesced. But
 isn't the guest cycle count then much lower than 30Mcyc?
 
 Isn't it so that the guest must be above 30Mcyc to be able to want the
 change? But if we reach that point,  the problem must have not been
 too little execution time, but too much.
 
Sorry I tried hard to understand what have you said above but failed.
What do you mean to be able to want the change? Guest sometimes wants
to get 64 timer interrupts per second and sometimes it wants to get 1024
timer interrupt per second. It wants it not as a result of time drift or
anything. It's just how guest behaves. You seams to be to fixated on
guest frequency change. It's just something you have to take into
account when you reinject interrupts.


 
 
  
   

  And even if the rate did not matter, the APIC woult still have 
  to now
  about the fact that an IRQ is really periodic and does not only 
  appear
  as such for a certain interval. This really does not sound like
  simplifying things or even make them cleaner.

 It would, the voodoo would be contained only in APIC, RTC would be
 just like any other device. With the bidirectional irqs, this 
 voodoo
 would probably eventually spread to many other devices. The 
 logical
 conclusion of that would be a system where all devices would be
 careful not to disturb the guest at wrong moment because that 
 would
 trigger a bug.

 This voodoo will be so complex and unreliable that it will make 
 RTC hack
 pale in comparison (and I still don't see how you are going to 
 make it
 actually work).
   
Implement everything inside APIC: only coalescing and reinjection.
APIC has zero info needed to implement reinjection correctly as was
shown to you several time in this thread and you simply keep ignoring
it.
  
   On the contrary, APIC is actually the only source of the IRQ ack
   information. RTC hack would not work without APIC (or the
   bidirectional IRQ) passing this info to RTC.
  
   What APIC doesn't have now is the timer frequency or period info. This
   is known by RTC and also higher levels managing the clocks.
  
   So APIC has one bit of information and RTC everything else.
 
  The information known by RTC (timer period) is also known by higher 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Blue Swirl
On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov g...@redhat.com wrote:
 On Sat, May 29, 2010 at 09:35:30AM +, Blue Swirl wrote:
  I still don't see how the alternative is supposed to simplify our life
  or improve the efficiency of the de-coalescing workaround. It's rather
  problematic like Gleb pointed out: The de-coalescing logic needs to be
  informed about periodicity changes that can only be delivered along
  IRQs. So what to do with the backlog when the timer is stopped?

 What happens with the current design? Gleb only mentioned the
 frequency change, I thought that was not so big problem. But I don't
 think this case should be allowed happen at all, it can't exist on
 real HW.

 Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
 inside QEMU, connect with gdb to QEMU process and check what frequency
 RTC configured with (hint: it will be 64Hz), now run video inside the
 guest and check frequency again (hint: it will be 1Khz).

You missed the key word 'stopped'. If the timer is really stopped, no
IRQs should ever come out afterwards, just like on real HW. For the
emulation, this means loss of ticks which should have been delivered
before the change.

But what if the guest changed the frequency very often, and between
changes used zero value, like 64Hz - 0Hz - 128Hz - 0Hz - 64Hz?



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Jan Kiszka
Blue Swirl wrote:
 On Sat, May 29, 2010 at 10:16 AM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 This is - according to my current understanding - the proposed
 alternative architecture:

  .---.
  | de-coalescing |
  | logic |
  '---'
^   |
 period,|   |IRQ
   coalesced|   |(or tuned VM clock)
(yes/no)|   v
 .---.  .. .---.
 |  RTC  |-IRQ-| router |-IRQ| APIC  |
 '---'  '' '---'
|^|   ^
|||   |
'---period---''--period---'
 The period information is already known by the higher level clock
 management,
 The timer device models program the next event of some qemu-timer. There
 is no tag attached explaining the timer subsystem or anyone else the
 reason behind this programming.
 Yes, but why would we care? All timers in the system besides RTC
 should be affected likewise, this would in fact be the benefit from
 handling the problem at higher level.
 And how does your equation for calculating the clock slow-down look like?
 
 How about icount_adjust()?

I would suggest that you implement your ideas now. Please keep us
informed about the progress as this series (and more) depends on a decision.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Gleb Natapov
On Sat, May 29, 2010 at 04:13:22PM +, Blue Swirl wrote:
 On Sat, May 29, 2010 at 2:46 PM, Gleb Natapov g...@redhat.com wrote:
  On Sat, May 29, 2010 at 09:35:30AM +, Blue Swirl wrote:
   I still don't see how the alternative is supposed to simplify our life
   or improve the efficiency of the de-coalescing workaround. It's rather
   problematic like Gleb pointed out: The de-coalescing logic needs to be
   informed about periodicity changes that can only be delivered along
   IRQs. So what to do with the backlog when the timer is stopped?
 
  What happens with the current design? Gleb only mentioned the
  frequency change, I thought that was not so big problem. But I don't
  think this case should be allowed happen at all, it can't exist on
  real HW.
 
  Hm, why it can't exist on real HW? Do simple exercise. Run WindowsXP
  inside QEMU, connect with gdb to QEMU process and check what frequency
  RTC configured with (hint: it will be 64Hz), now run video inside the
  guest and check frequency again (hint: it will be 1Khz).
 
 You missed the key word 'stopped'. If the timer is really stopped, no
 IRQs should ever come out afterwards, just like on real HW. For the
 emulation, this means loss of ticks which should have been delivered
 before the change.
 
I haven't missed it. I describe to you reality of the situation. You want
to change reality to be more close to what you want it to be by adding
words to my description. Please just go write code, experiment, debug
and _then_ come here with design.

 But what if the guest changed the frequency very often, and between
 changes used zero value, like 64Hz - 0Hz - 128Hz - 0Hz - 64Hz?
Too bad, the world is not perfect.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Blue Swirl
On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov g...@redhat.com wrote:
 On Sat, May 29, 2010 at 04:03:22PM +, Blue Swirl wrote:
 2010/5/29 Gleb Natapov g...@redhat.com:
  On Sat, May 29, 2010 at 09:15:11AM +, Blue Swirl wrote:
   There is no code, because we're still at architecture design stage.
  
   Try to write test code to understand the problem better.
 
  I will.
 
  Please do ASAP. This discussion shows that you don't understand what is the
  problem that we are dialing with.

 Which part of the problem you think I don't understand?

 It seams to me you don't understand how Windows uses RTC for time
 keeping and how the QEMU solves the problem today.

RTC causes periodic interrupts and Windows interrupt handler
increments jiffies, like Linux?

 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.
 Guest instructions cannot be easily counted in KVM (it can be 
 done more
 or less reliably using perf counters, may be).
   
Aren't there any debug registers or perf counters, which can 
generate
an interrupt after some number of instructions have been executed?
Don't think debug registers have something like that and they are
available for guest use anyway. Perf counters differs greatly from 
CPU
to CPU (even between two CPUs of the same manufacturer), and we want 
to
keep using them for profiling guests. And I don't see what problem it
will solve anyway that can be solved by simple delay between irq
reinjection.
  
   This would allow counting the executed instructions and limit it. Thus
   we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
  
   Why would you want to limit number of instruction executed by guest if
   CPU has nothing else to do anyway? The problem occurs not when we have
   spare cycles so give to a guest, but in opposite case.
 
  I think one problem is that the guest has executed too much compared
  to what would happen with real HW with a lesser CPU. That explains the
  RTC frequency reprogramming case.
  You think wrong. The problem is exactly opposite: the guest haven't
  had enough execution time between two time interrupts. I don't know what
  RTC frequency reprogramming case you are talking about here.

 The case you told me where N pending tick IRQs exist but the guest
 wants to change the RTC frequency from 64Hz to 1024Hz.

 Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
 the frequency to 1000Hz.

 The problem for emulation is that for the same 3 ticks, there has been
 so little execution power that the ticks have been coalesced. But
 isn't the guest cycle count then much lower than 30Mcyc?

 Isn't it so that the guest must be above 30Mcyc to be able to want the
 change? But if we reach that point,  the problem must have not been
 too little execution time, but too much.

 Sorry I tried hard to understand what have you said above but failed.
 What do you mean to be able to want the change? Guest sometimes wants
 to get 64 timer interrupts per second and sometimes it wants to get 1024
 timer interrupt per second. It wants it not as a result of time drift or
 anything. It's just how guest behaves. You seams to be to fixated on
 guest frequency change. It's just something you have to take into
 account when you reinject interrupts.

I meant that in the scenario, the guest won't change the RTC before
30Mcyc because of some built in determinism in the guest. At that
point, because of some reason, the change would happen.


 
 
  
   

  And even if the rate did not matter, the APIC woult still have 
  to now
  about the fact that an IRQ is really periodic and does not 
  only appear
  as such for a certain interval. This really does not sound like
  simplifying things or even make them cleaner.

 It would, the voodoo would be contained only in APIC, RTC would 
 be
 just like any other device. With the bidirectional irqs, this 
 voodoo
 would probably eventually spread to many other devices. The 
 logical
 conclusion of that would be a system where all devices would be
 careful not to disturb the guest at wrong moment because that 
 would
 trigger a bug.

 This voodoo will be so complex and unreliable that it will make 
 RTC hack
 pale in comparison (and I still don't see how you are going to 
 make it
 actually work).
   
Implement everything inside APIC: only coalescing and reinjection.
APIC has zero info needed to implement reinjection correctly as was
shown to you several time in this thread and you simply keep ignoring
it.
  
   On the contrary, APIC is actually the only source of the IRQ ack
   information. 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-29 Thread Gleb Natapov
On Sat, May 29, 2010 at 08:52:34PM +, Blue Swirl wrote:
 On Sat, May 29, 2010 at 4:32 PM, Gleb Natapov g...@redhat.com wrote:
  On Sat, May 29, 2010 at 04:03:22PM +, Blue Swirl wrote:
  2010/5/29 Gleb Natapov g...@redhat.com:
   On Sat, May 29, 2010 at 09:15:11AM +, Blue Swirl wrote:
There is no code, because we're still at architecture design stage.
   
Try to write test code to understand the problem better.
  
   I will.
  
   Please do ASAP. This discussion shows that you don't understand what is 
   the
   problem that we are dialing with.
 
  Which part of the problem you think I don't understand?
 
  It seams to me you don't understand how Windows uses RTC for time
  keeping and how the QEMU solves the problem today.
 
 RTC causes periodic interrupts and Windows interrupt handler
 increments jiffies, like Linux?
 
Linux does much more complicated things than that to keep time, so the
only way to fix time drift in Linux was to introduce pvclock. For Window
it is not so accurate too, since Windows can change clock frequency any
time it can't calculate time from jiffies, it needs to update clock at
each time tick.

  guests could also be assisted with special handling (like win2k
  install hack), for example guest instructions could be counted
  (approximately, for example using TB size or TSC) and only 
  inject
  after at least N instructions have passed.
  Guest instructions cannot be easily counted in KVM (it can be 
  done more
  or less reliably using perf counters, may be).

 Aren't there any debug registers or perf counters, which can 
 generate
 an interrupt after some number of instructions have been executed?
 Don't think debug registers have something like that and they are
 available for guest use anyway. Perf counters differs greatly from 
 CPU
 to CPU (even between two CPUs of the same manufacturer), and we 
 want to
 keep using them for profiling guests. And I don't see what problem 
 it
 will solve anyway that can be solved by simple delay between irq
 reinjection.
   
This would allow counting the executed instructions and limit it. 
Thus
we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
   
Why would you want to limit number of instruction executed by guest if
CPU has nothing else to do anyway? The problem occurs not when we have
spare cycles so give to a guest, but in opposite case.
  
   I think one problem is that the guest has executed too much compared
   to what would happen with real HW with a lesser CPU. That explains the
   RTC frequency reprogramming case.
   You think wrong. The problem is exactly opposite: the guest haven't
   had enough execution time between two time interrupts. I don't know what
   RTC frequency reprogramming case you are talking about here.
 
  The case you told me where N pending tick IRQs exist but the guest
  wants to change the RTC frequency from 64Hz to 1024Hz.
 
  Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
  10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
  the frequency to 1000Hz.
 
  The problem for emulation is that for the same 3 ticks, there has been
  so little execution power that the ticks have been coalesced. But
  isn't the guest cycle count then much lower than 30Mcyc?
 
  Isn't it so that the guest must be above 30Mcyc to be able to want the
  change? But if we reach that point,  the problem must have not been
  too little execution time, but too much.
 
  Sorry I tried hard to understand what have you said above but failed.
  What do you mean to be able to want the change? Guest sometimes wants
  to get 64 timer interrupts per second and sometimes it wants to get 1024
  timer interrupt per second. It wants it not as a result of time drift or
  anything. It's just how guest behaves. You seams to be to fixated on
  guest frequency change. It's just something you have to take into
  account when you reinject interrupts.
 
 I meant that in the scenario, the guest won't change the RTC before
 30Mcyc because of some built in determinism in the guest. At that
 point, because of some reason, the change would happen.
 
I still don't understand what are you trying to say here. Guest changes
frequency because of some even in the guest. It is totally independent
of what happens in QEMUs RTC emulation.

 
  
  
   

 
   And even if the rate did not matter, the APIC woult still 
   have to now
   about the fact that an IRQ is really periodic and does not 
   only appear
   as such for a certain interval. This really does not sound 
   like
   simplifying things or even make them cleaner.
 
  It would, the voodoo would be contained only in APIC, RTC 
  would be
  just like any other device. With the bidirectional irqs, this 
  voodoo
  would probably eventually spread to many other devices. The 
  

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-28 Thread Gleb Natapov
On Thu, May 27, 2010 at 06:37:10PM +, Blue Swirl wrote:
 2010/5/27 Gleb Natapov g...@redhat.com:
  On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
  On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
   Blue Swirl wrote:
   On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
   Anthony Liguori wrote:
   On 05/25/2010 02:09 PM, Blue Swirl wrote:
   On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  
   wrote:
  
   From: Jan Kiszkajan.kis...@siemens.com
  
   This allows to communicate potential IRQ coalescing during delivery 
   from
   the sink back to the source. Targets that support IRQ coalescing
   workarounds need to register handlers that return the appropriate
   QEMU_IRQ_* code, and they have to propergate the code across all IRQ
   redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it 
   can
   apply its workaround. If multiple sinks exist, the source may only
   consider an IRQ coalesced if all other sinks either report
   QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
  
   No real devices are interested whether any of their output lines are
   even connected. This would introduce a new signal type, bidirectional
   multi-level, which is not correct.
  
   I don't think it's really an issue of correct, but I wouldn't disagree
   to a suggestion that we ought to introduce a new signal type for this
   type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and 
   has a
   similar interface as qemu_irq.
   A separate type would complicate the delivery of the feedback value
   across GPIO pins (as Paul requested for the RTC-HPET routing).
  
   I think the real solution to coalescing is put the logic inside one
   device, in this case APIC because it has the information about irq
   delivery. APIC could monitor incoming RTC irqs for frequency
   information and whether they get delivered or not. If not, an 
   internal
   timer is installed which injects the lost irqs.
   That won't fly as the IRQs will already arrive at the APIC with a
   sufficiently high jitter. At the bare minimum, you need to tell the
   interrupt controller about the fact that a particular IRQ should be
   delivered at a specific regular rate. For this, you also need a generic
   interface - nothing really won.
  
   OK, let's simplify: just reinject at next possible chance. No need to
   monitor or tell anything.
  
   There are guests that won't like this (I know of one in-house, but
   others may even have more examples), specifically if you end up firing
   multiple IRQs in a row due to a longer backlog. For that reason, the RTC
   spreads the reinjection according to the current rate.
 
  Then reinject with a constant delay, or next CPU exit. Such buggy
  If guest's time frequency is the same as host time frequency you can't
  reinject with constant delay. That is why current code mixes two
  approaches: reinject M interrupts in a raw then delay.
 
 This approach can be also used by APIC-only version.
 
I don't know what APIC-only version you are talking about. I haven't
seen the code and I don't understand hand waving, sorry.

  guests could also be assisted with special handling (like win2k
  install hack), for example guest instructions could be counted
  (approximately, for example using TB size or TSC) and only inject
  after at least N instructions have passed.
  Guest instructions cannot be easily counted in KVM (it can be done more
  or less reliably using perf counters, may be).
 
 Aren't there any debug registers or perf counters, which can generate
 an interrupt after some number of instructions have been executed?
Don't think debug registers have something like that and they are
available for guest use anyway. Perf counters differs greatly from CPU
to CPU (even between two CPUs of the same manufacturer), and we want to
keep using them for profiling guests. And I don't see what problem it
will solve anyway that can be solved by simple delay between irq
reinjection.

 
 
   And even if the rate did not matter, the APIC woult still have to now
   about the fact that an IRQ is really periodic and does not only appear
   as such for a certain interval. This really does not sound like
   simplifying things or even make them cleaner.
 
  It would, the voodoo would be contained only in APIC, RTC would be
  just like any other device. With the bidirectional irqs, this voodoo
  would probably eventually spread to many other devices. The logical
  conclusion of that would be a system where all devices would be
  careful not to disturb the guest at wrong moment because that would
  trigger a bug.
 
  This voodoo will be so complex and unreliable that it will make RTC hack
  pale in comparison (and I still don't see how you are going to make it
  actually work).
 
 Implement everything inside APIC: only coalescing and reinjection.
APIC has zero info needed to implement reinjection correctly as was
shown to you several time in this thread 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-28 Thread Blue Swirl
On Thu, May 27, 2010 at 10:19 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Wed, May 26, 2010 at 11:26 PM, Paul Brook p...@codesourcery.com 
 wrote:
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?
 The guest doesn't even need to be aware of virtualization. It just 
 needs to be
 able to accommodate the lack of guaranteed realtime behavior.

 The fundamental problem here is that some guest operating systems 
 assume that
 the hardware provides certain realtime guarantees with respect to 
 execution of
 interrupt handlers.  In particular they assume that the CPU will always 
 be
 able to complete execution of the timer IRQ handler before the periodic 
 timer
 triggers again.  In most virtualized environments you have absolutely no
 guarantee of realtime response.

 With Linux guests this was solved a long time ago by the introduction of
 tickless kernels.  These separate the timekeeping from wakeup events, 
 so it
 doesn't matter if several wakeup triggers end up getting merged (either 
 at the
 hardware level or via top/bottom half guest IRQ handlers).


 It's worth mentioning that this problem also occurs on real hardware,
 typically due to lame hardware/drivers which end up masking interrupts 
 or
 otherwise stall the CPU for for long periods of time.


 The PIT hack attempts to workaround broken guests by adding artificial 
 latency
 to the timer event, ensuring that the guest sees them all.  
 Unfortunately
 guests vary on when it is safe for them to see the next timer event, and
 trying to observe this behavior involves potentially harmful heuristics 
 and
 collusion between unrelated devices (e.g. interrupt controller and 
 timer).

 In some cases we don't even do that, and just reschedule the event some
 arbitrarily small amount of time later. This assumes the guest to do 
 useful
 work in that time. In a single threaded environment this is probably 
 true -
 qemu got enough CPU to inject the first interrupt, so will probably 
 manage to
 execute some guest code before the end of its timeslice. In an 
 environment
 where interrupt processing/delivery and execution of the guest code 
 happen in
 different threads this becomes increasingly likely to fail.
 So any voodoo around timer events is doomed to fail in some cases.
 What's the amount of hacks what we want then? Is there any generic
 The aim of this patch is to reduce the amount of existing and upcoming
 hacks. It may still require some refinements, but I think we haven't
 found any smarter approach yet that fits existing use cases.
 I don't feel we have tried other possibilities hard enough.
 Well, seeing prototypes wouldn't be bad, also to run real load againt
 them. But at least I'm currently clueless what to implement.

 Perhaps now is then not the time to rush to implement something, but
 to brainstorm for a clean solution.

 And sometimes it can help to understand how ideas could even be improved
 or why others doesn't work at all.


 solution, like slowing down the guest system to the point where we can
 guarantee the interrupt rate vs. CPU execution speed?
 That's generally a non-option in virtualized production environments.
 Specifically if the guest system lost interrupts due to host
 overcommitment, you do not want it slow down even further.
 I meant that the guest time could be scaled down, for example 2s in
 wall clock time would be presented to the guest as 1s.
 But that is precisely what already happens when the guest loses timer
 interrupts. There is no other time source for this kind of guests -
 often except for some external events generated by systems which you
 don't want to fall behind arbitrarily.

 Then the amount
 of CPU cycles between timer interrupts would increase and hopefully
 the guest can keep up. If the guest sleeps, time base could be
 accelerated to catch up with wall clock and then set back to 1:1 rate.
 Can't follow you ATM, sorry. What should be slowed down then? And how
 precisely?

 I think vm_clock and everything that depends on vm_clock, also
 rtc_clock should be tied to vm_clock in this mode, not host_clock.

 Let me check if I got this idea correctly: Instead of tuning just the
 tick frequency of the affected timer device / sending its backlog in a
 row, you rather want to tune the vm_clock correspondingly? Maybe a way
 to abstract the required logic currently sitting only in the RTC for use
 by other timer sources as well.

Yes, that would be a good starting point.

 But just switching rtc_clock to vm_clock when the user wants host_clock
 is obviously not an option. We would rather have to tune host_clock in
 parallel.

 Still, this does not answer:

 - How do you want to detect lost timer ticks?

With APIC, just like 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-28 Thread Blue Swirl
On Thu, May 27, 2010 at 10:21 PM, Paul Brook p...@codesourcery.com wrote:

  Then the amount
  of CPU cycles between timer interrupts would increase and hopefully
  the guest can keep up. If the guest sleeps, time base could be
  accelerated to catch up with wall clock and then set back to 1:1 rate.
 
  Can't follow you ATM, sorry. What should be slowed down then? And how
  precisely?

 I think vm_clock and everything that depends on vm_clock, also
 rtc_clock should be tied to vm_clock in this mode, not host_clock.

 The problem is more fundamental than that. There is no real correlation
 between vm_clock and the amount of code executed by the guest, especially not
 on timescales less than a second.

Can we measure (or at least estimate with higher accuracy than the
tick IRQ delivery jitter) the amount of code executed, somehow? For
example, add TSC sampling to all TB or KVM VCPU exit and load/store
paths?



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-28 Thread Blue Swirl
2010/5/28 Gleb Natapov g...@redhat.com:
 On Thu, May 27, 2010 at 06:37:10PM +, Blue Swirl wrote:
 2010/5/27 Gleb Natapov g...@redhat.com:
  On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
  On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
   Blue Swirl wrote:
   On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
   Anthony Liguori wrote:
   On 05/25/2010 02:09 PM, Blue Swirl wrote:
   On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  
   wrote:
  
   From: Jan Kiszkajan.kis...@siemens.com
  
   This allows to communicate potential IRQ coalescing during 
   delivery from
   the sink back to the source. Targets that support IRQ coalescing
   workarounds need to register handlers that return the appropriate
   QEMU_IRQ_* code, and they have to propergate the code across all 
   IRQ
   redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it 
   can
   apply its workaround. If multiple sinks exist, the source may only
   consider an IRQ coalesced if all other sinks either report
   QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
  
   No real devices are interested whether any of their output lines are
   even connected. This would introduce a new signal type, 
   bidirectional
   multi-level, which is not correct.
  
   I don't think it's really an issue of correct, but I wouldn't 
   disagree
   to a suggestion that we ought to introduce a new signal type for this
   type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and 
   has a
   similar interface as qemu_irq.
   A separate type would complicate the delivery of the feedback value
   across GPIO pins (as Paul requested for the RTC-HPET routing).
  
   I think the real solution to coalescing is put the logic inside one
   device, in this case APIC because it has the information about irq
   delivery. APIC could monitor incoming RTC irqs for frequency
   information and whether they get delivered or not. If not, an 
   internal
   timer is installed which injects the lost irqs.
   That won't fly as the IRQs will already arrive at the APIC with a
   sufficiently high jitter. At the bare minimum, you need to tell the
   interrupt controller about the fact that a particular IRQ should be
   delivered at a specific regular rate. For this, you also need a 
   generic
   interface - nothing really won.
  
   OK, let's simplify: just reinject at next possible chance. No need to
   monitor or tell anything.
  
   There are guests that won't like this (I know of one in-house, but
   others may even have more examples), specifically if you end up firing
   multiple IRQs in a row due to a longer backlog. For that reason, the RTC
   spreads the reinjection according to the current rate.
 
  Then reinject with a constant delay, or next CPU exit. Such buggy
  If guest's time frequency is the same as host time frequency you can't
  reinject with constant delay. That is why current code mixes two
  approaches: reinject M interrupts in a raw then delay.

 This approach can be also used by APIC-only version.

 I don't know what APIC-only version you are talking about. I haven't
 seen the code and I don't understand hand waving, sorry.

There is no code, because we're still at architecture design stage.

  guests could also be assisted with special handling (like win2k
  install hack), for example guest instructions could be counted
  (approximately, for example using TB size or TSC) and only inject
  after at least N instructions have passed.
  Guest instructions cannot be easily counted in KVM (it can be done more
  or less reliably using perf counters, may be).

 Aren't there any debug registers or perf counters, which can generate
 an interrupt after some number of instructions have been executed?
 Don't think debug registers have something like that and they are
 available for guest use anyway. Perf counters differs greatly from CPU
 to CPU (even between two CPUs of the same manufacturer), and we want to
 keep using them for profiling guests. And I don't see what problem it
 will solve anyway that can be solved by simple delay between irq
 reinjection.

This would allow counting the executed instructions and limit it. Thus
we could emulate a 500MHz CPU on a 2GHz CPU more accurately.


 
   And even if the rate did not matter, the APIC woult still have to now
   about the fact that an IRQ is really periodic and does not only appear
   as such for a certain interval. This really does not sound like
   simplifying things or even make them cleaner.
 
  It would, the voodoo would be contained only in APIC, RTC would be
  just like any other device. With the bidirectional irqs, this voodoo
  would probably eventually spread to many other devices. The logical
  conclusion of that would be a system where all devices would be
  careful not to disturb the guest at wrong moment because that would
  trigger a bug.
 
  This voodoo will be so complex and unreliable that it will make RTC hack
  pale 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-28 Thread Gleb Natapov
On Fri, May 28, 2010 at 08:06:45PM +, Blue Swirl wrote:
 2010/5/28 Gleb Natapov g...@redhat.com:
  On Thu, May 27, 2010 at 06:37:10PM +, Blue Swirl wrote:
  2010/5/27 Gleb Natapov g...@redhat.com:
   On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
   On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
Blue Swirl wrote:
On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de 
wrote:
Anthony Liguori wrote:
On 05/25/2010 02:09 PM, Blue Swirl wrote:
On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  
wrote:
   
From: Jan Kiszkajan.kis...@siemens.com
   
This allows to communicate potential IRQ coalescing during 
delivery from
the sink back to the source. Targets that support IRQ coalescing
workarounds need to register handlers that return the appropriate
QEMU_IRQ_* code, and they have to propergate the code across all 
IRQ
redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, 
it can
apply its workaround. If multiple sinks exist, the source may 
only
consider an IRQ coalesced if all other sinks either report
QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
   
No real devices are interested whether any of their output lines 
are
even connected. This would introduce a new signal type, 
bidirectional
multi-level, which is not correct.
   
I don't think it's really an issue of correct, but I wouldn't 
disagree
to a suggestion that we ought to introduce a new signal type for 
this
type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and 
has a
similar interface as qemu_irq.
A separate type would complicate the delivery of the feedback value
across GPIO pins (as Paul requested for the RTC-HPET routing).
   
I think the real solution to coalescing is put the logic inside 
one
device, in this case APIC because it has the information about irq
delivery. APIC could monitor incoming RTC irqs for frequency
information and whether they get delivered or not. If not, an 
internal
timer is installed which injects the lost irqs.
That won't fly as the IRQs will already arrive at the APIC with a
sufficiently high jitter. At the bare minimum, you need to tell the
interrupt controller about the fact that a particular IRQ should be
delivered at a specific regular rate. For this, you also need a 
generic
interface - nothing really won.
   
OK, let's simplify: just reinject at next possible chance. No need to
monitor or tell anything.
   
There are guests that won't like this (I know of one in-house, but
others may even have more examples), specifically if you end up firing
multiple IRQs in a row due to a longer backlog. For that reason, the 
RTC
spreads the reinjection according to the current rate.
  
   Then reinject with a constant delay, or next CPU exit. Such buggy
   If guest's time frequency is the same as host time frequency you can't
   reinject with constant delay. That is why current code mixes two
   approaches: reinject M interrupts in a raw then delay.
 
  This approach can be also used by APIC-only version.
 
  I don't know what APIC-only version you are talking about. I haven't
  seen the code and I don't understand hand waving, sorry.
 
 There is no code, because we're still at architecture design stage.
 
Try to write test code to understand the problem better.

   guests could also be assisted with special handling (like win2k
   install hack), for example guest instructions could be counted
   (approximately, for example using TB size or TSC) and only inject
   after at least N instructions have passed.
   Guest instructions cannot be easily counted in KVM (it can be done more
   or less reliably using perf counters, may be).
 
  Aren't there any debug registers or perf counters, which can generate
  an interrupt after some number of instructions have been executed?
  Don't think debug registers have something like that and they are
  available for guest use anyway. Perf counters differs greatly from CPU
  to CPU (even between two CPUs of the same manufacturer), and we want to
  keep using them for profiling guests. And I don't see what problem it
  will solve anyway that can be solved by simple delay between irq
  reinjection.
 
 This would allow counting the executed instructions and limit it. Thus
 we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
 
Why would you want to limit number of instruction executed by guest if
CPU has nothing else to do anyway? The problem occurs not when we have
spare cycles so give to a guest, but in opposite case.

 
  
And even if the rate did not matter, the APIC woult still have to now
about the fact that an IRQ is really periodic and does not only appear
as such for a certain interval. This really does not sound like
simplifying things or even make them cleaner.
  
   

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Gleb Natapov
On Wed, May 26, 2010 at 10:09:52PM +0200, Jan Kiszka wrote:
 Blue Swirl wrote:
  On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
  Anthony Liguori wrote:
  On 05/25/2010 02:09 PM, Blue Swirl wrote:
  On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:
 
  From: Jan Kiszkajan.kis...@siemens.com
 
  This allows to communicate potential IRQ coalescing during delivery from
  the sink back to the source. Targets that support IRQ coalescing
  workarounds need to register handlers that return the appropriate
  QEMU_IRQ_* code, and they have to propergate the code across all IRQ
  redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
  apply its workaround. If multiple sinks exist, the source may only
  consider an IRQ coalesced if all other sinks either report
  QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
 
  No real devices are interested whether any of their output lines are
  even connected. This would introduce a new signal type, bidirectional
  multi-level, which is not correct.
 
  I don't think it's really an issue of correct, but I wouldn't disagree
  to a suggestion that we ought to introduce a new signal type for this
  type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
  similar interface as qemu_irq.
  A separate type would complicate the delivery of the feedback value
  across GPIO pins (as Paul requested for the RTC-HPET routing).
 
  I think the real solution to coalescing is put the logic inside one
  device, in this case APIC because it has the information about irq
  delivery. APIC could monitor incoming RTC irqs for frequency
  information and whether they get delivered or not. If not, an internal
  timer is installed which injects the lost irqs.
  That won't fly as the IRQs will already arrive at the APIC with a
  sufficiently high jitter. At the bare minimum, you need to tell the
  interrupt controller about the fact that a particular IRQ should be
  delivered at a specific regular rate. For this, you also need a generic
  interface - nothing really won.
  
  OK, let's simplify: just reinject at next possible chance. No need to
  monitor or tell anything.
 
 There are guests that won't like this (I know of one in-house, but
 others may even have more examples), specifically if you end up firing
 multiple IRQs in a row due to a longer backlog. For that reason, the RTC
 spreads the reinjection according to the current rate.
 
No need to look far for such guests. See commit dd17765b5f77ca.

 And even if the rate did not matter, the APIC woult still have to now
 about the fact that an IRQ is really periodic and does not only appear
 as such for a certain interval. This really does not sound like
 simplifying things or even make them cleaner.
 
 Jan
 



--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Gleb Natapov
On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
 On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
  Blue Swirl wrote:
  On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
  Anthony Liguori wrote:
  On 05/25/2010 02:09 PM, Blue Swirl wrote:
  On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:
 
  From: Jan Kiszkajan.kis...@siemens.com
 
  This allows to communicate potential IRQ coalescing during delivery 
  from
  the sink back to the source. Targets that support IRQ coalescing
  workarounds need to register handlers that return the appropriate
  QEMU_IRQ_* code, and they have to propergate the code across all IRQ
  redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
  apply its workaround. If multiple sinks exist, the source may only
  consider an IRQ coalesced if all other sinks either report
  QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
 
  No real devices are interested whether any of their output lines are
  even connected. This would introduce a new signal type, bidirectional
  multi-level, which is not correct.
 
  I don't think it's really an issue of correct, but I wouldn't disagree
  to a suggestion that we ought to introduce a new signal type for this
  type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
  similar interface as qemu_irq.
  A separate type would complicate the delivery of the feedback value
  across GPIO pins (as Paul requested for the RTC-HPET routing).
 
  I think the real solution to coalescing is put the logic inside one
  device, in this case APIC because it has the information about irq
  delivery. APIC could monitor incoming RTC irqs for frequency
  information and whether they get delivered or not. If not, an internal
  timer is installed which injects the lost irqs.
  That won't fly as the IRQs will already arrive at the APIC with a
  sufficiently high jitter. At the bare minimum, you need to tell the
  interrupt controller about the fact that a particular IRQ should be
  delivered at a specific regular rate. For this, you also need a generic
  interface - nothing really won.
 
  OK, let's simplify: just reinject at next possible chance. No need to
  monitor or tell anything.
 
  There are guests that won't like this (I know of one in-house, but
  others may even have more examples), specifically if you end up firing
  multiple IRQs in a row due to a longer backlog. For that reason, the RTC
  spreads the reinjection according to the current rate.
 
 Then reinject with a constant delay, or next CPU exit. Such buggy
If guest's time frequency is the same as host time frequency you can't
reinject with constant delay. That is why current code mixes two
approaches: reinject M interrupts in a raw then delay.

 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.
Guest instructions cannot be easily counted in KVM (it can be done more
or less reliably using perf counters, may be).

 
  And even if the rate did not matter, the APIC woult still have to now
  about the fact that an IRQ is really periodic and does not only appear
  as such for a certain interval. This really does not sound like
  simplifying things or even make them cleaner.
 
 It would, the voodoo would be contained only in APIC, RTC would be
 just like any other device. With the bidirectional irqs, this voodoo
 would probably eventually spread to many other devices. The logical
 conclusion of that would be a system where all devices would be
 careful not to disturb the guest at wrong moment because that would
 trigger a bug.
 
This voodoo will be so complex and unreliable that it will make RTC hack
pale in comparison (and I still don't see how you are going to make it
actually work). The fact is that timer device is not just like any
other device in virtual world. Any other device is easy: you just
implement spec as close as possible and everything works. For time
source device this is not enough. You can implement RTC+HPET to the
letter and your guest will drift like crazy.

 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?
No.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Gleb Natapov
On Wed, May 26, 2010 at 08:14:38PM +, Blue Swirl wrote:
 On Wed, May 26, 2010 at 8:08 AM, Gleb Natapov g...@redhat.com wrote:
  On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
  
   I think the real solution to coalescing is put the logic inside one
   device, in this case APIC because it has the information about irq
   delivery. APIC could monitor incoming RTC irqs for frequency
   information and whether they get delivered or not. If not, an internal
   timer is installed which injects the lost irqs.
 
  That won't fly as the IRQs will already arrive at the APIC with a
  sufficiently high jitter. At the bare minimum, you need to tell the
  interrupt controller about the fact that a particular IRQ should be
  delivered at a specific regular rate. For this, you also need a generic
  interface - nothing really won.
 
  Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
  were coalesced during some period. Now the guest reprograms RTC to
  1024Hz. N should be scaled accordingly otherwise reinjection will not
  fix the drift. Do you propose to put this logic into APIC to?
 
 Interesting case, I don't think this is handled by the current code
 either.
Of course it is:
if(period != s-period)
s-irq_coalesced = (s-irq_coalesced * s-period) / period;

Qemu is not a toy anymore and dials with real loads in production now.

  Could this happen if the target machine does not have HPET and
 the guest runs a tickless kernel?
 
Windows doesn't have tickles kernel and Linux does not use RTC as time
source.

 I think the guest would be buggy to reprogram the timer without
 checking that the interrupt from the previous timer frequency won't
 interfere, for example by stopping the clock, or doing the
 reprogramming only at timer interrupt handler. Otherwise the period
 may be unpredictable for one period, which means that the time base
 has been lost.
Guest knows nothing about interrupt that was not yet delivered. And
Windows doesn't stop clock while changing frequency.

 
 But let's consider how this could be handled with the current code (or
 with the magical interrupts). When doing the scaling, the guest, while
 reprogramming, is unaware of the old queued interrupts with the
 previous rate. Why would scaling these be more correct? I'd think the
 old ones should be just reinjected ASAP without any scaling.
If old once will be reinjected as is time calculation in the guest
will be incorrect. Windows is really stupid when it comes to time
keeping. It just count interrupts and multiplies them by clock frequency
to calculate the time. So if qemu accumulated 64 coalesced interrupts
while clock frequency was 64Hz it means that the guest 1 second behind
the time. If the guest change frequency to 1024hz reinfecting 64 interrupts
will not fix guest time, 1024 interrupts should be injected instead.

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Blue Swirl
On Wed, May 26, 2010 at 11:26 PM, Paul Brook p...@codesourcery.com wrote:
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?

 The guest doesn't even need to be aware of virtualization. It just needs to be
 able to accommodate the lack of guaranteed realtime behavior.

 The fundamental problem here is that some guest operating systems assume that
 the hardware provides certain realtime guarantees with respect to execution of
 interrupt handlers.  In particular they assume that the CPU will always be
 able to complete execution of the timer IRQ handler before the periodic timer
 triggers again.  In most virtualized environments you have absolutely no
 guarantee of realtime response.

 With Linux guests this was solved a long time ago by the introduction of
 tickless kernels.  These separate the timekeeping from wakeup events, so it
 doesn't matter if several wakeup triggers end up getting merged (either at the
 hardware level or via top/bottom half guest IRQ handlers).


 It's worth mentioning that this problem also occurs on real hardware,
 typically due to lame hardware/drivers which end up masking interrupts or
 otherwise stall the CPU for for long periods of time.


 The PIT hack attempts to workaround broken guests by adding artificial latency
 to the timer event, ensuring that the guest sees them all.  Unfortunately
 guests vary on when it is safe for them to see the next timer event, and
 trying to observe this behavior involves potentially harmful heuristics and
 collusion between unrelated devices (e.g. interrupt controller and timer).

 In some cases we don't even do that, and just reschedule the event some
 arbitrarily small amount of time later. This assumes the guest to do useful
 work in that time. In a single threaded environment this is probably true -
 qemu got enough CPU to inject the first interrupt, so will probably manage to
 execute some guest code before the end of its timeslice. In an environment
 where interrupt processing/delivery and execution of the guest code happen in
 different threads this becomes increasingly likely to fail.

So any voodoo around timer events is doomed to fail in some cases.
What's the amount of hacks what we want then? Is there any generic
solution, like slowing down the guest system to the point where we can
guarantee the interrupt rate vs. CPU execution speed?



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Blue Swirl
2010/5/27 Gleb Natapov g...@redhat.com:
 On Wed, May 26, 2010 at 08:35:00PM +, Blue Swirl wrote:
 On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
  Blue Swirl wrote:
  On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
  Anthony Liguori wrote:
  On 05/25/2010 02:09 PM, Blue Swirl wrote:
  On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:
 
  From: Jan Kiszkajan.kis...@siemens.com
 
  This allows to communicate potential IRQ coalescing during delivery 
  from
  the sink back to the source. Targets that support IRQ coalescing
  workarounds need to register handlers that return the appropriate
  QEMU_IRQ_* code, and they have to propergate the code across all IRQ
  redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
  apply its workaround. If multiple sinks exist, the source may only
  consider an IRQ coalesced if all other sinks either report
  QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
 
  No real devices are interested whether any of their output lines are
  even connected. This would introduce a new signal type, bidirectional
  multi-level, which is not correct.
 
  I don't think it's really an issue of correct, but I wouldn't disagree
  to a suggestion that we ought to introduce a new signal type for this
  type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
  similar interface as qemu_irq.
  A separate type would complicate the delivery of the feedback value
  across GPIO pins (as Paul requested for the RTC-HPET routing).
 
  I think the real solution to coalescing is put the logic inside one
  device, in this case APIC because it has the information about irq
  delivery. APIC could monitor incoming RTC irqs for frequency
  information and whether they get delivered or not. If not, an internal
  timer is installed which injects the lost irqs.
  That won't fly as the IRQs will already arrive at the APIC with a
  sufficiently high jitter. At the bare minimum, you need to tell the
  interrupt controller about the fact that a particular IRQ should be
  delivered at a specific regular rate. For this, you also need a generic
  interface - nothing really won.
 
  OK, let's simplify: just reinject at next possible chance. No need to
  monitor or tell anything.
 
  There are guests that won't like this (I know of one in-house, but
  others may even have more examples), specifically if you end up firing
  multiple IRQs in a row due to a longer backlog. For that reason, the RTC
  spreads the reinjection according to the current rate.

 Then reinject with a constant delay, or next CPU exit. Such buggy
 If guest's time frequency is the same as host time frequency you can't
 reinject with constant delay. That is why current code mixes two
 approaches: reinject M interrupts in a raw then delay.

This approach can be also used by APIC-only version.

 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.
 Guest instructions cannot be easily counted in KVM (it can be done more
 or less reliably using perf counters, may be).

Aren't there any debug registers or perf counters, which can generate
an interrupt after some number of instructions have been executed?


  And even if the rate did not matter, the APIC woult still have to now
  about the fact that an IRQ is really periodic and does not only appear
  as such for a certain interval. This really does not sound like
  simplifying things or even make them cleaner.

 It would, the voodoo would be contained only in APIC, RTC would be
 just like any other device. With the bidirectional irqs, this voodoo
 would probably eventually spread to many other devices. The logical
 conclusion of that would be a system where all devices would be
 careful not to disturb the guest at wrong moment because that would
 trigger a bug.

 This voodoo will be so complex and unreliable that it will make RTC hack
 pale in comparison (and I still don't see how you are going to make it
 actually work).

Implement everything inside APIC: only coalescing and reinjection.
Maybe that version would not bend backwards as much as the current to
cater for buggy hosts.

 The fact is that timer device is not just like any
 other device in virtual world. Any other device is easy: you just
 implement spec as close as possible and everything works. For time
 source device this is not enough. You can implement RTC+HPET to the
 letter and your guest will drift like crazy.

It's doable: a cycle accurate emulator will not cause any drift,
without any voodoo. The interrupts would come after executing the same
instruction as the real HW. For emulating any sufficiently buggy
guests in any sufficiently desperate low resource conditions, this may
be the only option that will always work.


 At the other extreme, would it be possible 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Jan Kiszka
Blue Swirl wrote:
 On Wed, May 26, 2010 at 11:26 PM, Paul Brook p...@codesourcery.com wrote:
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?
 The guest doesn't even need to be aware of virtualization. It just needs to 
 be
 able to accommodate the lack of guaranteed realtime behavior.

 The fundamental problem here is that some guest operating systems assume that
 the hardware provides certain realtime guarantees with respect to execution 
 of
 interrupt handlers.  In particular they assume that the CPU will always be
 able to complete execution of the timer IRQ handler before the periodic timer
 triggers again.  In most virtualized environments you have absolutely no
 guarantee of realtime response.

 With Linux guests this was solved a long time ago by the introduction of
 tickless kernels.  These separate the timekeeping from wakeup events, so it
 doesn't matter if several wakeup triggers end up getting merged (either at 
 the
 hardware level or via top/bottom half guest IRQ handlers).


 It's worth mentioning that this problem also occurs on real hardware,
 typically due to lame hardware/drivers which end up masking interrupts or
 otherwise stall the CPU for for long periods of time.


 The PIT hack attempts to workaround broken guests by adding artificial 
 latency
 to the timer event, ensuring that the guest sees them all.  Unfortunately
 guests vary on when it is safe for them to see the next timer event, and
 trying to observe this behavior involves potentially harmful heuristics and
 collusion between unrelated devices (e.g. interrupt controller and timer).

 In some cases we don't even do that, and just reschedule the event some
 arbitrarily small amount of time later. This assumes the guest to do useful
 work in that time. In a single threaded environment this is probably true -
 qemu got enough CPU to inject the first interrupt, so will probably manage to
 execute some guest code before the end of its timeslice. In an environment
 where interrupt processing/delivery and execution of the guest code happen in
 different threads this becomes increasingly likely to fail.
 
 So any voodoo around timer events is doomed to fail in some cases.
 What's the amount of hacks what we want then? Is there any generic

The aim of this patch is to reduce the amount of existing and upcoming
hacks. It may still require some refinements, but I think we haven't
found any smarter approach yet that fits existing use cases.

 solution, like slowing down the guest system to the point where we can
 guarantee the interrupt rate vs. CPU execution speed?

That's generally a non-option in virtualized production environments.
Specifically if the guest system lost interrupts due to host
overcommitment, you do not want it slow down even further.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Blue Swirl
On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Wed, May 26, 2010 at 11:26 PM, Paul Brook p...@codesourcery.com wrote:
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?
 The guest doesn't even need to be aware of virtualization. It just needs to 
 be
 able to accommodate the lack of guaranteed realtime behavior.

 The fundamental problem here is that some guest operating systems assume 
 that
 the hardware provides certain realtime guarantees with respect to execution 
 of
 interrupt handlers.  In particular they assume that the CPU will always be
 able to complete execution of the timer IRQ handler before the periodic 
 timer
 triggers again.  In most virtualized environments you have absolutely no
 guarantee of realtime response.

 With Linux guests this was solved a long time ago by the introduction of
 tickless kernels.  These separate the timekeeping from wakeup events, so it
 doesn't matter if several wakeup triggers end up getting merged (either at 
 the
 hardware level or via top/bottom half guest IRQ handlers).


 It's worth mentioning that this problem also occurs on real hardware,
 typically due to lame hardware/drivers which end up masking interrupts or
 otherwise stall the CPU for for long periods of time.


 The PIT hack attempts to workaround broken guests by adding artificial 
 latency
 to the timer event, ensuring that the guest sees them all.  Unfortunately
 guests vary on when it is safe for them to see the next timer event, and
 trying to observe this behavior involves potentially harmful heuristics and
 collusion between unrelated devices (e.g. interrupt controller and timer).

 In some cases we don't even do that, and just reschedule the event some
 arbitrarily small amount of time later. This assumes the guest to do useful
 work in that time. In a single threaded environment this is probably true -
 qemu got enough CPU to inject the first interrupt, so will probably manage 
 to
 execute some guest code before the end of its timeslice. In an environment
 where interrupt processing/delivery and execution of the guest code happen 
 in
 different threads this becomes increasingly likely to fail.

 So any voodoo around timer events is doomed to fail in some cases.
 What's the amount of hacks what we want then? Is there any generic

 The aim of this patch is to reduce the amount of existing and upcoming
 hacks. It may still require some refinements, but I think we haven't
 found any smarter approach yet that fits existing use cases.

I don't feel we have tried other possibilities hard enough.

 solution, like slowing down the guest system to the point where we can
 guarantee the interrupt rate vs. CPU execution speed?

 That's generally a non-option in virtualized production environments.
 Specifically if the guest system lost interrupts due to host
 overcommitment, you do not want it slow down even further.

I meant that the guest time could be scaled down, for example 2s in
wall clock time would be presented to the guest as 1s. Then the amount
of CPU cycles between timer interrupts would increase and hopefully
the guest can keep up. If the guest sleeps, time base could be
accelerated to catch up with wall clock and then set back to 1:1 rate.

Slowing down could be triggered by measuring the guest load (for
example, by checking for presence of halt instructions), if it's close
to 1, time would be slowed down. If the guest starts to issue halt
instructions because it's more idle, we can increase speed.

If this approach worked, even APIC could be made ignorant about
coalescing voodoo so it should be a major cleanup.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Jan Kiszka
Blue Swirl wrote:
 On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Wed, May 26, 2010 at 11:26 PM, Paul Brook p...@codesourcery.com wrote:
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?
 The guest doesn't even need to be aware of virtualization. It just needs 
 to be
 able to accommodate the lack of guaranteed realtime behavior.

 The fundamental problem here is that some guest operating systems assume 
 that
 the hardware provides certain realtime guarantees with respect to 
 execution of
 interrupt handlers.  In particular they assume that the CPU will always be
 able to complete execution of the timer IRQ handler before the periodic 
 timer
 triggers again.  In most virtualized environments you have absolutely no
 guarantee of realtime response.

 With Linux guests this was solved a long time ago by the introduction of
 tickless kernels.  These separate the timekeeping from wakeup events, so it
 doesn't matter if several wakeup triggers end up getting merged (either at 
 the
 hardware level or via top/bottom half guest IRQ handlers).


 It's worth mentioning that this problem also occurs on real hardware,
 typically due to lame hardware/drivers which end up masking interrupts or
 otherwise stall the CPU for for long periods of time.


 The PIT hack attempts to workaround broken guests by adding artificial 
 latency
 to the timer event, ensuring that the guest sees them all.  Unfortunately
 guests vary on when it is safe for them to see the next timer event, and
 trying to observe this behavior involves potentially harmful heuristics and
 collusion between unrelated devices (e.g. interrupt controller and timer).

 In some cases we don't even do that, and just reschedule the event some
 arbitrarily small amount of time later. This assumes the guest to do useful
 work in that time. In a single threaded environment this is probably true -
 qemu got enough CPU to inject the first interrupt, so will probably manage 
 to
 execute some guest code before the end of its timeslice. In an environment
 where interrupt processing/delivery and execution of the guest code happen 
 in
 different threads this becomes increasingly likely to fail.
 So any voodoo around timer events is doomed to fail in some cases.
 What's the amount of hacks what we want then? Is there any generic
 The aim of this patch is to reduce the amount of existing and upcoming
 hacks. It may still require some refinements, but I think we haven't
 found any smarter approach yet that fits existing use cases.
 
 I don't feel we have tried other possibilities hard enough.

Well, seeing prototypes wouldn't be bad, also to run real load againt
them. But at least I'm currently clueless what to implement.

 
 solution, like slowing down the guest system to the point where we can
 guarantee the interrupt rate vs. CPU execution speed?
 That's generally a non-option in virtualized production environments.
 Specifically if the guest system lost interrupts due to host
 overcommitment, you do not want it slow down even further.
 
 I meant that the guest time could be scaled down, for example 2s in
 wall clock time would be presented to the guest as 1s.

But that is precisely what already happens when the guest loses timer
interrupts. There is no other time source for this kind of guests -
often except for some external events generated by systems which you
don't want to fall behind arbitrarily.

 Then the amount
 of CPU cycles between timer interrupts would increase and hopefully
 the guest can keep up. If the guest sleeps, time base could be
 accelerated to catch up with wall clock and then set back to 1:1 rate.

Can't follow you ATM, sorry. What should be slowed down then? And how
precisely?

Jan

 
 Slowing down could be triggered by measuring the guest load (for
 example, by checking for presence of halt instructions), if it's close
 to 1, time would be slowed down. If the guest starts to issue halt
 instructions because it's more idle, we can increase speed.
 
 If this approach worked, even APIC could be made ignorant about
 coalescing voodoo so it should be a major cleanup.




signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Blue Swirl
On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Wed, May 26, 2010 at 11:26 PM, Paul Brook p...@codesourcery.com wrote:
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?
 The guest doesn't even need to be aware of virtualization. It just needs 
 to be
 able to accommodate the lack of guaranteed realtime behavior.

 The fundamental problem here is that some guest operating systems assume 
 that
 the hardware provides certain realtime guarantees with respect to 
 execution of
 interrupt handlers.  In particular they assume that the CPU will always be
 able to complete execution of the timer IRQ handler before the periodic 
 timer
 triggers again.  In most virtualized environments you have absolutely no
 guarantee of realtime response.

 With Linux guests this was solved a long time ago by the introduction of
 tickless kernels.  These separate the timekeeping from wakeup events, so 
 it
 doesn't matter if several wakeup triggers end up getting merged (either 
 at the
 hardware level or via top/bottom half guest IRQ handlers).


 It's worth mentioning that this problem also occurs on real hardware,
 typically due to lame hardware/drivers which end up masking interrupts or
 otherwise stall the CPU for for long periods of time.


 The PIT hack attempts to workaround broken guests by adding artificial 
 latency
 to the timer event, ensuring that the guest sees them all.  
 Unfortunately
 guests vary on when it is safe for them to see the next timer event, and
 trying to observe this behavior involves potentially harmful heuristics 
 and
 collusion between unrelated devices (e.g. interrupt controller and timer).

 In some cases we don't even do that, and just reschedule the event some
 arbitrarily small amount of time later. This assumes the guest to do 
 useful
 work in that time. In a single threaded environment this is probably true 
 -
 qemu got enough CPU to inject the first interrupt, so will probably 
 manage to
 execute some guest code before the end of its timeslice. In an environment
 where interrupt processing/delivery and execution of the guest code 
 happen in
 different threads this becomes increasingly likely to fail.
 So any voodoo around timer events is doomed to fail in some cases.
 What's the amount of hacks what we want then? Is there any generic
 The aim of this patch is to reduce the amount of existing and upcoming
 hacks. It may still require some refinements, but I think we haven't
 found any smarter approach yet that fits existing use cases.

 I don't feel we have tried other possibilities hard enough.

 Well, seeing prototypes wouldn't be bad, also to run real load againt
 them. But at least I'm currently clueless what to implement.

Perhaps now is then not the time to rush to implement something, but
to brainstorm for a clean solution.


 solution, like slowing down the guest system to the point where we can
 guarantee the interrupt rate vs. CPU execution speed?
 That's generally a non-option in virtualized production environments.
 Specifically if the guest system lost interrupts due to host
 overcommitment, you do not want it slow down even further.

 I meant that the guest time could be scaled down, for example 2s in
 wall clock time would be presented to the guest as 1s.

 But that is precisely what already happens when the guest loses timer
 interrupts. There is no other time source for this kind of guests -
 often except for some external events generated by systems which you
 don't want to fall behind arbitrarily.

 Then the amount
 of CPU cycles between timer interrupts would increase and hopefully
 the guest can keep up. If the guest sleeps, time base could be
 accelerated to catch up with wall clock and then set back to 1:1 rate.

 Can't follow you ATM, sorry. What should be slowed down then? And how
 precisely?

I think vm_clock and everything that depends on vm_clock, also
rtc_clock should be tied to vm_clock in this mode, not host_clock.


 Jan


 Slowing down could be triggered by measuring the guest load (for
 example, by checking for presence of halt instructions), if it's close
 to 1, time would be slowed down. If the guest starts to issue halt
 instructions because it's more idle, we can increase speed.

 If this approach worked, even APIC could be made ignorant about
 coalescing voodoo so it should be a major cleanup.






Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Paul Brook
  In some cases we don't even do that, and just reschedule the event some
  arbitrarily small amount of time later. This assumes the guest to do
  useful work in that time. In a single threaded environment this is
  probably true - qemu got enough CPU to inject the first interrupt, so
  will probably manage to execute some guest code before the end of its
  timeslice. In an environment where interrupt processing/delivery and
  execution of the guest code happen in different threads this becomes
  increasingly likely to fail.
 
 So any voodoo around timer events is doomed to fail in some cases.

Depends on the level of voodoo.
My guess is that common guest operating systems require hacks which result in 
demonstrably incorrect behavior.

 What's the amount of hacks what we want then? Is there any generic
 solution, like slowing down the guest system to the point where we can
 guarantee the interrupt rate vs. CPU execution speed?

The -icount N option gives deterministic virtual realtime behavior, however 
teh guuest if completely decoupled from real-world time.
The -icount auto option gives semi-deterministic behavior while maintaining 
overall consistency with the real world.  This may introduce some small-scale 
time jitter, but will still satisfy all but the most demanding hard-real-time 
assumptions.

Neither of these options work with KVM. It may be possible to implement 
something using using performance counters.  I don't know how much additional 
overhead this would involve.

Paul



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Jan Kiszka
Blue Swirl wrote:
 On Thu, May 27, 2010 at 7:08 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Thu, May 27, 2010 at 6:31 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Wed, May 26, 2010 at 11:26 PM, Paul Brook p...@codesourcery.com 
 wrote:
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?
 The guest doesn't even need to be aware of virtualization. It just needs 
 to be
 able to accommodate the lack of guaranteed realtime behavior.

 The fundamental problem here is that some guest operating systems assume 
 that
 the hardware provides certain realtime guarantees with respect to 
 execution of
 interrupt handlers.  In particular they assume that the CPU will always 
 be
 able to complete execution of the timer IRQ handler before the periodic 
 timer
 triggers again.  In most virtualized environments you have absolutely no
 guarantee of realtime response.

 With Linux guests this was solved a long time ago by the introduction of
 tickless kernels.  These separate the timekeeping from wakeup events, so 
 it
 doesn't matter if several wakeup triggers end up getting merged (either 
 at the
 hardware level or via top/bottom half guest IRQ handlers).


 It's worth mentioning that this problem also occurs on real hardware,
 typically due to lame hardware/drivers which end up masking interrupts or
 otherwise stall the CPU for for long periods of time.


 The PIT hack attempts to workaround broken guests by adding artificial 
 latency
 to the timer event, ensuring that the guest sees them all.  
 Unfortunately
 guests vary on when it is safe for them to see the next timer event, and
 trying to observe this behavior involves potentially harmful heuristics 
 and
 collusion between unrelated devices (e.g. interrupt controller and 
 timer).

 In some cases we don't even do that, and just reschedule the event some
 arbitrarily small amount of time later. This assumes the guest to do 
 useful
 work in that time. In a single threaded environment this is probably 
 true -
 qemu got enough CPU to inject the first interrupt, so will probably 
 manage to
 execute some guest code before the end of its timeslice. In an 
 environment
 where interrupt processing/delivery and execution of the guest code 
 happen in
 different threads this becomes increasingly likely to fail.
 So any voodoo around timer events is doomed to fail in some cases.
 What's the amount of hacks what we want then? Is there any generic
 The aim of this patch is to reduce the amount of existing and upcoming
 hacks. It may still require some refinements, but I think we haven't
 found any smarter approach yet that fits existing use cases.
 I don't feel we have tried other possibilities hard enough.
 Well, seeing prototypes wouldn't be bad, also to run real load againt
 them. But at least I'm currently clueless what to implement.
 
 Perhaps now is then not the time to rush to implement something, but
 to brainstorm for a clean solution.

And sometimes it can help to understand how ideas could even be improved
or why others doesn't work at all.

 
 solution, like slowing down the guest system to the point where we can
 guarantee the interrupt rate vs. CPU execution speed?
 That's generally a non-option in virtualized production environments.
 Specifically if the guest system lost interrupts due to host
 overcommitment, you do not want it slow down even further.
 I meant that the guest time could be scaled down, for example 2s in
 wall clock time would be presented to the guest as 1s.
 But that is precisely what already happens when the guest loses timer
 interrupts. There is no other time source for this kind of guests -
 often except for some external events generated by systems which you
 don't want to fall behind arbitrarily.

 Then the amount
 of CPU cycles between timer interrupts would increase and hopefully
 the guest can keep up. If the guest sleeps, time base could be
 accelerated to catch up with wall clock and then set back to 1:1 rate.
 Can't follow you ATM, sorry. What should be slowed down then? And how
 precisely?
 
 I think vm_clock and everything that depends on vm_clock, also
 rtc_clock should be tied to vm_clock in this mode, not host_clock.

Let me check if I got this idea correctly: Instead of tuning just the
tick frequency of the affected timer device / sending its backlog in a
row, you rather want to tune the vm_clock correspondingly? Maybe a way
to abstract the required logic currently sitting only in the RTC for use
by other timer sources as well.

But just switching rtc_clock to vm_clock when the user wants host_clock
is obviously not an option. We would rather have to tune host_clock in
parallel.

Still, this does not answer:

- How do you want to detect lost timer ticks?

- What subsystem(s) keeps track of the backlog?

- And depending on the above: How to detect at all that a specific IRQ
  is a timer tick?

Jan




Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-27 Thread Paul Brook

  Then the amount
  of CPU cycles between timer interrupts would increase and hopefully
  the guest can keep up. If the guest sleeps, time base could be
  accelerated to catch up with wall clock and then set back to 1:1 rate.
  
  Can't follow you ATM, sorry. What should be slowed down then? And how
  precisely?
 
 I think vm_clock and everything that depends on vm_clock, also
 rtc_clock should be tied to vm_clock in this mode, not host_clock.

The problem is more fundamental than that. There is no real correlation 
between vm_clock and the amount of code executed by the guest, especially not 
on timescales less than a second.

Paul



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Gleb Natapov
On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
  
  I think the real solution to coalescing is put the logic inside one
  device, in this case APIC because it has the information about irq
  delivery. APIC could monitor incoming RTC irqs for frequency
  information and whether they get delivered or not. If not, an internal
  timer is installed which injects the lost irqs.
 
 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a generic
 interface - nothing really won.
 
Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
were coalesced during some period. Now the guest reprograms RTC to
1024Hz. N should be scaled accordingly otherwise reinjection will not
fix the drift. Do you propose to put this logic into APIC to?

--
Gleb.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Blue Swirl
On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
 Anthony Liguori wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:
 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:

 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
 apply its workaround. If multiple sinks exist, the source may only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

 No real devices are interested whether any of their output lines are
 even connected. This would introduce a new signal type, bidirectional
 multi-level, which is not correct.


 I don't think it's really an issue of correct, but I wouldn't disagree
 to a suggestion that we ought to introduce a new signal type for this
 type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
 similar interface as qemu_irq.

 A separate type would complicate the delivery of the feedback value
 across GPIO pins (as Paul requested for the RTC-HPET routing).


 I think the real solution to coalescing is put the logic inside one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an internal
 timer is installed which injects the lost irqs.

 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a generic
 interface - nothing really won.

OK, let's simplify: just reinject at next possible chance. No need to
monitor or tell anything.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Blue Swirl
On Tue, May 25, 2010 at 8:16 PM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:

 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:


 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
 apply its workaround. If multiple sinks exist, the source may only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.


 No real devices are interested whether any of their output lines are
 even connected. This would introduce a new signal type, bidirectional
 multi-level, which is not correct.


 I don't think it's really an issue of correct, but I wouldn't disagree to a
 suggestion that we ought to introduce a new signal type for this type of
 bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a similar
 interface as qemu_irq.

I'd rather try harder to find a solution for the problem without this.

 I think the real solution to coalescing is put the logic inside one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an internal
 timer is installed which injects the lost irqs.

 Of course, no real device could do such de-coalescing, but with this
 approach, the voodoo is contained to insides of one device, APIC.

 We should also take a step back to think what was the cause of lost
 irqs, IIRC uneven execution rate in QEMU.

 Not only that.  The pathological case is where a host is limited to a 1khz
 timer frequency and the guest requests a 1khz timer frequency.  Practically
 speaking, there is no way we'll ever be able to adjust timers to reinject
 lost interrupts because of the host timer limitation.

But APIC knows when the guest has acked a timer irq and so can
immediately issue another one to cover a lost irq. APIC could also
arrange a CPU exit. The irqs would not arrive at even rate, but in
this pathological case, the host can't ever schedule any delays less
than 1ms.

  Could this be fixed or taken
 into account in timer handling? For example, CPU loop could analyze
 the wall clock time between CPU exits and use that to offset the
 timers. Thus the timer frequency (in wall clock time) could be made to
 correspond a bit more to VCPU execution rate.


 A lot of what motivates the timer reinjection work is very old linux kernels
 that had fixed userspace timer frequencies.  On newer host kernels, it's
 probably not nearly as important except when you get into pathological cases
 like exposing a high frequency HPET timer to the guest where you cannot keep
 up with the host.

There will always be some delays and overhead because of the
emulation. We can either hide this completely from the guest (by
stretching the guest time) or we can try to keep the guest time base
in synch with host but then the guest can't expect native class
performance to be available between timer irqs.

On a native hardware, you could have a 1kHz timer and a kernel task
always running for max. 999us with interrupts disabled, then for at
least 1us the interrupts are enabled and the timer interrupt handler
gets a chance to run. On QEMU, such a small time window may be
impossible to emulate. No coalescing would help in this case. With a
cycle counting emulation it's possible, but the speed would probably
be slower than native. With the interrupt coalescing you may be able
to cover some easy cases.


 Regards,

 Anthony Liguori

 Signed-off-by: Jan Kiszkajan.kis...@siemens.com
 ---
  hw/irq.c |   38 +-
  hw/irq.h |   22 +++---
  2 files changed, 44 insertions(+), 16 deletions(-)

 diff --git a/hw/irq.c b/hw/irq.c
 index 7703f62..db2cce6 100644
 --- a/hw/irq.c
 +++ b/hw/irq.c
 @@ -26,19 +26,27 @@

  struct IRQState {
     qemu_irq_handler handler;
 +    qemu_irq_fb_handler feedback_handler;
     void *opaque;
     int n;
  };

 -void qemu_set_irq(qemu_irq irq, int level)
 +int qemu_set_irq(qemu_irq irq, int level)
  {
 -    if (!irq)
 -        return;
 -
 -    irq-handler(irq-opaque, irq-n, level);
 +    if (!irq) {
 +        return 0;
 +    }
 +    if (irq-feedback_handler) {
 +        return irq-feedback_handler(irq-opaque, irq-n, level);
 +    } else {
 +        irq-handler(irq-opaque, irq-n, level);
 +        return QEMU_IRQ_DELIVERED;
 +    }
  }

 -qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int
 n)
 +static qemu_irq *allocate_irqs(qemu_irq_handler handler,
 +                               qemu_irq_fb_handler feedback_handler,
 +                               void *opaque, 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Jan Kiszka
Blue Swirl wrote:
 On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
 Anthony Liguori wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:
 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:

 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
 apply its workaround. If multiple sinks exist, the source may only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

 No real devices are interested whether any of their output lines are
 even connected. This would introduce a new signal type, bidirectional
 multi-level, which is not correct.

 I don't think it's really an issue of correct, but I wouldn't disagree
 to a suggestion that we ought to introduce a new signal type for this
 type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
 similar interface as qemu_irq.
 A separate type would complicate the delivery of the feedback value
 across GPIO pins (as Paul requested for the RTC-HPET routing).

 I think the real solution to coalescing is put the logic inside one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an internal
 timer is installed which injects the lost irqs.
 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a generic
 interface - nothing really won.
 
 OK, let's simplify: just reinject at next possible chance. No need to
 monitor or tell anything.

There are guests that won't like this (I know of one in-house, but
others may even have more examples), specifically if you end up firing
multiple IRQs in a row due to a longer backlog. For that reason, the RTC
spreads the reinjection according to the current rate.

And even if the rate did not matter, the APIC woult still have to now
about the fact that an IRQ is really periodic and does not only appear
as such for a certain interval. This really does not sound like
simplifying things or even make them cleaner.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Blue Swirl
On Wed, May 26, 2010 at 8:08 AM, Gleb Natapov g...@redhat.com wrote:
 On Tue, May 25, 2010 at 11:44:50PM +0200, Jan Kiszka wrote:
 
  I think the real solution to coalescing is put the logic inside one
  device, in this case APIC because it has the information about irq
  delivery. APIC could monitor incoming RTC irqs for frequency
  information and whether they get delivered or not. If not, an internal
  timer is installed which injects the lost irqs.

 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a generic
 interface - nothing really won.

 Not only that. Suppose RTC runs with 64Hz frequency and N interrupts
 were coalesced during some period. Now the guest reprograms RTC to
 1024Hz. N should be scaled accordingly otherwise reinjection will not
 fix the drift. Do you propose to put this logic into APIC to?

Interesting case, I don't think this is handled by the current code
either. Could this happen if the target machine does not have HPET and
the guest runs a tickless kernel?

I think the guest would be buggy to reprogram the timer without
checking that the interrupt from the previous timer frequency won't
interfere, for example by stopping the clock, or doing the
reprogramming only at timer interrupt handler. Otherwise the period
may be unpredictable for one period, which means that the time base
has been lost.

But let's consider how this could be handled with the current code (or
with the magical interrupts). When doing the scaling, the guest, while
reprogramming, is unaware of the old queued interrupts with the
previous rate. Why would scaling these be more correct? I'd think the
old ones should be just reinjected ASAP without any scaling.



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Blue Swirl
On Wed, May 26, 2010 at 8:09 PM, Jan Kiszka jan.kis...@web.de wrote:
 Blue Swirl wrote:
 On Tue, May 25, 2010 at 9:44 PM, Jan Kiszka jan.kis...@web.de wrote:
 Anthony Liguori wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:
 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:

 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
 apply its workaround. If multiple sinks exist, the source may only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

 No real devices are interested whether any of their output lines are
 even connected. This would introduce a new signal type, bidirectional
 multi-level, which is not correct.

 I don't think it's really an issue of correct, but I wouldn't disagree
 to a suggestion that we ought to introduce a new signal type for this
 type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
 similar interface as qemu_irq.
 A separate type would complicate the delivery of the feedback value
 across GPIO pins (as Paul requested for the RTC-HPET routing).

 I think the real solution to coalescing is put the logic inside one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an internal
 timer is installed which injects the lost irqs.
 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a generic
 interface - nothing really won.

 OK, let's simplify: just reinject at next possible chance. No need to
 monitor or tell anything.

 There are guests that won't like this (I know of one in-house, but
 others may even have more examples), specifically if you end up firing
 multiple IRQs in a row due to a longer backlog. For that reason, the RTC
 spreads the reinjection according to the current rate.

Then reinject with a constant delay, or next CPU exit. Such buggy
guests could also be assisted with special handling (like win2k
install hack), for example guest instructions could be counted
(approximately, for example using TB size or TSC) and only inject
after at least N instructions have passed.

 And even if the rate did not matter, the APIC woult still have to now
 about the fact that an IRQ is really periodic and does not only appear
 as such for a certain interval. This really does not sound like
 simplifying things or even make them cleaner.

It would, the voodoo would be contained only in APIC, RTC would be
just like any other device. With the bidirectional irqs, this voodoo
would probably eventually spread to many other devices. The logical
conclusion of that would be a system where all devices would be
careful not to disturb the guest at wrong moment because that would
trigger a bug.

At the other extreme, would it be possible to make the educated guests
aware of the virtualization also in clock aspect: virtio-clock?



Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Jan Kiszka
Blue Swirl wrote:
 I think the real solution to coalescing is put the logic inside one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an internal
 timer is installed which injects the lost irqs.
 That won't fly as the IRQs will already arrive at the APIC with a
 sufficiently high jitter. At the bare minimum, you need to tell the
 interrupt controller about the fact that a particular IRQ should be
 delivered at a specific regular rate. For this, you also need a generic
 interface - nothing really won.
 OK, let's simplify: just reinject at next possible chance. No need to
 monitor or tell anything.
 There are guests that won't like this (I know of one in-house, but
 others may even have more examples), specifically if you end up firing
 multiple IRQs in a row due to a longer backlog. For that reason, the RTC
 spreads the reinjection according to the current rate.
 
 Then reinject with a constant delay, or next CPU exit. Such buggy
 guests could also be assisted with special handling (like win2k
 install hack), for example guest instructions could be counted
 (approximately, for example using TB size or TSC) and only inject
 after at least N instructions have passed.

Let's assume that would be an alternative...

 
 And even if the rate did not matter, the APIC woult still have to now
 about the fact that an IRQ is really periodic and does not only appear
 as such for a certain interval. This really does not sound like
 simplifying things or even make them cleaner.
 
 It would, the voodoo would be contained only in APIC, RTC would be
 just like any other device.

...we would still need that communication channel from the RTC to the
APIC to tell the latter about entering/leaving periodic IRQ generation.
And as we don't want the RTC requiring any clue about who's finally
delivering the IRQ, we still need some generic interface.

This cannot be contained in the APIC - even leaving the PIC as another
IRQ delivery service aside for now, and maybe other virtualized
architectures.

 With the bidirectional irqs, this voodoo
 would probably eventually spread to many other devices.

Yes, the HPET was reported to need this feature as well.

 The logical
 conclusion of that would be a system where all devices would be
 careful not to disturb the guest at wrong moment because that would
 trigger a bug.
 
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?

Paravirtual clocks are around since ages (e.g. kvm-clock), just
unfortunately not long enough to find them supported by legacy guests.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-26 Thread Paul Brook
 At the other extreme, would it be possible to make the educated guests
 aware of the virtualization also in clock aspect: virtio-clock?

The guest doesn't even need to be aware of virtualization. It just needs to be 
able to accommodate the lack of guaranteed realtime behavior.

The fundamental problem here is that some guest operating systems assume that 
the hardware provides certain realtime guarantees with respect to execution of 
interrupt handlers.  In particular they assume that the CPU will always be 
able to complete execution of the timer IRQ handler before the periodic timer 
triggers again.  In most virtualized environments you have absolutely no 
guarantee of realtime response.

With Linux guests this was solved a long time ago by the introduction of 
tickless kernels.  These separate the timekeeping from wakeup events, so it 
doesn't matter if several wakeup triggers end up getting merged (either at the 
hardware level or via top/bottom half guest IRQ handlers).


It's worth mentioning that this problem also occurs on real hardware, 
typically due to lame hardware/drivers which end up masking interrupts or 
otherwise stall the CPU for for long periods of time.


The PIT hack attempts to workaround broken guests by adding artificial latency 
to the timer event, ensuring that the guest sees them all.  Unfortunately 
guests vary on when it is safe for them to see the next timer event, and 
trying to observe this behavior involves potentially harmful heuristics and 
collusion between unrelated devices (e.g. interrupt controller and timer).

In some cases we don't even do that, and just reschedule the event some 
arbitrarily small amount of time later. This assumes the guest to do useful 
work in that time. In a single threaded environment this is probably true - 
qemu got enough CPU to inject the first interrupt, so will probably manage to 
execute some guest code before the end of its timeslice. In an environment 
where interrupt processing/delivery and execution of the guest code happen in 
different threads this becomes increasingly likely to fail.

Paul



[Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-25 Thread Blue Swirl
On Mon, May 24, 2010 at 8:13 PM, Jan Kiszka jan.kis...@web.de wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
 apply its workaround. If multiple sinks exist, the source may only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.

No real devices are interested whether any of their output lines are
even connected. This would introduce a new signal type, bidirectional
multi-level, which is not correct.

I think the real solution to coalescing is put the logic inside one
device, in this case APIC because it has the information about irq
delivery. APIC could monitor incoming RTC irqs for frequency
information and whether they get delivered or not. If not, an internal
timer is installed which injects the lost irqs.

Of course, no real device could do such de-coalescing, but with this
approach, the voodoo is contained to insides of one device, APIC.

We should also take a step back to think what was the cause of lost
irqs, IIRC uneven execution rate in QEMU. Could this be fixed or taken
into account in timer handling? For example, CPU loop could analyze
the wall clock time between CPU exits and use that to offset the
timers. Thus the timer frequency (in wall clock time) could be made to
correspond a bit more to VCPU execution rate.


 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  hw/irq.c |   38 +-
  hw/irq.h |   22 +++---
  2 files changed, 44 insertions(+), 16 deletions(-)

 diff --git a/hw/irq.c b/hw/irq.c
 index 7703f62..db2cce6 100644
 --- a/hw/irq.c
 +++ b/hw/irq.c
 @@ -26,19 +26,27 @@

  struct IRQState {
     qemu_irq_handler handler;
 +    qemu_irq_fb_handler feedback_handler;
     void *opaque;
     int n;
  };

 -void qemu_set_irq(qemu_irq irq, int level)
 +int qemu_set_irq(qemu_irq irq, int level)
  {
 -    if (!irq)
 -        return;
 -
 -    irq-handler(irq-opaque, irq-n, level);
 +    if (!irq) {
 +        return 0;
 +    }
 +    if (irq-feedback_handler) {
 +        return irq-feedback_handler(irq-opaque, irq-n, level);
 +    } else {
 +        irq-handler(irq-opaque, irq-n, level);
 +        return QEMU_IRQ_DELIVERED;
 +    }
  }

 -qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
 +static qemu_irq *allocate_irqs(qemu_irq_handler handler,
 +                               qemu_irq_fb_handler feedback_handler,
 +                               void *opaque, int n)
  {
     qemu_irq *s;
     struct IRQState *p;
 @@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void 
 *opaque, int n)
     p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
     for (i = 0; i  n; i++) {
         p-handler = handler;
 +        p-feedback_handler = feedback_handler;
         p-opaque = opaque;
         p-n = i;
         s[i] = p;
 @@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, 
 void *opaque, int n)
     return s;
  }

 +qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
 +{
 +    return allocate_irqs(handler, NULL, opaque, n);
 +}
 +
 +qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
 +                                      void *opaque, int n)
 +{
 +    return allocate_irqs(NULL, handler, opaque, n);
 +}
 +
  void qemu_free_irqs(qemu_irq *s)
  {
     qemu_free(s[0]);
     qemu_free(s);
  }

 -static void qemu_notirq(void *opaque, int line, int level)
 +static int qemu_notirq(void *opaque, int line, int level)
  {
     struct IRQState *irq = opaque;

 -    irq-handler(irq-opaque, irq-n, !level);
 +    return qemu_set_irq(irq, !level);
  }

  qemu_irq qemu_irq_invert(qemu_irq irq)
  {
     /* The default state for IRQs is low, so raise the output now.  */
     qemu_irq_raise(irq);
 -    return qemu_allocate_irqs(qemu_notirq, irq, 1)[0];
 +    return allocate_irqs(NULL, qemu_notirq, irq, 1)[0];
  }
 diff --git a/hw/irq.h b/hw/irq.h
 index 5daae44..eee03e6 100644
 --- a/hw/irq.h
 +++ b/hw/irq.h
 @@ -3,15 +3,18 @@

  /* Generic IRQ/GPIO pin infrastructure.  */

 -/* FIXME: Rmove one of these.  */
 +#define QEMU_IRQ_DELIVERED      0
 +#define QEMU_IRQ_COALESCED      (-1)
 +#define QEMU_IRQ_MASKED         (-2)
 +
  typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
 -typedef void SetIRQFunc(void *opaque, int irq_num, int level);
 +typedef int (*qemu_irq_fb_handler)(void *opaque, int n, int level);

 -void qemu_set_irq(qemu_irq irq, int level);
 +int qemu_set_irq(qemu_irq irq, int level);

 -static inline void qemu_irq_raise(qemu_irq irq)
 +static inline int qemu_irq_raise(qemu_irq irq)
  {
 -    qemu_set_irq(irq, 1);
 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-25 Thread Anthony Liguori

On 05/25/2010 02:09 PM, Blue Swirl wrote:

On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:
   

From: Jan Kiszkajan.kis...@siemens.com

This allows to communicate potential IRQ coalescing during delivery from
the sink back to the source. Targets that support IRQ coalescing
workarounds need to register handlers that return the appropriate
QEMU_IRQ_* code, and they have to propergate the code across all IRQ
redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
apply its workaround. If multiple sinks exist, the source may only
consider an IRQ coalesced if all other sinks either report
QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
 

No real devices are interested whether any of their output lines are
even connected. This would introduce a new signal type, bidirectional
multi-level, which is not correct.
   


I don't think it's really an issue of correct, but I wouldn't disagree 
to a suggestion that we ought to introduce a new signal type for this 
type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a 
similar interface as qemu_irq.



I think the real solution to coalescing is put the logic inside one
device, in this case APIC because it has the information about irq
delivery. APIC could monitor incoming RTC irqs for frequency
information and whether they get delivered or not. If not, an internal
timer is installed which injects the lost irqs.

Of course, no real device could do such de-coalescing, but with this
approach, the voodoo is contained to insides of one device, APIC.

We should also take a step back to think what was the cause of lost
irqs, IIRC uneven execution rate in QEMU.


Not only that.  The pathological case is where a host is limited to a 
1khz timer frequency and the guest requests a 1khz timer frequency.  
Practically speaking, there is no way we'll ever be able to adjust 
timers to reinject lost interrupts because of the host timer limitation.



  Could this be fixed or taken
into account in timer handling? For example, CPU loop could analyze
the wall clock time between CPU exits and use that to offset the
timers. Thus the timer frequency (in wall clock time) could be made to
correspond a bit more to VCPU execution rate.
   


A lot of what motivates the timer reinjection work is very old linux 
kernels that had fixed userspace timer frequencies.  On newer host 
kernels, it's probably not nearly as important except when you get into 
pathological cases like exposing a high frequency HPET timer to the 
guest where you cannot keep up with the host.


Regards,

Anthony Liguori


Signed-off-by: Jan Kiszkajan.kis...@siemens.com
---
  hw/irq.c |   38 +-
  hw/irq.h |   22 +++---
  2 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/hw/irq.c b/hw/irq.c
index 7703f62..db2cce6 100644
--- a/hw/irq.c
+++ b/hw/irq.c
@@ -26,19 +26,27 @@

  struct IRQState {
 qemu_irq_handler handler;
+qemu_irq_fb_handler feedback_handler;
 void *opaque;
 int n;
  };

-void qemu_set_irq(qemu_irq irq, int level)
+int qemu_set_irq(qemu_irq irq, int level)
  {
-if (!irq)
-return;
-
-irq-handler(irq-opaque, irq-n, level);
+if (!irq) {
+return 0;
+}
+if (irq-feedback_handler) {
+return irq-feedback_handler(irq-opaque, irq-n, level);
+} else {
+irq-handler(irq-opaque, irq-n, level);
+return QEMU_IRQ_DELIVERED;
+}
  }

-qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
+static qemu_irq *allocate_irqs(qemu_irq_handler handler,
+   qemu_irq_fb_handler feedback_handler,
+   void *opaque, int n)
  {
 qemu_irq *s;
 struct IRQState *p;
@@ -48,6 +56,7 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void 
*opaque, int n)
 p = (struct IRQState *)qemu_mallocz(sizeof(struct IRQState) * n);
 for (i = 0; i  n; i++) {
 p-handler = handler;
+p-feedback_handler = feedback_handler;
 p-opaque = opaque;
 p-n = i;
 s[i] = p;
@@ -56,22 +65,33 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void 
*opaque, int n)
 return s;
  }

+qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
+{
+return allocate_irqs(handler, NULL, opaque, n);
+}
+
+qemu_irq *qemu_allocate_feedback_irqs(qemu_irq_fb_handler handler,
+  void *opaque, int n)
+{
+return allocate_irqs(NULL, handler, opaque, n);
+}
+
  void qemu_free_irqs(qemu_irq *s)
  {
 qemu_free(s[0]);
 qemu_free(s);
  }

-static void qemu_notirq(void *opaque, int line, int level)
+static int qemu_notirq(void *opaque, int line, int level)
  {
 struct IRQState *irq = opaque;

-irq-handler(irq-opaque, irq-n, !level);
+return qemu_set_irq(irq, !level);
  }

  qemu_irq qemu_irq_invert(qemu_irq irq)
  {
 /* The default state for IRQs is low, so raise the 

Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback

2010-05-25 Thread Jan Kiszka
Anthony Liguori wrote:
 On 05/25/2010 02:09 PM, Blue Swirl wrote:
 On Mon, May 24, 2010 at 8:13 PM, Jan Kiszkajan.kis...@web.de  wrote:
   
 From: Jan Kiszkajan.kis...@siemens.com

 This allows to communicate potential IRQ coalescing during delivery from
 the sink back to the source. Targets that support IRQ coalescing
 workarounds need to register handlers that return the appropriate
 QEMU_IRQ_* code, and they have to propergate the code across all IRQ
 redirections. If the IRQ source receives a QEMU_IRQ_COALESCED, it can
 apply its workaround. If multiple sinks exist, the source may only
 consider an IRQ coalesced if all other sinks either report
 QEMU_IRQ_COALESCED as well or QEMU_IRQ_MASKED.
  
 No real devices are interested whether any of their output lines are
 even connected. This would introduce a new signal type, bidirectional
 multi-level, which is not correct.

 
 I don't think it's really an issue of correct, but I wouldn't disagree
 to a suggestion that we ought to introduce a new signal type for this
 type of bidirectional feedback.  Maybe it's qemu_coalesced_irq and has a
 similar interface as qemu_irq.

A separate type would complicate the delivery of the feedback value
across GPIO pins (as Paul requested for the RTC-HPET routing).

 
 I think the real solution to coalescing is put the logic inside one
 device, in this case APIC because it has the information about irq
 delivery. APIC could monitor incoming RTC irqs for frequency
 information and whether they get delivered or not. If not, an internal
 timer is installed which injects the lost irqs.

That won't fly as the IRQs will already arrive at the APIC with a
sufficiently high jitter. At the bare minimum, you need to tell the
interrupt controller about the fact that a particular IRQ should be
delivered at a specific regular rate. For this, you also need a generic
interface - nothing really won.

Jan



signature.asc
Description: OpenPGP digital signature