Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Michael S. Tsirkin
On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
> Hi Michael & Alexander:
> Thanks a lot for your comments and suggestions.

It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html

> We still need to support Windows guests for migration, and this is why
> our patches keep all changes in the driver, since it's impossible to
> change the Windows kernel.

This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.

> The following is my idea for doing DMA tracking.
>
> Inject an event to the VF driver after the memory-iteration stage and
> before stopping the VCPUs, and then the VF driver marks all memory in
> use for DMA as dirty. Newly allocated pages also need to be marked
> dirty before the VCPUs stop. All memory dirtied in this window will be
> migrated during the stop-and-copy stage. We also need to make sure the
> VF is disabled, by clearing its bus master enable bit, before migrating
> this memory.
>
> DMA pages allocated by the VF driver also need to reserve space for a
> dummy write.

I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.
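
A rough sketch of the second option, hiding the logging inside the DMA
API so drivers need no changes and no buffer space is reserved. It is
shown as a wrapper for brevity; in practice the hook would sit inside
the architecture's dma_map_ops. migration_mark_dirty() is a
hypothetical helper, not an existing interface.

	#include <linux/dma-mapping.h>

	/* hypothetical: asks the hypervisor's dirty log to re-send
	 * the pages backing [vaddr, vaddr + size) */
	extern void migration_mark_dirty(void *vaddr, size_t size);

	static void unmap_single_and_log(struct device *dev,
					 void *cpu_addr,
					 dma_addr_t dma_addr, size_t size,
					 enum dma_data_direction dir)
	{
		dma_unmap_single(dev, dma_addr, size, dir);

		/*
		 * The device may have written to this buffer at any
		 * point before the unmap, so log it here; the driver
		 * itself never has to know.
		 */
		if (dir != DMA_TO_DEVICE)
			migration_mark_dirty(cpu_addr, size);
	}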

-- 
MST


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu



On 12/8/2015 1:12 AM, Alexander Duyck wrote:

On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:

On 12/5/2015 1:07 AM, Alexander Duyck wrote:



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.



That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.



This depends entirely on the NIC or other device vendors, and they
should decide whether to support migration or not. If yes, they would
modify their drivers.


Having to modify every driver that wants to support live migration is
a bit much.  In addition I don't see this being limited only to NIC
devices.  You can direct assign a number of different devices, your
solution cannot be specific to NICs.


We are also adding such migration support for the QAT device, so our
solution will not be limited to NICs. This is just the beginning.

We can't limit users to Linux guests only, so the migration feature
should work for both Windows and Linux guests.




If the target is just to call suspend/resume during migration, the
feature will be meaningless. In most cases we don't want migration to
disturb the user much, so the service downtime is vital. Our target is
to apply SR-IOV NIC passthrough to cloud service and NFV (network
functions virtualization) projects, which are sensitive to network
performance and stability. In my opinion, we should give the device
driver a chance to implement its own migration job, and only call the
suspend and resume callbacks in the driver if it doesn't care about
performance during migration.


The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep.  Having it burn cycles in
a power state limbo doesn't do anyone any good.  If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.


If we can benefit both migration and suspend, that would be wonderful.
But migration and system PM are still different. For example, the
driver doesn't need to put the device into a deep D-state during
migration (the host can do this after migration), while that is
essential for system sleep. PCI configuration space and interrupt
configuration are emulated by QEMU, and QEMU can migrate them to the
new machine, so the driver doesn't need to deal with them. So I think
migration still needs a different callback or a different code path
than device suspend/resume.

Another concern is that we would have to rework the PM core or the PCI
bus driver to call suspend/resume for passthrough devices during
migration. This also blocks the feature from working on Windows.
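
To make the preceding argument concrete, here is a rough sketch of the
kind of migration-specific hooks being argued for, distinct from
suspend/resume. Nothing like this exists in the kernel today; the
names are made up.

	#include <linux/pci.h>

	/* hypothetical interface, for illustration only */
	struct pci_migration_ops {
		/*
		 * Called in the guest after the memory-iteration stage
		 * and before the VCPUs stop: quiesce DMA and mark the
		 * pages currently used for DMA as dirty.
		 */
		int (*pre_stop)(struct pci_dev *pdev);
		/*
		 * Called on the destination once the VF has been
		 * re-assigned: restore device state and resume traffic
		 * quickly.  Config space and interrupt state are
		 * assumed to be handled by QEMU.
		 */
		int (*post_start)(struct pci_dev *pdev);
	};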



Also you keep assuming you can keep the device running while you do
the migration and you can't.  You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.



The main problem is the DMA tracking issue. I will repost my solution
in a new thread for discussion. If there is no way to mark DMA pages
dirty while DMA is enabled, we have to stop DMA for a short time to do
that at the last stage.
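
A rough sketch of that fallback, with hypothetical helpers (the
dma_region bookkeeping and migration_mark_dirty() are made-up names):

	#include <linux/pci.h>

	struct dma_region {	/* hypothetical per-mapping bookkeeping */
		void	*vaddr;
		size_t	 size;
	};

	extern void migration_mark_dirty(void *vaddr, size_t size);

	static void vf_quiesce_and_log(struct pci_dev *pdev,
				       struct dma_region *regions, int nr)
	{
		int i;

		/* stop further DMA writes before logging anything */
		pci_clear_master(pdev);

		/* now dirty every page that was mapped for DMA so the
		 * final copy phase picks it up */
		for (i = 0; i < nr; i++)
			migration_mark_dirty(regions[i].vaddr,
					     regions[i].size);
	}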








Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.



The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
   What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.



This is why I hope to reserve a piece of space in the dma page to do dummy
write. This can help to mark page dirty while not require to stop DMA and
not race with DMA data.


You can't and it will still race.  What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work.  There is
a reason why device drivers have so many memory barriers and the like
in them.  The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.

If you have a contiguous block of memory you expect the device to
write into you cannot just poke a hole in it.  Such a situation is not
supported by any hardware that I am aware of.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:

On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.


It's nice that it's appreciated, but you then go on and ignore
all that I have written here:
https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html



No, I will reply to it separately and, following your suggestion, split
it into 3 threads.



We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


This is not a reasonable argument.  It makes no sense to duplicate code
on Linux because you must duplicate code on Windows.  Let's assume you
must do it in the driver on windows because windows has closed source
drivers.  What does it matter? Linux can still do it as part of DMA API
and have it apply to all drivers.



Sure. Duplicated code should be encapsulated and made reusable by other
drivers, just like the dummy-write part you mentioned.

I meant that the framework should not require changes to Windows kernel
code (such as the PM core or the PCI bus driver), since that would block
the implementation on Windows.

I think it's not a problem to duplicate code in the Windows drivers.


Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.

The dma page allocated by VF driver also needs to reserve space
to do dummy write.


I suggested ways to do it all in the hypervisor without driver hacks, or
hide it within DMA API without need to reserve extra space. Both
approaches seem much cleaner.



This sounds reasonable. We can discuss it in detail in a separate thread.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Lan, Tianyu

On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote:

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu mailing list, not on lkml.


I am not sure whether we can write a bus driver for Windows to support
migration. But I think device vendors who want to support migration will
improve their drivers if we provide such a framework in the hypervisor,
one that only requires them to change their drivers.


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Michael S. Tsirkin
On Wed, Dec 09, 2015 at 07:19:15PM +0800, Lan, Tianyu wrote:
> On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote:
> >On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote:
> >>Hi Michael & Alexander:
> >>Thanks a lot for your comments and suggestions.
> >
> >It's nice that it's appreciated, but you then go on and ignore
> >all that I have written here:
> >https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html
> >
> 
> No, I will reply it separately and according your suggestion to snip it into
> 3 thread.
> 
> >>We still need to support Windows guest for migration and this is why our
> >>patches keep all changes in the driver since it's impossible to change
> >>Windows kernel.
> >
> >This is not a reasonable argument.  It makes no sense to duplicate code
> >on Linux because you must duplicate code on Windows.  Let's assume you
> >must do it in the driver on windows because windows has closed source
> >drivers.  What does it matter? Linux can still do it as part of DMA API
> >and have it apply to all drivers.
> >
> 
> Sure. Duplicated code should be encapsulated and make it able to reuse
> by other drivers. Just like you said the dummy write part.
> 
> I meant the framework should not require to change Windows kernel code
> (such as PM core or PCI bus driver)and this will block implementation on
> the Windows.

I remember reading that it's possible to implement a bus driver
on windows if required.  But basically I don't see how windows can be
relevant to discussing guest driver patches. That discussion
probably belongs on the qemu mailing list, not on lkml.

> I think it's not problem to duplicate code in the Windows drivers.
> 
> >>Following is my idea to do DMA tracking.
> >>
> >>Inject event to VF driver after memory iterate stage
> >>and before stop VCPU and then VF driver marks dirty all
> >>using DMA memory. The new allocated pages also need to
> >>be marked dirty before stopping VCPU. All dirty memory
> >>in this time slot will be migrated until stop-and-copy
> >>stage. We also need to make sure to disable VF via clearing the
> >>bus master enable bit for VF before migrating these memory.
> >>
> >>The dma page allocated by VF driver also needs to reserve space
> >>to do dummy write.
> >
> >I suggested ways to do it all in the hypervisor without driver hacks, or
> >hide it within DMA API without need to reserve extra space. Both
> >approaches seem much cleaner.
> >
> 
> This sounds reasonable. We may discuss it detail in the separate thread.


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-09 Thread Alexander Duyck
On Wed, Dec 9, 2015 at 1:28 AM, Lan, Tianyu  wrote:
>
>
> On 12/8/2015 1:12 AM, Alexander Duyck wrote:
>>
>> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:
>>>
>>> On 12/5/2015 1:07 AM, Alexander Duyck wrote:
>
>
>
> We still need to support Windows guest for migration and this is why
> our
> patches keep all changes in the driver since it's impossible to change
> Windows kernel.



 That is a poor argument.  I highly doubt Microsoft is interested in
 having to modify all of the drivers that will support direct assignment
 in order to support migration.  They would likely request something
 similar to what I have in that they will want a way to do DMA tracking
 with minimal modification required to the drivers.
>>>
>>>
>>>
>>> This totally depends on the NIC or other devices' vendors and they
>>> should make decision to support migration or not. If yes, they would
>>> modify driver.
>>
>>
>> Having to modify every driver that wants to support live migration is
>> a bit much.  In addition I don't see this being limited only to NIC
>> devices.  You can direct assign a number of different devices, your
>> solution cannot be specific to NICs.
>
>
> We are also adding such migration support for QAT device and so our
> solution will not just be limit to NIC. Now just is the beginning.

Agreed, but still QAT is networking related.  My advice would be to
look at something else that works from within a different subsystem
such as storage.  All I am saying is that your solution is very
networking centric.

> We can't limit user to only use Linux guest. So the migration feature
> should work for both Windows and Linux guest.

Right now what your solution is doing is to limit things so that only
the Intel NICs can support this since it will require driver
modification across the board.  Instead what I have proposed should
make it so that once you have done the work there should be very
little work that has to be done on your port to support any device.

>>
>>> If just target to call suspend/resume during migration, the feature will
>>> be meaningless. Most cases don't want to affect user during migration
>>> a lot and so the service down time is vital. Our target is to apply
>>> SRIOV NIC passthough to cloud service and NFV(network functions
>>> virtualization) projects which are sensitive to network performance
>>> and stability. From my opinion, We should give a change for device
>>> driver to implement itself migration job. Call suspend and resume
>>> callback in the driver if it doesn't care the performance during
>>> migration.
>>
>>
>> The suspend/resume callback should be efficient in terms of time.
>> After all we don't want the system to stall for a long period of time
>> when it should be either running or asleep.  Having it burn cycles in
>> a power state limbo doesn't do anyone any good.  If nothing else maybe
>> it will help to push the vendors to speed up those functions which
>> then benefit migration and the system sleep states.
>
>
> If we can benefit both migration and suspend, that would be wonderful.
> But migration and system pm is still different. Just for example,
> driver doesn't need to put device into deep D-status during migration
> and host can do this after migration while it's essential for
> system sleep. PCI configure space and interrupt config is emulated by
> Qemu and Qemu can migrate these configures to new machine. Driver
> doesn't need to deal with such thing. So I think migration still needs a
> different callback or different code path than device suspend/resume.

SR-IOV devices are considered to be in D3 as soon as you clear the bus
master enable bit.  They don't actually have a PCIe power management
block in their configuration space.  The advantage of the
suspend/resume approach is that the D0->D3->D0 series of transitions
should trigger a PCIe reset on the device.  As such the resume call is
capable of fully reinitializing a device.

As far as migrating the interrupts themselves moving live interrupts
is problematic.  You are more likely to throw them out of sync since
the state of the device will not match the state of what you migrated
for things like the pending bit array so if there is a device that
actually depending on those bits you might run into issues.

> Another concern is that we have to rework PM core or PCI bus driver
> to call suspend/resume for passthrough devices during migration. This
> also blocks new feature works on the Windows.

If I am not mistaken the Windows drivers have a similar feature that
is called when you disable or enable an interface.  I believe the
motivation for using D3 when a device has been disabled is to save
power on the system since in D3 the device should be in its lowest
power state.

>>
>> Also you keep assuming you can keep the device running while you do
>> the migration and you can't.  You are going to corrupt the memory if
>> you do, and you have yet to provide any means to explain how you are
>> going to solve that.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Lan, Tianyu

On 12/5/2015 1:07 AM, Alexander Duyck wrote:


We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


That is a poor argument.  I highly doubt Microsoft is interested in
having to modify all of the drivers that will support direct assignment
in order to support migration.  They would likely request something
similar to what I have in that they will want a way to do DMA tracking
with minimal modification required to the drivers.


This depends entirely on the NIC or other device vendors, and they
should decide whether to support migration or not. If yes, they would
modify their drivers.

If the target is just to call suspend/resume during migration, the
feature will be meaningless. In most cases we don't want migration to
disturb the user much, so the service downtime is vital. Our target is
to apply SR-IOV NIC passthrough to cloud service and NFV (network
functions virtualization) projects, which are sensitive to network
performance and stability. In my opinion, we should give the device
driver a chance to implement its own migration job, and only call the
suspend and resume callbacks in the driver if it doesn't care about
performance during migration.




The following is my idea for doing DMA tracking.

Inject an event to the VF driver after the memory-iteration stage and
before stopping the VCPUs, and then the VF driver marks all memory in
use for DMA as dirty. Newly allocated pages also need to be marked
dirty before the VCPUs stop. All memory dirtied in this window will be
migrated during the stop-and-copy stage. We also need to make sure the
VF is disabled, by clearing its bus master enable bit, before migrating
this memory.


The ordering of your explanation here doesn't quite work.  What needs to
happen is that you have to disable DMA and then mark the pages as dirty.
  What the disabling of the BME does is signal to the hypervisor that
the device is now stopped.  The ixgbevf_suspend call already supported
by the driver is almost exactly what is needed to take care of something
like this.


This is why I hope to reserve a piece of space in the DMA page for a
dummy write. This can help mark the page dirty without requiring DMA to
stop and without racing with the DMA data.
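
Roughly what that would look like for a fixed-size receive buffer
(illustrative only, with made-up names; the replies below explain why a
fixed reserved area does not generalize):

	#include <linux/types.h>

	struct rx_dma_buf {
		u8	data[2048];	/* region the VF may DMA into */
		u32	scratch;	/* reserved for the dummy write,
					 * never written by the device */
	};

	static inline void rx_buf_dummy_write(struct rx_dma_buf *buf)
	{
		/* volatile store so the compiler cannot elide it; the
		 * data area is never touched by the CPU while DMA may
		 * still be in flight */
		*(volatile u32 *)&buf->scratch = 0;
	}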


If we can't do that, we have to stop DMA for a short time to mark all
DMA pages dirty and then re-enable it. I am not sure how much we gain
by tracking all DMA memory this way with the device running during
migration. I need to do some tests and compare the results against
stopping DMA directly at the last stage of migration.




The question is how we would go about triggering it.  I really don't
think the PCI configuration space approach is the right idea.  I wonder
if we couldn't get away with some sort of ACPI event instead.  We
already require ACPI support in order to shut down the system
gracefully, I wonder if we couldn't get away with something similar in
order to suspend/resume the direct assigned devices gracefully.



I don't think there are such events in the current spec.
Also, there are two kinds of suspend/resume callbacks:
1) System suspend/resume, called during S2RAM and S2DISK.
2) Runtime suspend/resume, called by the PM core when the device is idle.
If you want to do what you mentioned, you have to change the PM core and
the ACPI spec.


DMA pages allocated by the VF driver also need to reserve space for a
dummy write.


No, this will not work.  If for example you have a VF driver allocating
memory for a 9K receive how will that work?  It isn't as if you can poke
a hole in the contiguous memory.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Michael S. Tsirkin
On Mon, Dec 07, 2015 at 09:12:08AM -0800, Alexander Duyck wrote:
> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:
> > On 12/5/2015 1:07 AM, Alexander Duyck wrote:
> >>>
> >>>
> >>> We still need to support Windows guest for migration and this is why our
> >>> patches keep all changes in the driver since it's impossible to change
> >>> Windows kernel.
> >>
> >>
> >> That is a poor argument.  I highly doubt Microsoft is interested in
> >> having to modify all of the drivers that will support direct assignment
> >> in order to support migration.  They would likely request something
> >> similar to what I have in that they will want a way to do DMA tracking
> >> with minimal modification required to the drivers.
> >
> >
> > This totally depends on the NIC or other devices' vendors and they
> > should make decision to support migration or not. If yes, they would
> > modify driver.
> 
> Having to modify every driver that wants to support live migration is
> a bit much.  In addition I don't see this being limited only to NIC
> devices.  You can direct assign a number of different devices, your
> solution cannot be specific to NICs.
> 
> > If just target to call suspend/resume during migration, the feature will
> > be meaningless. Most cases don't want to affect user during migration
> > a lot and so the service down time is vital. Our target is to apply
> > SRIOV NIC passthough to cloud service and NFV(network functions
> > virtualization) projects which are sensitive to network performance
> > and stability. From my opinion, We should give a change for device
> > driver to implement itself migration job. Call suspend and resume
> > callback in the driver if it doesn't care the performance during migration.
> 
> The suspend/resume callback should be efficient in terms of time.
> After all we don't want the system to stall for a long period of time
> when it should be either running or asleep.  Having it burn cycles in
> a power state limbo doesn't do anyone any good.  If nothing else maybe
> it will help to push the vendors to speed up those functions which
> then benefit migration and the system sleep states.
> 
> Also you keep assuming you can keep the device running while you do
> the migration and you can't.  You are going to corrupt the memory if
> you do, and you have yet to provide any means to explain how you are
> going to solve that.
> 
> 
> >
> >>
> >>> Following is my idea to do DMA tracking.
> >>>
> >>> Inject event to VF driver after memory iterate stage
> >>> and before stop VCPU and then VF driver marks dirty all
> >>> using DMA memory. The new allocated pages also need to
> >>> be marked dirty before stopping VCPU. All dirty memory
> >>> in this time slot will be migrated until stop-and-copy
> >>> stage. We also need to make sure to disable VF via clearing the
> >>> bus master enable bit for VF before migrating these memory.
> >>
> >>
> >> The ordering of your explanation here doesn't quite work.  What needs to
> >> happen is that you have to disable DMA and then mark the pages as dirty.
> >>   What the disabling of the BME does is signal to the hypervisor that
> >> the device is now stopped.  The ixgbevf_suspend call already supported
> >> by the driver is almost exactly what is needed to take care of something
> >> like this.
> >
> >
> > This is why I hope to reserve a piece of space in the dma page to do dummy
> > write. This can help to mark page dirty while not require to stop DMA and
> > not race with DMA data.
> 
> You can't and it will still race.  What concerns me is that your
> patches and the document you referenced earlier show a considerable
> lack of understanding about how DMA and device drivers work.  There is
> a reason why device drivers have so many memory barriers and the like
> in them.  The fact is when you have CPU and a device both accessing
> memory things have to be done in a very specific order and you cannot
> violate that.
> 
> If you have a contiguous block of memory you expect the device to
> write into you cannot just poke a hole in it.  Such a situation is not
> supported by any hardware that I am aware of.
> 
> As far as writing to dirty the pages it only works so long as you halt
> the DMA and then mark the pages dirty.  It has to be in that order.
> Any other order will result in data corruption and I am sure the NFV
> customers definitely don't want that.
> 
> > If can't do that, we have to stop DMA in a short time to mark all dma
> > pages dirty and then reenable it. I am not sure how much we can get by
> > this way to track all DMA memory with device running during migration. I
> > need to do some tests and compare results with stop DMA directly at last
> > stage during migration.
> 
> We have to halt the DMA before we can complete the migration.  So
> please feel free to test this.
> 
> In addition I still feel you would be better off taking this in
> smaller steps.  I still say your first step would be to come up with a
> generic solution for the dirty page tracking like the dma_mark_clean()
> approach I had mentioned earlier.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Alexander Duyck
On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:
> On 12/5/2015 1:07 AM, Alexander Duyck wrote:
>>>
>>>
>>> We still need to support Windows guest for migration and this is why our
>>> patches keep all changes in the driver since it's impossible to change
>>> Windows kernel.
>>
>>
>> That is a poor argument.  I highly doubt Microsoft is interested in
>> having to modify all of the drivers that will support direct assignment
>> in order to support migration.  They would likely request something
>> similar to what I have in that they will want a way to do DMA tracking
>> with minimal modification required to the drivers.
>
>
> This totally depends on the NIC or other devices' vendors and they
> should make decision to support migration or not. If yes, they would
> modify driver.

Having to modify every driver that wants to support live migration is
a bit much.  In addition I don't see this being limited only to NIC
devices.  You can direct assign a number of different devices, your
solution cannot be specific to NICs.

> If just target to call suspend/resume during migration, the feature will
> be meaningless. Most cases don't want to affect user during migration
> a lot and so the service down time is vital. Our target is to apply
> SRIOV NIC passthough to cloud service and NFV(network functions
> virtualization) projects which are sensitive to network performance
> and stability. From my opinion, We should give a change for device
> driver to implement itself migration job. Call suspend and resume
> callback in the driver if it doesn't care the performance during migration.

The suspend/resume callback should be efficient in terms of time.
After all we don't want the system to stall for a long period of time
when it should be either running or asleep.  Having it burn cycles in
a power state limbo doesn't do anyone any good.  If nothing else maybe
it will help to push the vendors to speed up those functions which
then benefit migration and the system sleep states.

Also you keep assuming you can keep the device running while you do
the migration and you can't.  You are going to corrupt the memory if
you do, and you have yet to provide any means to explain how you are
going to solve that.


>
>>
>>> Following is my idea to do DMA tracking.
>>>
>>> Inject event to VF driver after memory iterate stage
>>> and before stop VCPU and then VF driver marks dirty all
>>> using DMA memory. The new allocated pages also need to
>>> be marked dirty before stopping VCPU. All dirty memory
>>> in this time slot will be migrated until stop-and-copy
>>> stage. We also need to make sure to disable VF via clearing the
>>> bus master enable bit for VF before migrating these memory.
>>
>>
>> The ordering of your explanation here doesn't quite work.  What needs to
>> happen is that you have to disable DMA and then mark the pages as dirty.
>>   What the disabling of the BME does is signal to the hypervisor that
>> the device is now stopped.  The ixgbevf_suspend call already supported
>> by the driver is almost exactly what is needed to take care of something
>> like this.
>
>
> This is why I hope to reserve a piece of space in the dma page to do dummy
> write. This can help to mark page dirty while not require to stop DMA and
> not race with DMA data.

You can't and it will still race.  What concerns me is that your
patches and the document you referenced earlier show a considerable
lack of understanding about how DMA and device drivers work.  There is
a reason why device drivers have so many memory barriers and the like
in them.  The fact is when you have CPU and a device both accessing
memory things have to be done in a very specific order and you cannot
violate that.

If you have a contiguous block of memory you expect the device to
write into you cannot just poke a hole in it.  Such a situation is not
supported by any hardware that I am aware of.

As far as writing to dirty the pages it only works so long as you halt
the DMA and then mark the pages dirty.  It has to be in that order.
Any other order will result in data corruption and I am sure the NFV
customers definitely don't want that.

> If can't do that, we have to stop DMA in a short time to mark all dma
> pages dirty and then reenable it. I am not sure how much we can get by
> this way to track all DMA memory with device running during migration. I
> need to do some tests and compare results with stop DMA directly at last
> stage during migration.

We have to halt the DMA before we can complete the migration.  So
please feel free to test this.

In addition I still feel you would be better off taking this in
smaller steps.  I still say your first step would be to come up with a
generic solution for the dirty page tracking like the dma_mark_clean()
approach I had mentioned earlier.  If I get time I might try to take
care of it myself later this week since you don't seem to agree with
that approach.

>>
>> The question is how we would go about triggering it.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-07 Thread Alexander Duyck
On Mon, Dec 7, 2015 at 9:39 AM, Michael S. Tsirkin  wrote:
> On Mon, Dec 07, 2015 at 09:12:08AM -0800, Alexander Duyck wrote:
>> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu  wrote:
>> > On 12/5/2015 1:07 AM, Alexander Duyck wrote:

>> > If can't do that, we have to stop DMA in a short time to mark all dma
>> > pages dirty and then reenable it. I am not sure how much we can get by
>> > this way to track all DMA memory with device running during migration. I
>> > need to do some tests and compare results with stop DMA directly at last
>> > stage during migration.
>>
>> We have to halt the DMA before we can complete the migration.  So
>> please feel free to test this.
>>
>> In addition I still feel you would be better off taking this in
>> smaller steps.  I still say your first step would be to come up with a
>> generic solution for the dirty page tracking like the dma_mark_clean()
>> approach I had mentioned earlier.  If I get time I might try to take
>> care of it myself later this week since you don't seem to agree with
>> that approach.
>
> Or even try to look at the dirty bit in the VT-D PTEs
> on the host. See the mail I have just sent.
> Might be slower, or might be faster, but is completely
> transparent.

I just saw it and I am looking over the VTd spec now.  It looks like
there might be some performance impacts if software is changing the
PTEs since then the VTd hardware cannot cache them.  I still have to
do some more reading though so I can fully understand the impacts.

>> >>
>> >> The question is how we would go about triggering it.  I really don't
>> >> think the PCI configuration space approach is the right idea.
>> >>  I wonder
>> >> if we couldn't get away with some sort of ACPI event instead.  We
>> >> already require ACPI support in order to shut down the system
>> >> gracefully, I wonder if we couldn't get away with something similar in
>> >> order to suspend/resume the direct assigned devices gracefully.
>> >>
>> >
>> > I don't think there is such events in the current spec.
>> > Otherwise, There are two kinds of suspend/resume callbacks.
>> > 1) System suspend/resume called during S2RAM and S2DISK.
>> > 2) Runtime suspend/resume called by pm core when device is idle.
>> > If you want to do what you mentioned, you have to change PM core and
>> > ACPI spec.
>>
>> The thought I had was to somehow try to move the direct assigned
>> devices into their own power domain and then simulate a AC power event
>> where that domain is switched off.  However I don't know if there are
>> ACPI events to support that since the power domain code currently only
>> appears to be in use for runtime power management.
>>
>> That had also given me the thought to look at something like runtime
>> power management for the VFs.  We would need to do a runtime
>> suspend/resume.  The only problem is I don't know if there is any way
>> to get the VFs to do a quick wakeup.  It might be worthwhile looking
>> at trying to check with the ACPI experts out there to see if there is
>> anything we can do as bypassing having to use the configuration space
>> mechanism to signal this would definitely be worth it.
>
> I don't much like this idea because it relies on the
> device being exactly the same across source/destination.
> After all, this is always true for suspend/resume.
> Most users do not have control over this, and you would
> often get sightly different versions of firmware,
> etc without noticing.

The original code was operating on that assumption as well.  That is
kind of why I suggested suspend/resume rather than reinventing the
wheel.

> I think we should first see how far along we can get
> by doing a full device reset, and only carrying over
> high level state such as IP, MAC, ARP cache etc.

One advantage of the suspend/resume approach is that it is compatible
with a full reset.  The suspend/resume approach assumes the device
goes through a D0->D3->D0 reset as a part of transitioning between the
system states.

I do admit though that the PCI spec says you aren't supposed to be
hot-swapping devices while the system is in a sleep state so odds are
you would encounter issues if the device changed in any significant
way.

>> >>> The dma page allocated by VF driver also needs to reserve space
>> >>> to do dummy write.
>> >>
>> >>
>> >> No, this will not work.  If for example you have a VF driver allocating
>> >> memory for a 9K receive how will that work?  It isn't as if you can poke
>> >> a hole in the contiguous memory.
>>
>> This is the bit that makes your "poke a hole" solution not portable to
>> other drivers.  I don't know if you overlooked it but for many NICs
>> jumbo frames means using large memory allocations to receive the data.
>> That is the way ixgbevf was up until about a year ago so you cannot
>> expect all the drivers that will want migration support to allow a
>> space for you to write to.  In addition some storage drivers have to
>> map an entire page, 

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-04 Thread Alexander Duyck

On 12/04/2015 08:32 AM, Lan, Tianyu wrote:

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.

We still need to support Windows guest for migration and this is why our
patches keep all changes in the driver since it's impossible to change
Windows kernel.


That is a poor argument.  I highly doubt Microsoft is interested in 
having to modify all of the drivers that will support direct assignment 
in order to support migration.  They would likely request something 
similar to what I have in that they will want a way to do DMA tracking 
with minimal modification required to the drivers.



Following is my idea to do DMA tracking.

Inject event to VF driver after memory iterate stage
and before stop VCPU and then VF driver marks dirty all
using DMA memory. The new allocated pages also need to
be marked dirty before stopping VCPU. All dirty memory
in this time slot will be migrated until stop-and-copy
stage. We also need to make sure to disable VF via clearing the
bus master enable bit for VF before migrating these memory.


The ordering of your explanation here doesn't quite work.  What needs to 
happen is that you have to disable DMA and then mark the pages as dirty. 
 What the disabling of the BME does is signal to the hypervisor that 
the device is now stopped.  The ixgbevf_suspend call already supported 
by the driver is almost exactly what is needed to take care of something 
like this.


The question is how we would go about triggering it.  I really don't 
think the PCI configuration space approach is the right idea.  I wonder 
if we couldn't get away with some sort of ACPI event instead.  We 
already require ACPI support in order to shut down the system 
gracefully, I wonder if we couldn't get away with something similar in 
order to suspend/resume the direct assigned devices gracefully.



The dma page allocated by VF driver also needs to reserve space
to do dummy write.


No, this will not work.  If for example you have a VF driver allocating 
memory for a 9K receive how will that work?  It isn't as if you can poke 
a hole in the contiguous memory.




Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-04 Thread Lan, Tianyu

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.

We still need to support Windows guests for migration, and this is why
our patches keep all changes in the driver, since it's impossible to
change the Windows kernel.


The following is my idea for doing DMA tracking.

Inject an event to the VF driver after the memory-iteration stage and
before stopping the VCPUs, and then the VF driver marks all memory in
use for DMA as dirty. Newly allocated pages also need to be marked
dirty before the VCPUs stop. All memory dirtied in this window will be
migrated during the stop-and-copy stage. We also need to make sure the
VF is disabled, by clearing its bus master enable bit, before migrating
this memory.

DMA pages allocated by the VF driver also need to reserve space for a
dummy write.


On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote:

On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin  wrote:

On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:

On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:



There are several components to this:
- dma_map_* needs to prevent page from
   being migrated while device is running.
   For example, expose some kind of bitmap from guest
   to host, set bit there while page is mapped.
   What happens if we stop the guest and some
   bits are still set? See dma_alloc_coherent below
   for some ideas.


Yeah, I could see something like this working.  Maybe we could do
something like what was done for the NX bit and make use of the upper
order bits beyond the limits of the memory range to mark pages as
non-migratable?

I'm curious.  What we have with a DMA mapped region is essentially
shared memory between the guest and the device.  How would we resolve
something like this with IVSHMEM, or are we blocked there as well in
terms of migration?


I have some ideas. Will post later.


I look forward to it.


- dma_unmap_* needs to mark page as dirty
   This can be done by writing into a page.

- dma_sync_* needs to mark page as dirty
   This is trickier as we can not change the data.
   One solution is using atomics.
   For example:
 int x = ACCESS_ONCE(*p);
 cmpxchg(p, x, x);
   Seems to do a write without changing page
   contents.


Like I said we can probably kill 2 birds with one stone by just
implementing our own dma_mark_clean() for x86 virtualized
environments.

I'd say we could take your solution one step further and just use 0
instead of bothering to read the value.  After all it won't write the
area if the value at the offset is not 0.


Really almost any atomic that has no side effect will do.
atomic or with 0
atomic and with 

It's just that cmpxchg already happens to have a portable
wrapper.


I was originally thinking maybe an atomic_add with 0 would be the way
to go.


cmpxchg with any value too.


  Either way though we still are using a locked prefix and
having to dirty a cache line per page which is going to come at some
cost.


I agree. It's likely not necessary for everyone
to be doing this: only people that both
run within the VM and want migration to work
need to do this logging.

So set some module option to have driver tell hypervisor that it
supports logging.  If bus mastering is enabled before this, migration is
blocked.  Or even pass some flag from hypervisor so
driver can detect it needs to log writes.
I guess this could be put in device config somewhere,
though in practice it's a global thing, not a per device one, so
maybe we need some new channel to
pass this flag to guest. CPUID?
Or maybe we can put some kind of agent in the initrd
and use the existing guest agent channel after all.
agent in initrd could open up a lot of new possibilities.



- dma_alloc_coherent memory (e.g. device rings)
   must be migrated after device stopped modifying it.
   Just stopping the VCPU is not enough:
   you must make sure device is not changing it.

   Or maybe the device has some kind of ring flush operation,
   if there was a reasonably portable way to do this
   (e.g. a flush capability could maybe be added to SRIOV)
   then hypervisor could do this.


This is where things start to get messy. I was suggesting the
suspend/resume to resolve this bit, but it might be possible to also
deal with this via something like this via clearing the bus master
enable bit for the VF.  If I am not mistaken that should disable MSI-X
interrupts and halt any DMA.  That should work as long as you have
some mechanism that is tracking the pages in use for DMA.


A bigger issue is recovering afterwards.


Agreed.


   In case you need to resume on source, you
   really need to follow the same path
   as on destination, preferably detecting
   device reset and restoring the device
   state.


The problem with detecting the reset is that you would likely have to
be polling to do something like that.


We could send some event to the guest to notify it about this through a
new or existing channel.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-02 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote:
> On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin  wrote:
> > On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
> >> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:
> 
> >> > There are several components to this:
> >> > - dma_map_* needs to prevent page from
> >> >   being migrated while device is running.
> >> >   For example, expose some kind of bitmap from guest
> >> >   to host, set bit there while page is mapped.
> >> >   What happens if we stop the guest and some
> >> >   bits are still set? See dma_alloc_coherent below
> >> >   for some ideas.
> >>
> >> Yeah, I could see something like this working.  Maybe we could do
> >> something like what was done for the NX bit and make use of the upper
> >> order bits beyond the limits of the memory range to mark pages as
> >> non-migratable?
> >>
> >> I'm curious.  What we have with a DMA mapped region is essentially
> >> shared memory between the guest and the device.  How would we resolve
> >> something like this with IVSHMEM, or are we blocked there as well in
> >> terms of migration?
> >
> > I have some ideas. Will post later.
> 
> I look forward to it.
> 
> >> > - dma_unmap_* needs to mark page as dirty
> >> >   This can be done by writing into a page.
> >> >
> >> > - dma_sync_* needs to mark page as dirty
> >> >   This is trickier as we can not change the data.
> >> >   One solution is using atomics.
> >> >   For example:
> >> > int x = ACCESS_ONCE(*p);
> >> > cmpxchg(p, x, x);
> >> >   Seems to do a write without changing page
> >> >   contents.
> >>
> >> Like I said we can probably kill 2 birds with one stone by just
> >> implementing our own dma_mark_clean() for x86 virtualized
> >> environments.
> >>
> >> I'd say we could take your solution one step further and just use 0
> >> instead of bothering to read the value.  After all it won't write the
> >> area if the value at the offset is not 0.
> >
> > Really almost any atomic that has no side effect will do.
> > atomic or with 0
> > atomic and with 
> >
> > It's just that cmpxchg already happens to have a portable
> > wrapper.
> 
> I was originally thinking maybe an atomic_add with 0 would be the way
> to go.

cmpxchg with any value too.

>  Either way though we still are using a locked prefix and
> having to dirty a cache line per page which is going to come at some
> cost.

I agree. It's likely not necessary for everyone
to be doing this: only people that both
run within the VM and want migration to work
need to do this logging.

So set some module option to have driver tell hypervisor that it
supports logging.  If bus mastering is enabled before this, migration is
blocked.  Or even pass some flag from hypervisor so
driver can detect it needs to log writes.
I guess this could be put in device config somewhere,
though in practice it's a global thing, not a per device one, so
maybe we need some new channel to
pass this flag to guest. CPUID?
Or maybe we can put some kind of agent in the initrd
and use the existing guest agent channel after all.
agent in initrd could open up a lot of new possibilities.


> >> > - dma_alloc_coherent memory (e.g. device rings)
> >> >   must be migrated after device stopped modifying it.
> >> >   Just stopping the VCPU is not enough:
> >> >   you must make sure device is not changing it.
> >> >
> >> >   Or maybe the device has some kind of ring flush operation,
> >> >   if there was a reasonably portable way to do this
> >> >   (e.g. a flush capability could maybe be added to SRIOV)
> >> >   then hypervisor could do this.
> >>
> >> This is where things start to get messy. I was suggesting the
> >> suspend/resume to resolve this bit, but it might be possible to also
> >> deal with this via something like this via clearing the bus master
> >> enable bit for the VF.  If I am not mistaken that should disable MSI-X
> >> interrupts and halt any DMA.  That should work as long as you have
> >> some mechanism that is tracking the pages in use for DMA.
> >
> > A bigger issue is recovering afterwards.
> 
> Agreed.
> 
> >> >   In case you need to resume on source, you
> >> >   really need to follow the same path
> >> >   as on destination, preferably detecting
> >> >   device reset and restoring the device
> >> >   state.
> >>
> >> The problem with detecting the reset is that you would likely have to
> >> be polling to do something like that.
> >
> > We could send some event to the guest to notify it about this
> > through a new or existing channel.
> >
> > Or we could make it possible for userspace to trigger this,
> > then notify guest through the guest agent.
> 
> The first thing that comes to mind would be to use something like PCIe
> Advanced Error Reporting, however I don't know if we can put a
> requirement on the system supporting the q35 machine type or not in
> order to support migration.

You mean require pci express? This 

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Alexander Duyck
On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:
> On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote:
>>
>>
>> On 12/1/2015 12:07 AM, Alexander Duyck wrote:
>> >They can only be corrected if the underlying assumptions are correct
>> >and they aren't.  Your solution would have never worked correctly.
>> >The problem is you assume you can keep the device running when you are
>> >migrating and you simply cannot.  At some point you will always have
>> >to stop the device in order to complete the migration, and you cannot
>> >stop it before you have stopped your page tracking mechanism.  So
>> >unless the platform has an IOMMU that is somehow taking part in the
>> >dirty page tracking you will not be able to stop the guest and then
>> >the device, it will have to be the device and then the guest.
>> >
>> >>>Doing suspend and resume() may help to do migration easily but some
>> >>>devices requires low service down time. Especially network and I got
>> >>>that some cloud company promised less than 500ms network service downtime.
>> >Honestly focusing on the downtime is getting the cart ahead of the
>> >horse.  First you need to be able to do this without corrupting system
>> >memory and regardless of the state of the device.  You haven't even
>> >gotten to that state yet.  Last I knew the device had to be up in
>> >order for your migration to even work.
>>
>> I think the issue is that the content of an rx packet delivered to the stack
>> may be changed during migration because that piece of memory won't be migrated
>> to the new machine. This may confuse applications or the stack. The current
>> dummy-write solution can ensure the packet content won't change after the
>> dummy write, though the content may not be the received data if migration
>> happens before that point. We can recheck the content via a checksum or CRC in
>> the protocol after the dummy write to ensure the content is what the VF
>> received. I think the stack already does such checks and the packet will be
>> dropped if it fails them.
>
>
> Most people nowadays rely on hardware checksums so I don't think this can
> fly.

Correct.  The checksum/crc approach will not work since it is possible
for a checksum to even be mangled in the case of some features such as
LRO or GRO.

>> Another way is to tell QEMU about all the memory the driver is using and let
>> QEMU migrate that memory after stopping the VCPU and the device. This seems
>> safe but the implementation may be complex.
>
> Not really 100% safe.  See below.
>
> I think hiding these details behind dma_* API does have
> some appeal. In any case, it gives us a good
> terminology as it covers what most drivers do.

That was kind of my thought.  If we were to build our own
dma_mark_clean() type function that will mark the DMA region dirty on
sync or unmap then that is half the battle right there as we would be
able to at least keep the regions consistent after they have left the
driver.

> There are several components to this:
> - dma_map_* needs to prevent page from
>   being migrated while device is running.
>   For example, expose some kind of bitmap from guest
>   to host, set bit there while page is mapped.
>   What happens if we stop the guest and some
>   bits are still set? See dma_alloc_coherent below
>   for some ideas.

Yeah, I could see something like this working.  Maybe we could do
something like what was done for the NX bit and make use of the upper
order bits beyond the limits of the memory range to mark pages as
non-migratable?

I'm curious.  What we have with a DMA mapped region is essentially
shared memory between the guest and the device.  How would we resolve
something like this with IVSHMEM, or are we blocked there as well in
terms of migration?
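
A very rough sketch of the guest-side bookkeeping being discussed here;
the shared bitmap, who owns it, and how the host consults it are all
assumptions, not an existing interface:

	#include <linux/bitops.h>
	#include <linux/mm.h>

	/* guest-resident bitmap, one bit per guest pfn, assumed to be
	 * exposed to and consulted by the hypervisor */
	extern unsigned long *dma_pinned_bitmap;

	static void mark_page_dma_mapped(struct page *page)
	{
		/* while set, the page must not be migrated out from
		 * under the device */
		set_bit(page_to_pfn(page), dma_pinned_bitmap);
	}

	static void mark_page_dma_unmapped(struct page *page)
	{
		/* page can be migrated again (after being re-dirtied
		 * on the unmap path) */
		clear_bit(page_to_pfn(page), dma_pinned_bitmap);
	}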

> - dma_unmap_* needs to mark page as dirty
>   This can be done by writing into a page.
>
> - dma_sync_* needs to mark page as dirty
>   This is trickier as we can not change the data.
>   One solution is using atomics.
>   For example:
> int x = ACCESS_ONCE(*p);
> cmpxchg(p, x, x);
>   Seems to do a write without changing page
>   contents.

Like I said we can probably kill 2 birds with one stone by just
implementing our own dma_mark_clean() for x86 virtualized
environments.

I'd say we could take your solution one step further and just use 0
instead of bothering to read the value.  After all it won't write the
area if the value at the offset is not 0.  The only downside is that
this is a locked operation so we will take a pretty serious
performance penalty when this is active.  As such my preference would
be to hide the code behind some static key that we could then switch
on in the event of a VM being migrated.
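
A minimal sketch of such a dma_mark_clean()-style hook, combining the
cmpxchg trick quoted above with the "switch it on only when migrating"
idea (the static key is reduced to a plain flag here; all names are
hypothetical):

	#include <linux/kernel.h>
	#include <linux/mm.h>
	#include <linux/atomic.h>
	#include <linux/compiler.h>

	extern bool migration_logging;	/* stand-in for a static key */

	/* assumes vaddr is page-aligned */
	static void dma_mark_dirty(void *vaddr, size_t size)
	{
		unsigned long npages = DIV_ROUND_UP(size, PAGE_SIZE);
		unsigned long i;

		if (!migration_logging)
			return;

		for (i = 0; i < npages; i++) {
			unsigned long *p = vaddr + i * PAGE_SIZE;
			unsigned long old = ACCESS_ONCE(*p);

			/*
			 * Atomic no-op store: the page contents are
			 * unchanged, but the hypervisor's dirty log
			 * sees a write to the page.  The variant
			 * suggested above, cmpxchg(p, 0, 0), skips
			 * the read entirely.
			 */
			cmpxchg(p, old, old);
		}
	}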

> - dma_alloc_coherent memory (e.g. device rings)
>   must be migrated after device stopped modifying it.
>   Just stopping the VCPU is not enough:
>   you must make sure device is not changing it.
>
>   Or maybe the device has some kind of ring flush operation,
> if there was a reasonably portable way to do this
> (e.g. a flush capability could maybe be added to SRIOV)
> then hypervisor could do this.

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:
> > On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote:
> >>
> >>
> >> On 12/1/2015 12:07 AM, Alexander Duyck wrote:
> >> >They can only be corrected if the underlying assumptions are correct
> >> >and they aren't.  Your solution would have never worked correctly.
> >> >The problem is you assume you can keep the device running when you are
> >> >migrating and you simply cannot.  At some point you will always have
> >> >to stop the device in order to complete the migration, and you cannot
> >> >stop it before you have stopped your page tracking mechanism.  So
> >> >unless the platform has an IOMMU that is somehow taking part in the
> >> >dirty page tracking you will not be able to stop the guest and then
> >> >the device, it will have to be the device and then the guest.
> >> >
> >> >>>Doing suspend and resume() may help to do migration easily but some
> >> >>>devices requires low service down time. Especially network and I got
> >> >>>that some cloud company promised less than 500ms network service 
> >> >>>downtime.
> >> >Honestly focusing on the downtime is getting the cart ahead of the
> >> >horse.  First you need to be able to do this without corrupting system
> >> >memory and regardless of the state of the device.  You haven't even
> >> >gotten to that state yet.  Last I knew the device had to be up in
> >> >order for your migration to even work.
> >>
> >> I think the issue is that the content of rx package delivered to stack 
> >> maybe
> >> changed during migration because the piece of memory won't be migrated to
> >> new machine. This may confuse applications or stack. Current dummy write
> >> solution can ensure the content of package won't change after doing dummy
> >> write while the content maybe not received data if migration happens before
> >> that point. We can recheck the content via checksum or crc in the protocol
> >> after dummy write to ensure the content is what VF received. I think stack
> >> has already done such checks and the package will be abandoned if failed to
> >> pass through the check.
> >
> >
> > Most people nowadays rely on hardware checksums, so I don't think this
> > can fly.
> 
> Correct.  The checksum/crc approach will not work since it is possible
> for a checksum to even be mangled in the case of some features such as
> LRO or GRO.
> 
> >> Another way is to tell Qemu about all the memory the driver is using
> >> and let Qemu migrate that memory after stopping the VCPU and the
> >> device. This seems safe, but the implementation may be complex.
> >
> > Not really 100% safe.  See below.
> >
> > I think hiding these details behind dma_* API does have
> > some appeal. In any case, it gives us a good
> > terminology as it covers what most drivers do.
> 
> That was kind of my thought.  If we were to build our own
> dma_mark_clean() type function that will mark the DMA region dirty on
> sync or unmap then that is half the battle right there as we would be
> able to at least keep the regions consistent after they have left the
> driver.
> 
> > There are several components to this:
> > - dma_map_* needs to prevent page from
> >   being migrated while device is running.
> >   For example, expose some kind of bitmap from guest
> >   to host, set bit there while page is mapped.
> >   What happens if we stop the guest and some
> >   bits are still set? See dma_alloc_coherent below
> >   for some ideas.
> 
> Yeah, I could see something like this working.  Maybe we could do
> something like what was done for the NX bit and make use of the upper
> order bits beyond the limits of the memory range to mark pages as
> non-migratable?
> 
> I'm curious.  What we have with a DMA mapped region is essentially
> shared memory between the guest and the device.  How would we resolve
> something like this with IVSHMEM, or are we blocked there as well in
> terms of migration?

I have some ideas. Will post later.

> > - dma_unmap_* needs to mark page as dirty
> >   This can be done by writing into a page.
> >
> > - dma_sync_* needs to mark page as dirty
> >   This is trickier as we can not change the data.
> >   One solution is using atomics.
> >   For example:
> > int x = ACCESS_ONCE(*p);
> > cmpxchg(p, x, x);
> >   Seems to do a write without changing page
> >   contents.
> 
> Like I said we can probably kill 2 birds with one stone by just
> implementing our own dma_mark_clean() for x86 virtualized
> environments.
> 
> I'd say we could take your solution one step further and just use 0
> instead of bothering to read the value.  After all it won't write the
> area if the value at the offset is not 0.

Really almost any atomic that has no side effect will do.
atomic or with 0
atomic and with 

It's just that cmpxchg already happens to have a portable
wrapper.

> The only downside is that
> this is a locked operation so we will take a 

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Alexander Duyck
On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin  wrote:
> On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin  wrote:

>> > There are several components to this:
>> > - dma_map_* needs to prevent page from
>> >   being migrated while device is running.
>> >   For example, expose some kind of bitmap from guest
>> >   to host, set bit there while page is mapped.
>> >   What happens if we stop the guest and some
>> >   bits are still set? See dma_alloc_coherent below
>> >   for some ideas.
>>
>> Yeah, I could see something like this working.  Maybe we could do
>> something like what was done for the NX bit and make use of the upper
>> order bits beyond the limits of the memory range to mark pages as
>> non-migratable?
>>
>> I'm curious.  What we have with a DMA mapped region is essentially
>> shared memory between the guest and the device.  How would we resolve
>> something like this with IVSHMEM, or are we blocked there as well in
>> terms of migration?
>
> I have some ideas. Will post later.

I look forward to it.

>> > - dma_unmap_* needs to mark page as dirty
>> >   This can be done by writing into a page.
>> >
>> > - dma_sync_* needs to mark page as dirty
>> >   This is trickier as we can not change the data.
>> >   One solution is using atomics.
>> >   For example:
>> > int x = ACCESS_ONCE(*p);
>> > cmpxchg(p, x, x);
>> >   Seems to do a write without changing page
>> >   contents.
>>
>> Like I said we can probably kill 2 birds with one stone by just
>> implementing our own dma_mark_clean() for x86 virtualized
>> environments.
>>
>> I'd say we could take your solution one step further and just use 0
>> instead of bothering to read the value.  After all it won't write the
>> area if the value at the offset is not 0.
>
> Really almost any atomic that has no side effect will do.
> atomic or with 0
> atomic and with 
>
> It's just that cmpxchg already happens to have a portable
> wrapper.

I was originally thinking maybe an atomic_add with 0 would be the way
to go.  Either way though we still are using a locked prefix and
having to dirty a cache line per page which is going to come at some
cost.

>> > - dma_alloc_coherent memory (e.g. device rings)
>> >   must be migrated after device stopped modifying it.
>> >   Just stopping the VCPU is not enough:
>> >   you must make sure device is not changing it.
>> >
>> >   Or maybe the device has some kind of ring flush operation,
>> >   if there was a reasonably portable way to do this
>> >   (e.g. a flush capability could maybe be added to SRIOV)
>> >   then hypervisor could do this.
>>
>> This is where things start to get messy. I was suggesting the
>> suspend/resume to resolve this bit, but it might be possible to also
>> deal with this by clearing the bus master enable bit for the VF.  If I
>> am not mistaken that should disable MSI-X interrupts and halt any DMA.
>> That should work as long as you have some mechanism that is tracking
>> the pages in use for DMA.
>
> A bigger issue is recovering afterwards.

Agreed.
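
For reference, the quiescing half of that is a one-liner from the
kernel's point of view; the hard part this thread keeps circling is who
triggers it and how the device is brought back afterwards (sketch only):

#include <linux/pci.h>

/* Stop a VF from generating DMA (and, since MSI/MSI-X interrupts are
 * themselves upstream memory writes, from raising new interrupts)
 * before the final stop-and-copy pass. */
static void vf_quiesce_for_migration(struct pci_dev *vf)
{
        pci_clear_master(vf);
        /* from here on the device can no longer dirty guest memory;
         * recovering it on the destination (or on the source after a
         * failed migration) is the open problem discussed below */
}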

>> >   In case you need to resume on source, you
>> >   really need to follow the same path
>> >   as on destination, preferably detecting
>> >   device reset and restoring the device
>> >   state.
>>
>> The problem with detecting the reset is that you would likely have to
>> be polling to do something like that.
>
> We could send some event to the guest to notify it about this
> through a new or existing channel.
>
> Or we could make it possible for userspace to trigger this,
> then notify guest through the guest agent.

The first thing that comes to mind would be to use something like PCIe
Advanced Error Reporting, however I don't know if we can put a
requirement on the system supporting the q35 machine type or not in
order to support migration.
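
If AER were usable here, the natural guest-side hook would be the
existing pci_error_handlers callbacks; a VF driver that wants to survive
a reset needs roughly this shape anyway (the bodies below are
placeholders, not ixgbevf code):

#include <linux/pci.h>

static pci_ers_result_t vf_error_detected(struct pci_dev *pdev,
                                          pci_channel_state_t state)
{
        /* quiesce the driver: stop queues, disable interrupts, ... */
        return state == pci_channel_io_perm_failure ?
                PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t vf_slot_reset(struct pci_dev *pdev)
{
        /* the function has been reset; report whether we can recover */
        return PCI_ERS_RESULT_RECOVERED;
}

static void vf_resume(struct pci_dev *pdev)
{
        /* reprogram rings, re-enable interrupts, restart the queues */
}

static const struct pci_error_handlers vf_err_handlers = {
        .error_detected = vf_error_detected,
        .slot_reset     = vf_slot_reset,
        .resume         = vf_resume,
};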

>>  I believe the fm10k driver
>> already has code like that in place where it will detect a reset as a
>> part of its watchdog, however the response time is something like 2
>> seconds for that.  That was one of the reasons I preferred something
>> like hot-plug as that should be functioning as soon as the guest is up
>> and it is a mechanism that operates outside of the VF drivers.
>
> That's pretty minor.
> A bigger issue is making sure the guest does not crash
> when the device is suddenly reset out from under it.

I know the ixgbevf driver should already have logic to address some of
that.  If you look through the code there should be logic there for
surprise removal support in ixgbevf.  The only issue is that unlike
fm10k it will not restore itself after a resume or slot_reset call.
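
The detection itself is cheap (a removed or reset function returns all
ones on MMIO reads); the roughly 2 second figure comes from how often
the watchdog runs, not from the check. Sketch only, with a purely
illustrative register offset:

#include <linux/io.h>
#include <linux/types.h>

/* A watchdog can poll any harmless register: all ones means the VF has
 * been removed or reset underneath us. */
static bool vf_gone_or_reset(u8 __iomem *hw_addr)
{
        u32 v = readl(hw_addr + 0x0008);        /* illustrative offset */

        return v == 0xffffffff;
}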


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Michael S. Tsirkin
On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote:
> 
> 
> On 12/1/2015 12:07 AM, Alexander Duyck wrote:
> >They can only be corrected if the underlying assumptions are correct
> >and they aren't.  Your solution would have never worked correctly.
> >The problem is you assume you can keep the device running when you are
> >migrating and you simply cannot.  At some point you will always have
> >to stop the device in order to complete the migration, and you cannot
> >stop it before you have stopped your page tracking mechanism.  So
> >unless the platform has an IOMMU that is somehow taking part in the
> >dirty page tracking you will not be able to stop the guest and then
> >the device, it will have to be the device and then the guest.
> >
> >>>Doing suspend and resume() may make migration easier, but some
> >>>devices require low service downtime. This is especially true for
> >>>networking; I've heard that some cloud companies have promised less
> >>>than 500ms of network service downtime.
> >Honestly focusing on the downtime is getting the cart ahead of the
> >horse.  First you need to be able to do this without corrupting system
> >memory and regardless of the state of the device.  You haven't even
> >gotten to that state yet.  Last I knew the device had to be up in
> >order for your migration to even work.
> 
> I think the issue is that the content of an rx packet delivered to the
> stack may change during migration because that piece of memory won't be
> migrated to the new machine. This may confuse applications or the stack.
> The current dummy write solution can ensure the packet content won't
> change after the dummy write, although the content may not be the
> received data if migration happens before that point. We can recheck the
> content via a checksum or CRC in the protocol after the dummy write to
> ensure the content is what the VF received. I think the stack already
> performs such checks and the packet will be dropped if it fails them.


Most people nowadays rely on hardware checksums, so I don't think this
can fly.

> Another way is to tell Qemu about all the memory the driver is using
> and let Qemu migrate that memory after stopping the VCPU and the
> device. This seems safe, but the implementation may be complex.

Not really 100% safe.  See below.

I think hiding these details behind dma_* API does have
some appeal. In any case, it gives us a good
terminology as it covers what most drivers do.

There are several components to this:
- dma_map_* needs to prevent page from
  being migrated while device is running.
  For example, expose some kind of bitmap from guest
  to host, set bit there while page is mapped.
  What happens if we stop the guest and some
  bits are still set? See dma_alloc_coherent below
  for some ideas.


- dma_unmap_* needs to mark page as dirty
  This can be done by writing into a page.

- dma_sync_* needs to mark page as dirty
  This is trickier as we can not change the data.
  One solution is using atomics.
  For example:
int x = ACCESS_ONCE(*p);
cmpxchg(p, x, x);
  Seems to do a write without changing page
  contents.

- dma_alloc_coherent memory (e.g. device rings)
  must be migrated after device stopped modifying it.
  Just stopping the VCPU is not enough:
  you must make sure device is not changing it.

  Or maybe the device has some kind of ring flush operation,
  if there was a reasonably portable way to do this
  (e.g. a flush capability could maybe be added to SRIOV)
  then hypervisor could do this.

  With existing devices,
  either do it after device reset, or disable
  memory access in the IOMMU. Maybe both.

  In case you need to resume on source, you
  really need to follow the same path
  as on destination, preferably detecting
  device reset and restoring the device
  state.

  A similar approach could work for dma_map_ above.
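
To make the dma_map_* bullet above concrete, the guest side could be as
simple as the sketch below (the shared bitmap and both helpers are
hypothetical, and it assumes device addresses map 1:1 to guest pfns,
i.e. no vIOMMU in the guest):

#include <linux/bitops.h>
#include <linux/mm.h>
#include <linux/types.h>

/* Hypothetical page-granular bitmap shared with the hypervisor; a set
 * bit means "a device may DMA into this page, do not migrate it". */
extern unsigned long *dma_pin_bitmap;

static void guest_dma_pin(dma_addr_t addr, size_t size)   /* from dma_map_* */
{
        unsigned long pfn = addr >> PAGE_SHIFT;
        unsigned long last = (addr + size - 1) >> PAGE_SHIFT;

        for (; pfn <= last; pfn++)
                set_bit(pfn, dma_pin_bitmap);
}

static void guest_dma_unpin(dma_addr_t addr, size_t size) /* from dma_unmap_* */
{
        unsigned long pfn = addr >> PAGE_SHIFT;
        unsigned long last = (addr + size - 1) >> PAGE_SHIFT;

        /* this is also where the dummy write that dirties the page
         * would go, as described above */
        for (; pfn <= last; pfn++)
                clear_bit(pfn, dma_pin_bitmap);
}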

-- 
MST


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-12-01 Thread Lan, Tianyu



On 12/1/2015 12:07 AM, Alexander Duyck wrote:

They can only be corrected if the underlying assumptions are correct
and they aren't.  Your solution would have never worked correctly.
The problem is you assume you can keep the device running when you are
migrating and you simply cannot.  At some point you will always have
to stop the device in order to complete the migration, and you cannot
stop it before you have stopped your page tracking mechanism.  So
unless the platform has an IOMMU that is somehow taking part in the
dirty page tracking you will not be able to stop the guest and then
the device, it will have to be the device and then the guest.


>Doing suspend and resume() may make migration easier, but some
>devices require low service downtime. This is especially true for
>networking; I've heard that some cloud companies have promised less
>than 500ms of network service downtime.

Honestly focusing on the downtime is getting the cart ahead of the
horse.  First you need to be able to do this without corrupting system
memory and regardless of the state of the device.  You haven't even
gotten to that state yet.  Last I knew the device had to be up in
order for your migration to even work.


I think the issue is that the content of an rx packet delivered to the
stack may change during migration because that piece of memory won't be
migrated to the new machine. This may confuse applications or the stack.
The current dummy write solution can ensure the packet content won't
change after the dummy write, although the content may not be the
received data if migration happens before that point. We can recheck the
content via a checksum or CRC in the protocol after the dummy write to
ensure the content is what the VF received. I think the stack already
performs such checks and the packet will be dropped if it fails them.


Another way is to tell Qemu about all the memory the driver is using
and let Qemu migrate that memory after stopping the VCPU and the
device. This seems safe, but the implementation may be complex.



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-30 Thread Alexander Duyck
On Sun, Nov 29, 2015 at 10:53 PM, Lan, Tianyu  wrote:
> On 11/26/2015 11:56 AM, Alexander Duyck wrote:
>>
>> > I am not saying you cannot modify the drivers, however what you are
>> doing is far too invasive.  Do you seriously plan on modifying all of
>> the PCI device drivers out there in order to allow any device that
>> might be direct assigned to a port to support migration?  I certainly
>> hope not.  That is why I have said that this solution will not scale.
>
>
> Current drivers are not migration friendly. If a driver wants to
> support migration, it has to be changed.

Modifying all of the drivers directly will not solve the issue though.
This is why I have suggested looking at possibly implementing
something like dma_mark_clean() which is used for ia64 architectures
to mark pages that were DMAed in as clean.  In your case though you
would want to mark such pages as dirty so that the page migration will
notice them and move them over.

> RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and
> DMA tracking during migration. These are common to most drivers; they
> may be problematic in the previous version but can be corrected later.

They can only be corrected if the underlying assumptions are correct
and they aren't.  Your solution would have never worked correctly.
The problem is you assume you can keep the device running when you are
migrating and you simply cannot.  At some point you will always have
to stop the device in order to complete the migration, and you cannot
stop it before you have stopped your page tracking mechanism.  So
unless the platform has an IOMMU that is somehow taking part in the
dirty page tracking you will not be able to stop the guest and then
the device, it will have to be the device and then the guest.

> Doing suspend and resume() may make migration easier, but some
> devices require low service downtime. This is especially true for
> networking; I've heard that some cloud companies have promised less
> than 500ms of network service downtime.

Honestly focusing on the downtime is getting the cart ahead of the
horse.  First you need to be able to do this without corrupting system
memory and regardless of the state of the device.  You haven't even
gotten to that state yet.  Last I knew the device had to be up in
order for your migration to even work.

Many devices are very state driven.  As such you cannot just freeze
them and restore them like you would regular device memory.  That is
where something like suspend/resume comes in because it already takes
care of getting the device ready for halt, and then resume.  Keep in
mind that those functions were meant to function on a device doing
something like a suspend to RAM or disk.  This is not too far off from
what a migration is doing since you need to halt the guest before you
move it.

As such the first step is to make it so that we can do the current
bonding approach with one change.  Specifically we want to leave the
device in the guest until the last portion of the migration instead of
having to remove it first.  To that end I would suggest focusing on
solving the DMA problem via something like a dma_mark_clean() type
solution as that would be one issue resolved and we all would see an
immediate gain instead of just those users of the ixgbevf driver.

> So I think the performance impact should also be taken into account
> when we design the framework.

What you are proposing I would call premature optimization.  You need
to actually solve the problem before you can start optimizing things
and I don't see anything actually solved yet since your solution is
too unstable.

>>
>> What I am counter proposing seems like a very simple proposition.  It
>> can be implemented in two steps.
>>
>> 1.  Look at modifying dma_mark_clean().  It is a function called in
>> the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
>> modify it to take care of marking the pages you unmap for Rx as being
>> dirty it will get you a good way towards your goal as it will allow
>> you to continue to do DMA while you are migrating the VM.
>>
>> 2.  Look at making use of the existing PCI suspend/resume calls that
>> are there to support PCI power management.  They have everything
>> needed to allow you to pause and resume DMA for the device before and
>> after the migration while retaining the driver state.  If you can
>> implement something that allows you to trigger these calls from the
>> PCI subsystem such as hot-plug then you would have a generic solution
>> that can be easily reproduced for multiple drivers beyond those
>> supported by ixgbevf.
>
>
> I glanced at the PCI hotplug code. The hotplug events are triggered by
> the PCI hotplug controller and these events are defined in the
> controller spec, so it's hard to add more events. Otherwise, we would
> also need to add some specific code to the PCI hotplug core, since it
> only adds and removes a PCI device when it gets events. It's also a
> challenge to modify the Windows
> 

Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-29 Thread Lan, Tianyu

On 11/26/2015 11:56 AM, Alexander Duyck wrote:

> I am not saying you cannot modify the drivers, however what you are
doing is far too invasive.  Do you seriously plan on modifying all of
the PCI device drivers out there in order to allow any device that
might be direct assigned to a port to support migration?  I certainly
hope not.  That is why I have said that this solution will not scale.


Current drivers are not migration friendly. If a driver wants to
support migration, it has to be changed.

RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and
DMA tracking during migration. These are common to most drivers; they
may be problematic in the previous version but can be corrected later.

Doing suspend and resume() may make migration easier, but some devices
require low service downtime. This is especially true for networking;
I've heard that some cloud companies have promised less than 500ms of
network service downtime.

So I think the performance impact should also be taken into account
when we design the framework.





What I am counter proposing seems like a very simple proposition.  It
can be implemented in two steps.

1.  Look at modifying dma_mark_clean().  It is a function called in
the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
modify it to take care of marking the pages you unmap for Rx as being
dirty it will get you a good way towards your goal as it will allow
you to continue to do DMA while you are migrating the VM.

2.  Look at making use of the existing PCI suspend/resume calls that
are there to support PCI power management.  They have everything
needed to allow you to pause and resume DMA for the device before and
after the migration while retaining the driver state.  If you can
implement something that allows you to trigger these calls from the
PCI subsystem such as hot-plug then you would have a generic solution
that can be easily reproduced for multiple drivers beyond those
supported by ixgbevf.


I glanced at the PCI hotplug code. The hotplug events are triggered by
the PCI hotplug controller and these events are defined in the
controller spec, so it's hard to add more events. Otherwise, we would
also need to add some specific code to the PCI hotplug core, since it
only adds and removes a PCI device when it gets events. It's also a
challenge to modify the Windows hotplug code, so we may need to find
another way.






Thanks.

- Alex



Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Alexander Duyck
On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu  wrote:
> On 2015-11-25 13:30, Alexander Duyck wrote:
>> No, what I am getting at is that you can't go around and modify the
>> configuration space for every possible device out there.  This
>> solution won't scale.
>
>
> PCI config space registers are emulated by Qemu, so we can find free
> PCI config space registers for the faked PCI capability. Its position
> does not have to be permanent.

Yes, but do you really want to edit every driver on every OS that you
plan to support this on?  What about things like direct assignment of
regular Ethernet ports?  What you really need is a solution that will
work generically on any existing piece of hardware out there.

>>  If you instead moved the logic for notifying
>> the device into a separate mechanism such as making it a part of the
>> hot-plug logic then you only have to write the code once per OS in
>> order to get the hot-plug capability to pause/resume the device.  What
>> I am talking about is not full hot-plug, but rather to extend the
>> existing hot-plug in Qemu and the Linux kernel to support a
>> "pause/resume" functionality.  The PCI hot-plug specification calls
>> out the option of implementing something like this, but we don't
>> currently have support for it.
>>
>
> Could you elaborate on the part of the PCI hot-plug specification you mentioned?
>
> My concern is whether it requires changing the PCI spec or not.


In the PCI Hot-Plug Specification 1.1, in section 4.1.2 it states:
 In addition to quiescing add-in card activity, an operating-system
vendor may optionally implement a less drastic “pause” capability, in
anticipation of the same or a similar add-in card being reinserted.

The idea I had was basically if we were to implement something like
that in Linux then we could pause/resume the device instead of
outright removing it.  The pause functionality could make use of the
suspend/resume functionality most drivers already have for PCI power
management.
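
A minimal sketch of that shape, assuming a hypothetical "pause" event
delivered to the slot; only the dev_pm_ops plumbing below is existing
infrastructure, the two functions themselves are made up:

#include <linux/errno.h>
#include <linux/pci.h>
#include <linux/pm.h>

/* Handlers a hot-plug "pause" extension could call; they simply reuse
 * the power-management callbacks the bound driver already provides. */
static int slot_pause_device(struct pci_dev *pdev)
{
        const struct dev_pm_ops *pm =
                pdev->driver ? pdev->driver->driver.pm : NULL;

        if (!pm || !pm->suspend)
                return -ENOSYS;
        return pm->suspend(&pdev->dev); /* driver saves state, halts DMA */
}

static int slot_resume_device(struct pci_dev *pdev)
{
        const struct dev_pm_ops *pm =
                pdev->driver ? pdev->driver->driver.pm : NULL;

        if (!pm || !pm->resume)
                return -ENOSYS;
        return pm->resume(&pdev->dev);  /* driver restores state, re-arms */
}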

- Alex


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Alexander Duyck
On Wed, Nov 25, 2015 at 7:15 PM, Dong, Eddie  wrote:
>> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu  wrote:
>> > On 2015-11-25 13:30, Alexander Duyck wrote:
>> >> No, what I am getting at is that you can't go around and modify the
>> >> configuration space for every possible device out there.  This
>> >> solution won't scale.
>> >
>> >
>> > PCI config space registers are emulated by Qemu, so we can find free
>> > PCI config space registers for the faked PCI capability. Its position
>> > does not have to be permanent.
>>
>> Yes, but do you really want to edit every driver on every OS that you plan to
>> support this on?  What about things like direct assignment of regular
>> Ethernet
>> ports?  What you really need is a solution that will work generically on any
>> existing piece of hardware out there.
>
> The fundamental assumption of this patch series is to modify the driver
> in the guest to self-emulate or track the device state, so that
> migration becomes possible.
> I don't think we can modify the OS without modifying the drivers, even
> using the PCIe hotplug mechanism.
> In the meantime, modifying the Windows OS is a big challenge given that
> only Microsoft can do it, while modifying a driver is relatively simple
> and manageable for device vendors, if the device vendor wants to
> support state-clone based migration.

The problem is the code you are presenting, even as a proof of concept
is seriously flawed.  It does a poor job of exposing how any of this
can be duplicated for any other VF other than the one you are working
on.

I am not saying you cannot modify the drivers, however what you are
doing is far too invasive.  Do you seriously plan on modifying all of
the PCI device drivers out there in order to allow any device that
might be direct assigned to a port to support migration?  I certainly
hope not.  That is why I have said that this solution will not scale.

What I am counter proposing seems like a very simple proposition.  It
can be implemented in two steps.

1.  Look at modifying dma_mark_clean().  It is a function called in
the sync and unmap paths of the lib/swiotlb.c.  If you could somehow
modify it to take care of marking the pages you unmap for Rx as being
dirty it will get you a good way towards your goal as it will allow
you to continue to do DMA while you are migrating the VM.

2.  Look at making use of the existing PCI suspend/resume calls that
are there to support PCI power management.  They have everything
needed to allow you to pause and resume DMA for the device before and
after the migration while retaining the driver state.  If you can
implement something that allows you to trigger these calls from the
PCI subsystem such as hot-plug then you would have a generic solution
that can be easily reproduced for multiple drivers beyond those
supported by ixgbevf.

Thanks.

- Alex


RE: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Dong, Eddie
> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu  wrote:
> > On 2015-11-25 13:30, Alexander Duyck wrote:
> >> No, what I am getting at is that you can't go around and modify the
> >> configuration space for every possible device out there.  This
> >> solution won't scale.
> >
> >
> > PCI config space registers are emulated by Qemu, so we can find free
> > PCI config space registers for the faked PCI capability. Its position
> > does not have to be permanent.
> 
> Yes, but do you really want to edit every driver on every OS that you plan to
> support this on?  What about things like direct assignment of regular Ethernet
> ports?  What you really need is a solution that will work generically on any
> existing piece of hardware out there.

The fundamental assumption of this patch series is to modify the driver
in the guest to self-emulate or track the device state, so that
migration becomes possible.
I don't think we can modify the OS without modifying the drivers, even
using the PCIe hotplug mechanism.
In the meantime, modifying the Windows OS is a big challenge given that
only Microsoft can do it, while modifying a driver is relatively simple
and manageable for device vendors, if the device vendor wants to
support state-clone based migration.

Thx Eddie


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-25 Thread Lan Tianyu
On 2015-11-25 13:30, Alexander Duyck wrote:
> No, what I am getting at is that you can't go around and modify the
> configuration space for every possible device out there.  This
> solution won't scale.


PCI config space registers are emulated by Qemu, so we can find free
PCI config space registers for the faked PCI capability. Its position
does not have to be permanent.


>  If you instead moved the logic for notifying
> the device into a separate mechanism such as making it a part of the
> hot-plug logic then you only have to write the code once per OS in
> order to get the hot-plug capability to pause/resume the device.  What
> I am talking about is not full hot-plug, but rather to extend the
> existing hot-plug in Qemu and the Linux kernel to support a
> "pause/resume" functionality.  The PCI hot-plug specification calls
> out the option of implementing something like this, but we don't
> currently have support for it.
>

Could you elaborate on the part of the PCI hot-plug specification you mentioned?

My concern is whether it requires changing the PCI spec or not.



> I just feel doing it through PCI hot-plug messages will scale much
> better as you could likely make use of the power management
> suspend/resume calls to take care of most of the needed implementation
> details.
> 
> - Alex
-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Lan Tianyu
On 2015-11-24 22:20, Alexander Duyck wrote:
> I'm still not a fan of this approach.  I really feel like this is
> something that should be resolved by extending the existing PCI hot-plug
> rather than trying to instrument this per driver.  Then you will get the
> goodness for multiple drivers and multiple OSes instead of just one.  An
> added advantage to dealing with this in the PCI hot-plug environment
> would be that you could then still do a hot-plug even if the guest
> didn't load a driver for the VF since you would be working with the PCI
> slot instead of the device itself.
> 
> - Alex

Hi Alex:
What you mentioned seems to be the bonding driver solution.
The paper "Live Migration with Pass-through Device for Linux VM"
describes it. It does VF hotplug during migration. In order to maintain
the network connection while the VF is out, it takes advantage of the
Linux bonding driver to switch between the VF NIC and the emulated NIC.
But the side effect is that it requires the VM to do additional
configuration, and the performance while switching between the two NICs
is not good.

-- 
Best regards
Tianyu Lan


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Alexander Duyck
On Tue, Nov 24, 2015 at 7:18 PM, Lan Tianyu  wrote:
> On 2015-11-24 22:20, Alexander Duyck wrote:
>> I'm still not a fan of this approach.  I really feel like this is
>> something that should be resolved by extending the existing PCI hot-plug
>> rather than trying to instrument this per driver.  Then you will get the
>> goodness for multiple drivers and multiple OSes instead of just one.  An
>> added advantage to dealing with this in the PCI hot-plug environment
>> would be that you could then still do a hot-plug even if the guest
>> didn't load a driver for the VF since you would be working with the PCI
>> slot instead of the device itself.
>>
>> - Alex
>
> Hi Alex:
> What you mentioned seems to be the bonding driver solution.
> The paper "Live Migration with Pass-through Device for Linux VM"
> describes it. It does VF hotplug during migration. In order to maintain
> the network connection while the VF is out, it takes advantage of the
> Linux bonding driver to switch between the VF NIC and the emulated NIC.
> But the side effect is that it requires the VM to do additional
> configuration, and the performance while switching between the two NICs
> is not good.

No, what I am getting at is that you can't go around and modify the
configuration space for every possible device out there.  This
solution won't scale.  If you instead moved the logic for notifying
the device into a separate mechanism such as making it a part of the
hot-plug logic then you only have to write the code once per OS in
order to get the hot-plug capability to pause/resume the device.  What
I am talking about is not full hot-plug, but rather to extend the
existing hot-plug in Qemu and the Linux kernel to support a
"pause/resume" functionality.  The PCI hot-plug specification calls
out the option of implementing something like this, but we don't
currently have support for it.

I just feel doing it through PCI hot-plug messages will scale much
better as you could likely make use of the power management
suspend/resume calls to take care of most of the needed implementation
details.

- Alex


Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC

2015-11-24 Thread Alexander Duyck

On 11/24/2015 05:38 AM, Lan Tianyu wrote:

This patchset proposes a solution for adding live migration
support for SRIOV NICs.

During migration, Qemu needs to let the VF driver in the VM know when
migration starts and ends. Qemu adds a faked PCI migration capability
to help sync status between the two sides during migration.

Qemu triggers the VF's mailbox irq by sending an MSI-X message when the
migration status changes. The VF driver tells Qemu its mailbox vector
index via the new PCI capability. In some cases (the NIC is suspended
or closed), the VF mailbox irq is freed and the VF driver can disable
irq injection via the new capability.

The VF driver will bring the NIC down before migration and bring it up
again on the target machine.

Lan Tianyu (3):
   VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO
   PCI: Add macros for faked PCI migration capability
   Ixgbevf: Add migration support for ixgbevf driver

  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |   5 ++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++
  drivers/vfio/pci/vfio_pci.c   |  21 +
  drivers/vfio/pci/vfio_pci_config.c|  38 ++--
  drivers/vfio/pci/vfio_pci_private.h   |   5 ++
  include/uapi/linux/pci_regs.h |  18 +++-
  include/uapi/linux/vfio.h |  12 +++
  7 files changed, 194 insertions(+), 7 deletions(-)
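
As a rough illustration of the guest side of that handshake (the
capability ID and register offsets below are placeholders, not the
layout the patches actually define):

#include <linux/errno.h>
#include <linux/pci.h>

#define FAKE_MIG_CAP_ID         0x1f    /* placeholder capability ID   */
#define FAKE_MIG_REG_STATUS     0x04    /* placeholder register layout */
#define FAKE_MIG_REG_IRQ_VEC    0x05

/* The VF driver tells Qemu which MSI-X vector to kick when the
 * migration status changes; a reserved value could mean "do not
 * inject" while the NIC is suspended or closed. */
static int vf_report_mailbox_vector(struct pci_dev *pdev, u8 vector)
{
        int pos = pci_find_capability(pdev, FAKE_MIG_CAP_ID);

        if (!pos)
                return -ENODEV;
        return pci_write_config_byte(pdev, pos + FAKE_MIG_REG_IRQ_VEC, vector);
}

static int vf_read_migration_status(struct pci_dev *pdev, u8 *status)
{
        int pos = pci_find_capability(pdev, FAKE_MIG_CAP_ID);

        if (!pos)
                return -ENODEV;
        return pci_read_config_byte(pdev, pos + FAKE_MIG_REG_STATUS, status);
}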


I'm still not a fan of this approach.  I really feel like this is 
something that should be resolved by extending the existing PCI hot-plug 
rather than trying to instrument this per driver.  Then you will get the 
goodness for multiple drivers and multiple OSes instead of just one.  An 
added advantage to dealing with this in the PCI hot-plug environment 
would be that you could then still do a hot-plug even if the guest 
didn't load a driver for the VF since you would be working with the PCI 
slot instead of the device itself.


- Alex