[ovirt-devel] Re: Improving VM behavior in case of IO errors

2020-08-10 Thread Shubha Kulkarni

Thanks Nir for the response.

I have same question as Kevin. It will be great if you can share your 
thoughts on that.


Thanks,

Shubha

On 8/10/2020 4:53 AM, Kevin Wolf wrote:

Am 09.08.2020 um 23:50 hat Nir Soffer geschrieben:

On Wed, Jul 29, 2020 at 2:30 PM Shubha Kulkarni
 wrote:

Thanks for the feedback Nir.

I agree in general that having an additional engine config for disk
level error handling default would be the right way. It would be good to
decide the granularity. Would it make sense to have this for a specific
disk type like lun or would you prefer to make it generic for all types?

This must be for a specific disk type, since for thin images on block
storage we cannot support propagating errors to the guest. This will
break thin provisioning.

Is werror=enospc not enough for thin provisioning to work? This will
still stop the guest for any other kinds of I/O errors.

Kevin


Handling the LUN use case first seems like the best way, since in this case we
don't manage the LUN and we don't support resuming paused using LUNs yet,
so propagating the error may be more useful.

Managed Block Storage (cinderlib based disks) are very much like
direct LUN. In this
case we do manage the disks on the server, but otherwise we don't
support anything
on the host (e.g. monitoring, resuming paused VMs) so propagating the error like
direct LUNs may be more useful.

Images are a bigger problem since thin disks cannot support
propagating errors but
preallocated disks can. But once you create a snapshot prealocated disks behave
exactly like thin disks because they are the same.

Snapshots are also created automatically in for preallocated images,
for example during
live storage migration, and deleted automatically after the migration.
So you cannot
assume that having only preallocated disks is good for propagating errors.

Even if you limit this option to file based storage, this is going to
break when you migrate
the disks to block storage.

Nir


Thanks,

Shubha

On 7/28/2020 2:03 PM, Nir Soffer wrote:

On Tue, Jul 28, 2020 at 4:58 AM Shubha Kulkarni
 wrote:

Hello,

In OVirt, we have a property propagate_error at the disk level that
decides in case of an error, how this error be propagated to the VM.
This value is maintained in the database table with the default value
set as Off. The default setting(Off) results in a policy that ends up
pausing the VM rather than propagating the errors to VM.  There is no
provision in the UI currently to configure this property for disk
(images or luns). So there is no easy way to set this value.  Further,
even if the value is manually set to "On" in db, it gets overwriiten by
UI everytime some other property is updated as described here -
https://bugzilla.redhat.com/show_bug.cgi?id=1669367

Setting the value to "Off" is not ideal for multipath devices where a
single path failure causes vm to pause.

Single path failure should be transparent to qemu. multipath will fail over
the I/O to another path. The I/O will fail only if all paths are down, and
(with the default configuration), multipath path checkers failed 4 times.


It puts serious restrictions for
the DR situation and unlike VMWare * Hyper-V, oVirt is not able to
support the DR functionality -
https://bugzilla.redhat.com/show_bug.cgi?id=1314160

Alghouth in this bug we see that failover that looks successful from multipath
and vdsm point of view ended in paused VM:
https://bugzilla.redhat.com/1860377

Maybe Ben can explain how this can happen.

I hope that qemu will provide more info on errors in the future. If we had a log
about the failure I/O it could be helpful.


While we wait for RFE, the proposal here is to revise the out of the box
behavior for LUNs. For LUNs, we should propagate the errors to VM rather
than directly stopping those. This will allow us to handle short-term
multipath outages and improve availability. This is a simple change in
behavior but will have good positive impact. I would like to seek
feedback about this to make sure that everyone is ok with the proposal.

I think it makes sense, but this is just a default, and it cannot work
for all cases.

This can end in broken VM with read only file system that must be
rebooted, while
with error_policy="stop", failover may be transparent to the VM even
if it was paused
for a short time.

I would start by making engine defaults configurable using engine
config, so different
oVirt distributions can use different defaults.

Nir


___
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/X2XTNSLOBW5SX3XQUJFQKAMK4R2D6IOB/


[ovirt-devel] Re: Improving VM behavior in case of IO errors

2020-08-10 Thread Kevin Wolf
Am 09.08.2020 um 23:50 hat Nir Soffer geschrieben:
> On Wed, Jul 29, 2020 at 2:30 PM Shubha Kulkarni
>  wrote:
> >
> > Thanks for the feedback Nir.
> >
> > I agree in general that having an additional engine config for disk
> > level error handling default would be the right way. It would be good to
> > decide the granularity. Would it make sense to have this for a specific
> > disk type like lun or would you prefer to make it generic for all types?
> 
> This must be for a specific disk type, since for thin images on block
> storage we cannot support propagating errors to the guest. This will
> break thin provisioning.

Is werror=enospc not enough for thin provisioning to work? This will
still stop the guest for any other kinds of I/O errors.

Kevin

> Handling the LUN use case first seems like the best way, since in this case we
> don't manage the LUN and we don't support resuming paused using LUNs yet,
> so propagating the error may be more useful.
> 
> Managed Block Storage (cinderlib based disks) are very much like
> direct LUN. In this
> case we do manage the disks on the server, but otherwise we don't
> support anything
> on the host (e.g. monitoring, resuming paused VMs) so propagating the error 
> like
> direct LUNs may be more useful.
> 
> Images are a bigger problem since thin disks cannot support
> propagating errors but
> preallocated disks can. But once you create a snapshot prealocated disks 
> behave
> exactly like thin disks because they are the same.
> 
> Snapshots are also created automatically in for preallocated images,
> for example during
> live storage migration, and deleted automatically after the migration.
> So you cannot
> assume that having only preallocated disks is good for propagating errors.
> 
> Even if you limit this option to file based storage, this is going to
> break when you migrate
> the disks to block storage.
> 
> Nir
> 
> >
> > Thanks,
> >
> > Shubha
> >
> > On 7/28/2020 2:03 PM, Nir Soffer wrote:
> > > On Tue, Jul 28, 2020 at 4:58 AM Shubha Kulkarni
> > >  wrote:
> > >> Hello,
> > >>
> > >> In OVirt, we have a property propagate_error at the disk level that
> > >> decides in case of an error, how this error be propagated to the VM.
> > >> This value is maintained in the database table with the default value
> > >> set as Off. The default setting(Off) results in a policy that ends up
> > >> pausing the VM rather than propagating the errors to VM.  There is no
> > >> provision in the UI currently to configure this property for disk
> > >> (images or luns). So there is no easy way to set this value.  Further,
> > >> even if the value is manually set to "On" in db, it gets overwriiten by
> > >> UI everytime some other property is updated as described here -
> > >> https://bugzilla.redhat.com/show_bug.cgi?id=1669367
> > >>
> > >> Setting the value to "Off" is not ideal for multipath devices where a
> > >> single path failure causes vm to pause.
> > > Single path failure should be transparent to qemu. multipath will fail 
> > > over
> > > the I/O to another path. The I/O will fail only if all paths are down, and
> > > (with the default configuration), multipath path checkers failed 4 times.
> > >
> > >> It puts serious restrictions for
> > >> the DR situation and unlike VMWare * Hyper-V, oVirt is not able to
> > >> support the DR functionality -
> > >> https://bugzilla.redhat.com/show_bug.cgi?id=1314160
> > > Alghouth in this bug we see that failover that looks successful from 
> > > multipath
> > > and vdsm point of view ended in paused VM:
> > > https://bugzilla.redhat.com/1860377
> > >
> > > Maybe Ben can explain how this can happen.
> > >
> > > I hope that qemu will provide more info on errors in the future. If we 
> > > had a log
> > > about the failure I/O it could be helpful.
> > >
> > >> While we wait for RFE, the proposal here is to revise the out of the box
> > >> behavior for LUNs. For LUNs, we should propagate the errors to VM rather
> > >> than directly stopping those. This will allow us to handle short-term
> > >> multipath outages and improve availability. This is a simple change in
> > >> behavior but will have good positive impact. I would like to seek
> > >> feedback about this to make sure that everyone is ok with the proposal.
> > > I think it makes sense, but this is just a default, and it cannot work
> > > for all cases.
> > >
> > > This can end in broken VM with read only file system that must be
> > > rebooted, while
> > > with error_policy="stop", failover may be transparent to the VM even
> > > if it was paused
> > > for a short time.
> > >
> > > I would start by making engine defaults configurable using engine
> > > config, so different
> > > oVirt distributions can use different defaults.
> > >
> > > Nir
> > >
> >
> 
___
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 

[ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-10 Thread Yan Zhao
On Wed, Aug 05, 2020 at 12:53:19PM +0200, Jiri Pirko wrote:
> Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote:
> >On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> >> 
> >> On 2020/8/5 下午3:56, Jiri Pirko wrote:
> >> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote:
> >> > > On 2020/8/5 上午10:16, Yan Zhao wrote:
> >> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> >> > > > > On 2020/8/5 上午12:35, Cornelia Huck wrote:
> >> > > > > > [sorry about not chiming in earlier]
> >> > > > > > 
> >> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> >> > > > > > Yan Zhao  wrote:
> >> > > > > > 
> >> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson 
> >> > > > > > > wrote:
> >> > > > > > (...)
> >> > > > > > 
> >> > > > > > > > Based on the feedback we've received, the previously 
> >> > > > > > > > proposed interface
> >> > > > > > > > is not viable.  I think there's agreement that the user 
> >> > > > > > > > needs to be
> >> > > > > > > > able to parse and interpret the version information.  Using 
> >> > > > > > > > json seems
> >> > > > > > > > viable, but I don't know if it's the best option.  Is there 
> >> > > > > > > > any
> >> > > > > > > > precedent of markup strings returned via sysfs we could 
> >> > > > > > > > follow?
> >> > > > > > I don't think encoding complex information in a sysfs file is a 
> >> > > > > > viable
> >> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
> >> > > > > > 
> >> > > > > > "Attributes should be ASCII text files, preferably with only one 
> >> > > > > > value
> >> > > > > > per file. It is noted that it may not be efficient to contain 
> >> > > > > > only one
> >> > > > > > value per file, so it is socially acceptable to express an array 
> >> > > > > > of
> >> > > > > > values of the same type.
> >> > > > > > Mixing types, expressing multiple lines of data, and doing fancy
> >> > > > > > formatting of data is heavily frowned upon."
> >> > > > > > 
> >> > > > > > Even though this is an older file, I think these restrictions 
> >> > > > > > still
> >> > > > > > apply.
> >> > > > > +1, that's another reason why devlink(netlink) is better.
> >> > > > > 
> >> > > > hi Jason,
> >> > > > do you have any materials or sample code about devlink, so we can 
> >> > > > have a good
> >> > > > study of it?
> >> > > > I found some kernel docs about it but my preliminary study didn't 
> >> > > > show me the
> >> > > > advantage of devlink.
> >> > > 
> >> > > CC Jiri and Parav for a better answer for this.
> >> > > 
> >> > > My understanding is that the following advantages are obvious (as I 
> >> > > replied
> >> > > in another thread):
> >> > > 
> >> > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> >> > > - much better error reporting (ext_ack other than string or errno)
> >> > > - namespace aware
> >> > > - do not couple with kobject
> >> > Jason, what is your use case?
> >> 
> >> 
> >> I think the use case is to report device compatibility for live migration.
> >> Yan proposed a simple sysfs based migration version first, but it looks not
> >> sufficient and something based on JSON is discussed.
> >> 
> >> Yan, can you help to summarize the discussion so far for Jiri as a
> >> reference?
> >> 
> >yes.
> >we are currently defining an device live migration compatibility
> >interface in order to let user space like openstack and libvirt knows
> >which two devices are live migration compatible.
> >currently the devices include mdev (a kernel emulated virtual device)
> >and physical devices (e.g.  a VF of a PCI SRIOV device).
> >
> >the attributes we want user space to compare including
> >common attribues:
> >device_api: vfio-pci, vfio-ccw...
> >mdev_type: mdev type of mdev or similar signature for physical device
> >   It specifies a device's hardware capability. e.g.
> >i915-GVTg_V5_4 means it's of 1/4 of a gen9 Intel graphics
> >device.
> >software_version: device driver's version.
> >   in .[.bugfix] scheme, where there is no
> >compatibility across major versions, minor versions have
> >forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and
> >bugfix version number indicates some degree of internal
> >improvement that is not visible to the user in terms of
> >features or compatibility,
> >
> >vendor specific attributes: each vendor may define different attributes
> >   device id : device id of a physical devices or mdev's parent pci device.
> >   it could be equal to pci id for pci devices
> >   aggregator: used together with mdev_type. e.g. aggregator=2 together
> >   with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel
> >graphics device.
> >   remote_url: for a local NVMe VF, it may be configured with a remote
> >   url of a remote storage and all data is stored in the
> >remote side 

[ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-10 Thread Cornelia Huck
On Wed, 05 Aug 2020 12:35:01 +0100
Sean Mooney  wrote:

> On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
> > Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote:  

(...)

> > >software_version: device driver's version.
> > >   in .[.bugfix] scheme, where there is no
> > >  compatibility across major versions, minor versions have
> > >  forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and
> > >  bugfix version number indicates some degree of internal
> > >  improvement that is not visible to the user in terms of
> > >  features or compatibility,
> > > 
> > > vendor specific attributes: each vendor may define different attributes
> > >   device id : device id of a physical devices or mdev's parent pci device.
> > >   it could be equal to pci id for pci devices
> > >   aggregator: used together with mdev_type. e.g. aggregator=2 together
> > >   with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel
> > >  graphics device.
> > >   remote_url: for a local NVMe VF, it may be configured with a remote
> > >   url of a remote storage and all data is stored in the
> > >  remote side specified by the remote url.
> > >   ...  
> just a minor not that i find ^ much more simmple to understand then
> the current proposal with self and compatiable.
> if i have well defiend attibute that i can parse and understand that allow
> me to calulate the what is and is not compatible that is likely going to
> more useful as you wont have to keep maintianing a list of other compatible
> devices every time a new sku is released.
> 
> in anycase thank for actully shareing ^ as it make it simpler to reson about 
> what
> you have previously proposed.

So, what would be the most helpful format? A 'software_version' field
that follows the conventions outlined above, and other (possibly
optional) fields that have to match?

(...)

> > Thanks for the explanation, I'm still fuzzy about the details.
> > Anyway, I suggest you to check "devlink dev info" command we have
> > implemented for multiple drivers.  
> 
> is devlink exposed as a filesytem we can read with just open?
> openstack will likely try to leverage libvirt to get this info but when we
> cant its much simpler to read sysfs then it is to take a a depenency on a 
> commandline
> too and have to fork shell to execute it and parse the cli output.
> pyroute2 which we use in some openstack poject has basic python binding for 
> devlink but im not
> sure how complete it is as i think its relitivly new addtion. if we need to 
> take a dependcy
> we will but that would be a drawback fo devlink not that that is a large one 
> just something
> to keep in mind.

A devlinkfs, maybe? At least for reading information (IIUC, "devlink
dev info" is only about information retrieval, right?)
___
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/devel@ovirt.org/message/X75UR2VMBOZPESPFRBT5YVYUKUMZT2Q4/