On Wed, Nov 30, 2016 at 04:20:03AM +0000, Rakesh Ranjan wrote:
> >>>>> Why does the client have to know about failover if it's connected to
> >>>>>a server process on the same host? I thought the server process
> >>>>>manages networking issues (like the actual protocol to speak to other
> >>>>>VxHS nodes and for failover).
>
> Just to comment on this, the model being followed within HyperScale is to
> allow application I/O continuity (resiliency) in various cases as
> mentioned below. It really adds value for consumer/customer and tries to
> avoid culprits for single points of failure.
>
> 1. HyperScale storage service failure (QNIO Server)
> - Daemon managing local storage for VMs and runs on each compute node
> - Daemon can run as a service on Hypervisor itself as well as within VSA
> (Virtual Storage Appliance or Virtual Machine running on the hypervisor),
> which depends on ecosystem where HyperScale is supported
> - Daemon or storage service down/crash/crash-in-loop shouldn't lead to an
> huge impact on all the VMs running on that hypervisor or compute node
> hence providing service level resiliency is very useful for
> application I/O continuity in such case.
>
> Solution:
> - The service failure handling can be only done at the client side and
> not at the server side since service running as a server itself is down.
> - Client detects an I/O error and depending on the logic, it does
> application I/O failover to another available/active QNIO server or
> HyperScale Storage service running on different compute node
> (reflection/replication node)
> - Once the orig/old server comes back online, client gets/receives
> negotiated error (not a real application error) to do the application I/O
> failback to the original server or local HyperScale storage service to get
> better I/O performance.
>
> 2. Local physical storage or media failure
> - Once server or HyperScale storage service detects the media or local
> disk failure, depending on the vDisk (guest disk) configuration, if
> another storage copy is available
> on different compute node then it internally handles the local
> fault and serves the application read and write requests otherwise
> application or client gets the fault.
> - Client doesn't know about any I/O failure since Server or Storage
> service manages/handles the fault tolerance.
> - In such case, in order to get some I/O performance benefit, once
> client gets a negotiated error (not an application error) from local
> server or storage service,
> client can initiate I/O failover and can directly send
> application I/O to another compute node where storage copy is available to
> serve the application need instead of sending it locally where media is
> faulted.
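The failover and failback behaviour described in the two cases above can be sketched as a small client-side state machine. This is a minimal illustration under stated assumptions, not VxHS or libqnio code: the `Reply` kinds, the `FailoverClient` class and the node names are hypothetical, and it assumes the client already knows which replica nodes hold a copy of the vDisk.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Reply(Enum):
    """Hypothetical reply kinds a client might receive (not a real libqnio API)."""
    OK = auto()
    IO_ERROR = auto()             # case 1: server down/crashed, request failed
    NEGOTIATED_FAILOVER = auto()  # case 2: local media faulted, server asks client to go elsewhere
    NEGOTIATED_FAILBACK = auto()  # case 1: original server is back online


@dataclass
class FailoverClient:
    local: str           # hypothetical name of the local QNIO server
    replicas: list[str]  # hypothetical remote nodes holding a copy of the vDisk
    current: str = ""

    def __post_init__(self) -> None:
        # Start on the local server for the best I/O performance.
        if not self.current:
            self.current = self.local

    def handle(self, reply: Reply) -> str:
        """Update the target after a reply and return the server for the next I/O.

        Negotiated replies are never reported to the application. After an
        IO_ERROR the caller is expected to retry the failed request on the
        returned server.
        """
        if reply in (Reply.IO_ERROR, Reply.NEGOTIATED_FAILOVER):
            candidates = [s for s in self.replicas if s != self.current]
            if not candidates:
                # No other copy available: the application gets the fault.
                raise RuntimeError("no replica available")
            self.current = candidates[0]
        elif reply is Reply.NEGOTIATED_FAILBACK:
            # Local server is healthy again: fail back for better performance.
            self.current = self.local
        return self.current
```

For example, a client starting on `node-a` with replicas `node-b` and `node-c` moves to `node-b` on an I/O error and returns to `node-a` on a negotiated failback. Always picking the first remaining replica is a deliberate simplification; a real client would presumably also track which replicas are healthy.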
Thanks for explaining the model.

The new information for me here is that the qnio server may run in a VM instead of on the host and that the client will attempt to use a remote qnio server if the local qnio server fails.

This means that although the discussion most recently focussed on local I/O tap performance, there is a requirement for a network protocol too. The local I/O tap stuff is just an optimization for when the local qnio server can be used.

Stefan
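The point that the network protocol is the required path and the local I/O tap only an optimization can be sketched as a transport choice. This is a hypothetical illustration: `Transport`, `choose_transport` and the `local_usable` flag are invented names, and what makes the local server "usable" for the tap is left as an assumption rather than defined here.

```python
from enum import Enum


class Transport(Enum):
    LOCAL_TAP = "local-tap"  # hypothetical fast path to the local qnio server
    NETWORK = "network"      # network protocol, works for local and remote servers


def choose_transport(server: str, local_server: str, local_usable: bool) -> Transport:
    """Pick the transport for the server the client is currently talking to.

    The tap is used only when the target is the local qnio server and that
    server can be used; every other case, including failover to a remote
    node, falls back to the network protocol.
    """
    if server == local_server and local_usable:
        return Transport.LOCAL_TAP
    return Transport.NETWORK
```

Because every branch that is not the fast path returns `NETWORK`, the client stays correct even if the tap is never available; the tap only changes performance.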