On Mon, Mar 7, 2016 at 12:33 AM, Tim Bell <tim.b...@cern.ch> wrote: > From: joe <j...@topjian.net> > Date: Monday 7 March 2016 at 07:53 > To: openstack-operators <openstack-operators@lists.openstack.org> > Subject: Re: [Openstack-operators] RAID / stripe block storage volumes > > We ($work) have been researching this topic for the past few weeks and I > wanted to give an update on what we've found. > > First, we've found that both Rackspace and Azure advocate the use of > RAID'ing block storage volumes from within an instance for both performance > and resilience [1][2][3]. I only mention this to add to the earlier Amazon > AWS information and not to imply that more people should share this view. > > Second, we discovered virtio-scsi [4]. By adding the following properties > to an image, the disks will now appear as SCSI disks, including the more > common /dev/sdx naming: > > hw_disk_bus_model=virtio-scsi > hw_scsi_model=virtio-scsi > hw_disk_bus=scsi > > What's notable is that, in our testing, ZFS pools and Gluster replicas are > more likely to see the volume disconnect/fail with virtio-scsi. mdadm has > always been fairly dependable, so there hasn't been a change there. We're > still testing, but virtio-scsi looks promising. > > > We found significantly slower performance (~20%) with virtio-scsi on bonnie++. I > had been thinking it would be better. > > What were your performance experiences? > > Tim >
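(For anyone who wants to try this: applying those image properties is a single command. A minimal sketch, assuming the unified "openstack" CLI is available; "my-image" is a placeholder for an existing Glance image, and hw_scsi_model/hw_disk_bus are the two properties documented on the wiki [4].)

    # Expose the disks of instances booted from this image via virtio-scsi
    # ("my-image" is a placeholder; requires python-openstackclient)
    openstack image set \
      --property hw_scsi_model=virtio-scsi \
      --property hw_disk_bus=scsi \
      my-image

(Volumes attached to instances booted from the updated image should then show up on the SCSI bus inside the guest, i.e. /dev/sda, /dev/sdb, and so on.)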
That's one area we're still testing. We're seeing a 15% increase in reads for 4k - 1m blocks but anywhere from a 3-20% decrease in all types of writing activity. Something seems off... or at least there should be a reason for it. > > 1: > https://support.rackspace.com/how-to/configuring-a-software-raid-on-a-linux-general-purpose-cloud-server/ > 2: https://support.rackspace.com/how-to/cloud-block-storage-faq/ > 3: > https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-configure-raid/ > 4: https://wiki.openstack.org/wiki/LibvirtVirtioScsi > > On Mon, Feb 8, 2016 at 7:18 PM, Joe Topjian <j...@topjian.net> wrote: > >> Yep. Don't get me wrong -- I agree 100% with everything you've said >> throughout this thread. Applications that have native replication are >> awesome. Swift is crazy awesome. :) >> >> I understand that some may see the use of mdadm, Cinder-assisted >> replication, etc. as supporting "pet" environments, and I agree to some >> extent. But I do think there are applicable use-cases where those services >> could be very helpful. >> >> As one example, I know of large cloud-based environments which handle >> very large data sets and are entirely stood up through configuration >> management systems. However, due to the sheer size of data being handled, >> rebuilding or resyncing a portion of the environment could take hours. >> Failing over to a replicated volume is instant. In addition, being able to >> both stripe and replicate goes a very long way in making the most out of >> commodity block storage environments (for example, avoiding packing >> problems and such). >> >> Should these types of applications be reading / writing directly to >> Swift, HDFS, or handling replication themselves? Sure, in a perfect world. >> Does Gluster fill all gaps I've mentioned? Kind of. >> >> I guess I'm just trying to survey the options available for applications >> and environments that would otherwise be very flexible and resilient if it >> wasn't for their awkward use of storage. :) >> >> On Mon, Feb 8, 2016 at 6:18 PM, Robert Starmer <rob...@kumul.us> wrote: >> >>> Besides, wouldn't it be better to actually do application-layer backup and >>> restore, or application-level distribution for replication? That >>> architecture at least lets the application determine and deal with corrupt >>> data transmission, rather than the DRBD-like model where, if you corrupt one >>> data-set, you corrupt them all... >>> >>> Hence my comment about having some form of object storage (SWIFT is >>> perhaps even a good example of this architecture: the proxy replicates, >>> checks MD5, etc. to verify good data, rather than just replicating blocks >>> of data). >>> >>> >>> >>> On Mon, Feb 8, 2016 at 7:15 PM, Robert Starmer <rob...@kumul.us> wrote: >>> >>>> I have not run into anyone replicating volumes or creating redundancy >>>> at the VM level (beyond, as you point out, HDFS, etc.). >>>> >>>> R >>>> >>>> On Mon, Feb 8, 2016 at 6:54 PM, Joe Topjian <j...@topjian.net> wrote: >>>> >>>>> This is a great conversation and I really appreciate everyone's input. >>>>> Though, I agree, we wandered off the original question and that's my fault >>>>> for mentioning various storage backends. >>>>> >>>>> For the sake of conversation, let's just say the user has no knowledge >>>>> of the underlying storage technology. They're presented with a Block >>>>> Storage service and the rest is up to them. What known, working options >>>>> does the user have to build their own block storage resilience?
(Ignoring >>>>> "obvious" solutions where the application has native replication, such as >>>>> Galera, elasticsearch, etc.) >>>>> >>>>> I have seen references to Cinder supporting replication, but I'm not >>>>> able to find a lot of information about it. The support matrix[1] lists >>>>> very few drivers that actually implement replication -- is this true or is >>>>> there a trove of replication docs that I just haven't been able to find? >>>>> >>>>> Amazon AWS publishes instructions on how to use mdadm with EBS[2]. One >>>>> might interpret that to mean mdadm is a supported solution within EC2-based >>>>> instances. >>>>> >>>>> There are also references to DRBD and EC2, though I could not find >>>>> anything as "official" as mdadm and EC2. >>>>> >>>>> Does anyone have experience (or know users) doing either? >>>>> (specifically with libvirt/KVM, but I'd be curious to know in general) >>>>> >>>>> Or is it more advisable to create multiple instances where data is >>>>> replicated instance-to-instance rather than a single instance with >>>>> multiple >>>>> volumes and have data replicated volume-to-volume (by way of a single >>>>> instance)? And if so, why? Is a lack of stable volume-to-volume >>>>> replication >>>>> a limitation of certain hypervisors? >>>>> >>>>> Or has this area just not been explored in depth within OpenStack >>>>> environments yet? >>>>> >>>>> 1: https://wiki.openstack.org/wiki/CinderSupportMatrix >>>>> 2: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/raid-config.html >>>>> >>>>> >>>>> On Mon, Feb 8, 2016 at 4:10 PM, Robert Starmer <rob...@kumul.us> >>>>> wrote: >>>>> >>>>>> I'm not against Ceph, but even 2 machines (and really 2 machines with >>>>>> enough storage to be meaningful, e.g. not the all-blade environments I've >>>>>> built some o7k systems on) may not be available for storage, so there >>>>>> are >>>>>> cases where that's not necessarily the solution. I built resiliency in >>>>>> one >>>>>> environment with a 2-node controller/Glance/db system with Gluster, which >>>>>> enabled enough middleware resiliency to meet the customer's recovery >>>>>> expectations. Regardless, even with a cattle application model, the >>>>>> infrastructure middleware still needs to be able to provide some level of >>>>>> resiliency. >>>>>> >>>>>> But we've kind-of wandered off the original question. To bring this >>>>>> back on topic, I think users can build resilience in their >>>>>> own storage construction, but I still think there are use cases where the >>>>>> middleware either needs to use its own resiliency layer, and/or may end >>>>>> up >>>>>> providing it for the end user. >>>>>> >>>>>> R >>>>>> >>>>>> On Mon, Feb 8, 2016 at 3:51 PM, Fox, Kevin M <kevin....@pnnl.gov> >>>>>> wrote: >>>>>> >>>>>>> We've used Ceph to address the storage requirement in small clouds >>>>>>> pretty well. It works pretty well with only two storage nodes with >>>>>>> replication set to 2, and because of the radosgw, you can share your >>>>>>> small >>>>>>> amount of storage between the object store and the block store, avoiding >>>>>>> the >>>>>>> need to overprovision swift-only or cinder-only to handle usage >>>>>>> unknowns. >>>>>>> It's just one pool of storage. >>>>>>> >>>>>>> You're right, using LVM is like telling your users "don't do pets," but >>>>>>> then having pets at the heart of your system. When you lose one, you >>>>>>> lose >>>>>>> a lot.
With a small Ceph, you can take out one of the nodes, burn it to >>>>>>> the >>>>>>> ground and put it back, and it just works. No pets. >>>>>>> >>>>>>> Do consider Ceph for the small use case. >>>>>>> >>>>>>> Thanks, >>>>>>> Kevin >>>>>>> >>>>>>> ------------------------------ >>>>>>> *From:* Robert Starmer [rob...@kumul.us] >>>>>>> *Sent:* Monday, February 08, 2016 1:30 PM >>>>>>> *To:* Ned Rhudy >>>>>>> *Cc:* OpenStack Operators >>>>>>> >>>>>>> *Subject:* Re: [Openstack-operators] RAID / stripe block storage >>>>>>> volumes >>>>>>> >>>>>>> Ned's model is the model I meant by "multiple underlying storage >>>>>>> services". Most of the systems I've built are LV/LVM only, a few added >>>>>>> Ceph as an alternative/live-migration option, and one where we used >>>>>>> Gluster >>>>>>> due to size. Note that the environments I have worked with in general >>>>>>> are >>>>>>> small (~20 compute), so huge Ceph environments aren't common. I am also >>>>>>> working on a project where the storage backend is entirely NFS... >>>>>>> >>>>>>> And I think users are more and more educated to assume that there is >>>>>>> nothing guaranteed. There is the realization, at least for a good set >>>>>>> of >>>>>>> the customers I've worked with (and I try to educate the non-believers), >>>>>>> that the way you get the best effect from a system like OpenStack is to >>>>>>> consider everything disposable. The one gap I've seen is that there are >>>>>>> plenty of folks who don't deploy SWIFT, and without some form of object >>>>>>> store, there's still the question of where you place your datasets so >>>>>>> that >>>>>>> they can be quickly recovered (and how do you keep them up to date if >>>>>>> you >>>>>>> do have one). With VMs, there's the concept that you can recover >>>>>>> quickly >>>>>>> because the "dataset", e.g. your OS, is already there for you, and in >>>>>>> plenty >>>>>>> of small environments, that's only as true as the glance repository >>>>>>> (guess >>>>>>> what's usually backing that when there's no SWIFT around...). >>>>>>> >>>>>>> So I see the issue as a holistic one. How do you show >>>>>>> operators/users that they should consider everything disposable if we >>>>>>> only >>>>>>> look at the current running instance as the "thing"? Somewhere you >>>>>>> still >>>>>>> likely need some form of distributed resilience (and yes, I can see >>>>>>> using >>>>>>> the distributed Canonical, CentOS, Red Hat, Fedora, Debian, etc. mirrors >>>>>>> as >>>>>>> your distributed image backup, but what about the database content, >>>>>>> etc.). >>>>>>> >>>>>>> Robert >>>>>>> >>>>>>> On Mon, Feb 8, 2016 at 1:44 PM, Ned Rhudy (BLOOMBERG/ 731 LEX) < >>>>>>> erh...@bloomberg.net> wrote: >>>>>>> >>>>>>>> In our environments, we offer two types of storage. Tenants can >>>>>>>> either use Ceph/RBD and trade speed/latency for reliability and >>>>>>>> protection >>>>>>>> against physical disk failures, or they can launch instances that are >>>>>>>> realized as LVs on an LVM VG that we create on top of a RAID 0 >>>>>>>> spanning all >>>>>>>> but the OS disk on the hypervisor. This lets the users elect to go >>>>>>>> all-in >>>>>>>> on speed and sacrifice reliability for applications where >>>>>>>> replication/HA is >>>>>>>> handled at the app level, if the data on the instance is sourced from >>>>>>>> elsewhere, or if they just don't care much about the data.
>>>>>>>> >>>>>>>> There are some further changes to our approach that we would like >>>>>>>> to make down the road, but in general our users seem to like the >>>>>>>> current >>>>>>>> system and being able to forgo reliability or speed as their >>>>>>>> circumstances >>>>>>>> demand. >>>>>>>> >>>>>>>> From: j...@topjian.net >>>>>>>> Subject: Re: [Openstack-operators] RAID / stripe block storage >>>>>>>> volumes >>>>>>>> >>>>>>>> Hi Robert, >>>>>>>> >>>>>>>> Can you elaborate on "multiple underlying storage services"? >>>>>>>> >>>>>>>> The reason I asked the initial question is because historically >>>>>>>> we've made our block storage service resilient to failure. >>>>>>>> Historically we >>>>>>>> made our compute environment resilient to failure, too, but over >>>>>>>> time, >>>>>>>> we've seen users become more educated to cope with compute failure. As >>>>>>>> a >>>>>>>> result, we've been able to become more lenient with regard to building >>>>>>>> resilient compute environments. >>>>>>>> >>>>>>>> We've been discussing how possible it would be to translate that >>>>>>>> same idea to block storage. Rather than have a large HA storage cluster >>>>>>>> (whether Ceph, Gluster, NetApp, etc.), is it possible to offer simple >>>>>>>> single >>>>>>>> LVM volume servers and push the failure handling on to the user? >>>>>>>> >>>>>>>> Of course, this doesn't work for all types of use cases and >>>>>>>> environments. We still have projects which require the cloud to own >>>>>>>> more >>>>>>>> responsibility for failure than the users. >>>>>>>> >>>>>>>> But for environments where we offer general purpose / best effort >>>>>>>> compute and storage, what methods are available to help the user be >>>>>>>> resilient to block storage failures? >>>>>>>> >>>>>>>> Joe >>>>>>>> >>>>>>>> On Mon, Feb 8, 2016 at 12:09 PM, Robert Starmer <rob...@kumul.us> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I've always recommended offering multiple underlying storage >>>>>>>>> services to provide this rather than adding the overhead to the VM. >>>>>>>>> So, >>>>>>>>> not in any of my systems or any I've worked with. >>>>>>>>> >>>>>>>>> R >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Feb 5, 2016 at 5:56 PM, Joe Topjian <j...@topjian.net> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hello, >>>>>>>>>> >>>>>>>>>> Does anyone have users RAID'ing or striping multiple block >>>>>>>>>> storage volumes from within an instance? >>>>>>>>>> >>>>>>>>>> If so, what was the experience? Good, bad, possible but with >>>>>>>>>> caveats?
>>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Joe >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
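For completeness, here is roughly what the in-instance mdadm approach discussed above looks like. This is only a minimal sketch: it assumes two Cinder volumes are already attached to the instance and visible in the guest as /dev/vdb and /dev/vdc, and the device names, mount point, and mdadm.conf path are placeholders that vary by image and distro.

    # Mirror the two attached volumes (RAID 1); use --level=0 to stripe instead
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/vdb /dev/vdc

    # Create a filesystem and mount it
    mkfs.ext4 /dev/md0
    mkdir -p /mnt/data
    mount /dev/md0 /mnt/data

    # Persist the array definition (path is distro-specific, e.g. /etc/mdadm/mdadm.conf on Ubuntu)
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf

    # Check rebuild/resync state, e.g. after swapping out a failed volume
    cat /proc/mdstat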
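And for the small-Ceph setup Kevin describes, "replication set to 2" is just a per-pool setting. A hedged sketch, assuming the pool backing Cinder is named "volumes" (the pool name is a placeholder):

    # Keep two copies of every object, and stay writable with a single copy left
    ceph osd pool set volumes size 2
    ceph osd pool set volumes min_size 1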
_______________________________________________ OpenStack-operators mailing list OpenStack-operators@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators