On 08/02/2013 06:22 AM, Xavier Trilla wrote:
Hi,
We have been experimenting with GlusterFS for a while (currently
version 3.4). We are testing whether GlusterFS can really be used as
the distributed storage for OpenStack block storage (Cinder), since
new features in KVM, GlusterFS and OpenStack point to GlusterFS as the
future of OpenStack open source block and object storage.
But we ran into a problem as soon as we started: the way the
distribute translator (DHT) balances the load. We understand and see
the benefits of a metadata-less setup. Hashing filenames and assigning
a hash range to each brick is clever, reliable and fast, but as we
understand it there is a big problem when it comes to storing the VM
images of an OpenStack deployment.
OpenStack Block Storage (Cinder) assigns a name (a GUID) to each
volume it creates, so GlusterFS hashes the filename and decides which
brick should store it. But in this scenario we don't have many files
(just one big file per VM), so we may end up with really unbalanced
storage.
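As a rough sketch of the placement idea described above (this is not
GlusterFS's actual code): the 32-bit hash space is split into equal
ranges, one per brick, and a file lands on the brick whose range
contains the hash of its name. The CRC32 hash, the equal-sized ranges
and the helper names brick_range_starts, name_hash and pick_brick are
all assumptions made for illustration; GlusterFS uses its own hash
function and stores the ranges in the directory layout.

```python
import bisect
import zlib

HASH_SPACE = 2 ** 32  # size of a 32-bit hash space


def brick_range_starts(num_bricks):
    """Start of each brick's hash range, assuming an equal-sized split."""
    return [i * HASH_SPACE // num_bricks for i in range(num_bricks)]


def name_hash(filename):
    """Stand-in 32-bit hash (CRC32); an assumption, not the hash DHT uses."""
    return zlib.crc32(filename.encode("utf-8"))


def pick_brick(filename, starts):
    """Index of the brick whose range contains the hash of the filename."""
    # starts[0] is 0, so bisect_right always returns at least 1.
    return bisect.bisect_right(starts, name_hash(filename)) - 1


if __name__ == "__main__":
    starts = brick_range_starts(4)
    # Hypothetical volume file name, for illustration only.
    name = "volume-1b4e28ba-2fa1-11d2-883f-0016d3cca427"
    print(name, "-> brick", pick_brick(name, starts) + 1)
```

The placement depends only on the name, never on how full or how busy
a brick already is, which is the property the rest of this mail is
concerned with.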
Let's say we have a 4-brick setup with DHT, and we want to store
100 VMs there. The ideal scenario would be:
Brick1: 25 VMs
Brick2: 25 VMs
Brick3: 25 VMs
Brick4: 25 VMs
As VMs are I/O intensive, it's really important to balance the load
correctly, since each brick has a limited number of IOPS. But as DHT
is based only on a hash of the filename, we could end up with
something like the following scenario (or even worse):
Brick1: 50 VMs
Brick2: 10 VMs
Brick3: 35 VMs
Brick4: 5 VMs
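One way to get a feel for how far the per-brick counts drift from the
ideal is a small simulation. The sketch below is only that: it
generates random GUID-style names, hashes them with CRC32 as a
stand-in for the real DHT hash, and maps each hash onto equal-sized
ranges. The "volume-" name prefix, the hash, the equal ranges and the
function name simulate are assumptions, so the numbers it prints
illustrate the idea and are not a measurement of GlusterFS.

```python
import random
import uuid
import zlib
from collections import Counter


def simulate(num_bricks, num_volumes, seed=0):
    """Count how many randomly named volumes land on each brick."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    counts = Counter()
    for _ in range(num_volumes):
        # Assumed naming scheme: "volume-" followed by a random GUID.
        name = "volume-" + str(uuid.UUID(int=rng.getrandbits(128)))
        h = zlib.crc32(name.encode("utf-8"))  # stand-in 32-bit hash
        # Map the hash onto num_bricks equal-sized ranges.
        counts[h * num_bricks // 2 ** 32] += 1
    return [counts[b] for b in range(num_bricks)]


if __name__ == "__main__":
    for seed in range(3):
        per_brick = simulate(4, 100, seed=seed)
        print("seed", seed, per_brick,
              "min", min(per_brick), "max", max(per_brick))
```

The same function can be called with larger numbers, for example
simulate(32, 1500), to compare the fullest and emptiest brick at a
bigger scale.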
And if we scale this out, things may get even worse: we may end up
with almost all VM files on one or two bricks and all the other
bricks almost empty. And if we use growing VM disk image formats like
qcow2, the "min-free-disk" option will not prevent all VM disk image
files from being stored on the same brick. So, I understand DHT works
well for a large number of small files, but for a few big,
I/O-intensive files it doesn't seem to be a really good solution...
(We are looking for a solution able to handle around 32 bricks and
around 1500 VMs for the initial deployment, and able to scale up to
256 bricks and 12000 VMs :/ )
So, does anybody have a suggestion on how to handle this? So far we
see only two options: either use the legacy unify translator with the
ALU scheduler, or use the cluster/stripe translator with a big
block-size so that the load at least gets balanced across all bricks
in some way. But obviously we don't like unify, as it needs a
namespace brick, and striping seems to have an impact on performance
and really complicates backup/restore/recovery strategies.
Another suggestion you may want to try: have your GlusterFS node also
serve as the OpenStack Cinder node, and use NUFA [1].
~shanks
[1] http://gluster.org/community/documentation/index.php/Translators/cluster/nufa