Hello,
We previously tested RGW with small objects (about 100
million objects in the few-KB size range) on Emperor.
We found that RGW ground almost to a halt. Based on discussion at the time,
I understood the problem to be related to the index pool in which RGW
stores the object indices, and the fact that RGW must load the entire
bucket index into memory from this pool. That is, the problem is that RGW
has a very sub-optimal method for updating the index.
The suggestion was to modify the client to shard writes across lots of
buckets. I modified a PHP client to test this and I saw some speedup in a
test. But then Firefly came out and I put the project on hold for a
bit.
I'm coming back around to this problem. Based on the Firefly release notes,
I thought the inclusion of a LevelDB backend was meant to address this problem.
I've read through Sebastien Han's and Haomai Wang's discussion of the KV
backends, and it seems these are meant as general optimizations for
storage devices rather than a panacea for the RGW index problem?
For the very particular case I need to solve at the moment, running a
custom client is possible. In my testing with a modified PHP client, I
modified the "openBucket" command to take a second parameter, the number of
bits to use for determining bucket placement. That is,
openBucket("mybucket", 10) would shard over 1024 buckets with names like
"mybucket-shard-0", "mybucket-shard-1", ... "mybucket-shard-1023". For the
purpose of this discussion "shard exponent" refers to the number of bits to
use (the exponent on a 2^N calculation for number of shards). This presents
several problems:
1) This shard exponent needs to be "fixed" into the application. And all of
the clients accessing a bucket need to agree on it (unless they don't, see
below).
2) In theory it's possible to make this value dynamic, by using a "hunting"
algorithm on reads. That is, if an object hashes to shard 1023, then try to
read it. If the read fails, reduce the exponent by 1, find the new bucket
number and try reading again from the bucket defined at that lower shard
level. Keep repeating the process until you find the object or cannot
reduce any more. This allows one to change the exponent on a running bucket
using such a sharding scheme, but imposes several costs.
2a) In the general case, reads will require O(exponent) reads to RGW to
find the bucket containing a given object.
2b) Data will be heavily skewed towards lower shard numbers, assuming
that the exponent is gradually raised over time, because the lower-numbered
shards will contain all of the data that belongs to them at the current
exponent, plus the data that was placed in them under the earlier, lower
shard exponents.
3) #2b can be solved by re-balancing the objects after raising the shard
exponent. However, this re-balance will require re-placing about half of the
objects (i.e., an O(n) re-balance). With hundreds of millions of objects, such
a re-balance will materially degrade performance for a significant length
of time.
4) In theory the shard exponent and other configuration information might
be reasonably stored in a shared config file (e.g.
mybucket-config/config.json or something). But this requires the client to
perform even *more* reads to the cluster for every operation.
5) Much of the usefulness of RGW ("drop in replacement for S3") is lost.
Also, all of this seems academic, since according to an earlier email from
Yehuda, the bucket index is shared across buckets, so sharding doesn't
actually solve the performance problem (which my earlier testing seems to
bear out).
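For concreteness, the shard-placement scheme and the "hunting" read from
point 2 might look something like this (a minimal Python sketch rather than
the actual PHP client; the MD5 hash choice and the `get_object` callback are
my own placeholders):

```python
import hashlib

def shard_bucket(base, key, exponent):
    """Map an object key to one of 2^exponent shard buckets, using the
    "mybucket-shard-N" naming scheme described above. The hash function
    (MD5 over the key, low bits masked) is a placeholder; any stable
    hash would do."""
    digest = hashlib.md5(key.encode()).digest()
    shard = int.from_bytes(digest[:4], "big") & ((1 << exponent) - 1)
    return "%s-shard-%d" % (base, shard)

def hunting_read(get_object, base, key, exponent):
    """The "hunting" read from point 2: try the bucket at the current
    exponent, then keep reducing the exponent by one until the object
    turns up or exponent 0 is exhausted. get_object(bucket, key) stands
    in for the S3/RGW client call and should return None for a missing
    object. Worst case this issues O(exponent) reads, as noted in 2a."""
    for e in range(exponent, -1, -1):
        obj = get_object(shard_bucket(base, key, e), key)
        if obj is not None:
            return obj
    return None
```

Note that because lowering the exponent just masks off high bits of the
hash, an object written under a lower exponent always sits in a
low-numbered shard, which is exactly the skew described in 2b.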
I've also considered that perhaps the correct solution is to modify RGW to
use a more optimized data store (perhaps an in-memory index built on e.g.
Redis for bucket indexes). Unfortunately, while I've gone spelunking in the
RGW code, I haven't identified the exact code that reads the index (or
maybe I have, but I haven't seen anything that strikes me as the source of
an index performance problem).
In any case, I would greatly welcome some suggestions about the state of
this problem, and where to go looking for the offending code.
All the best,
~ Christopher
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com