On Wed, Sep 23, 2015 at 8:26 AM, Igor Fedotov <ifedo...@mirantis.com> wrote:
>
>
> On 23.09.2015 17:05, Gregory Farnum wrote:
>>
>> On Wed, Sep 23, 2015 at 6:15 AM, Sage Weil <s...@newdream.net> wrote:
>>>
>>> On Wed, 23 Sep 2015, Igor Fedotov wrote:
>>>>
>>>> Hi Sage,
>>>> thanks a lot for your feedback.
>>>>
>>>> Regarding issues with offset mapping and stripe size exposure.
>>>> What about the idea of applying compression only in a two-tier
>>>> (cache + backing storage) model?
>>>
>>> I'm not sure we win anything by making it a two-tier-only thing... simply
>>> making it a feature of the EC pool means we can also address EC pool users
>>> like radosgw.
>>>
>>>> I doubt the single-tier mode is widely used for EC pools, since there is
>>>> no random-write support there, so this might be an acceptable limitation.
>>>> At the same time, the appends caused by flushing a cached object seem to
>>>> have a fixed block size (8 MB by default), and the object is completely
>>>> rewritten on the next flush, if any. This makes the offset mapping less
>>>> tricky. Decompression should be supported in either model, though, since
>>>> shutting the cache tier down and then accessing the compressed data is
>>>> arguably a valid use case.
>>>
>>> Yeah, we need to handle random reads either way, so I think the offset
>>> mapping is going to be needed anyway.
>>
>> The idea of making the primary responsible for object compression
>> really concerns me. It means for instance that a single random access
>> will likely require access to multiple objects, and breaks many of the
>> optimizations we have right now or in the pipeline (for instance:
>> direct client access).
>
> Could you please elaborate on why access to multiple objects is required for
> a single random access?

It sounds to me like you were planning to take an incoming object
write, compress it, and then chunk it. If you do that, the symbols
("abcdefgh = a", "ijklmnop = b", etc.) for the compression are likely
to reside in the first object and will need to be fetched for every
read that lands in the other objects.
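
To make that concrete, here's a toy illustration (plain Python with zlib
standing in for whatever compressor we'd actually use; the two-chunk split is
made up) of what compress-then-chunk would mean for a read that lands entirely
in the second chunk:

  import zlib

  # Compress a whole logical object as one stream, then cut the *compressed*
  # bytes into two "objects", the way a compress-then-stripe scheme would.
  data = (b"abcdefgh" * 4096) + (b"ijklmnop" * 4096)
  stream = zlib.compress(data)
  mid = len(stream) // 2
  chunk0, chunk1 = stream[:mid], stream[mid:]

  # Serve a read of the last 16 bytes of the logical object.  The requested
  # data is encoded somewhere in chunk1, but the decompressor still has to be
  # fed chunk0 first -- the back-references ("symbols") in chunk1 point into it.
  d = zlib.decompressobj()
  plain = d.decompress(chunk0) + d.decompress(chunk1) + d.flush()
  print(plain[-16:] == data[-16:])   # True, but we had to fetch both chunks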

> In my opinion we need to access exactly the same object set as before: in an
> EC pool each appended block is split into multiple shards that go to their
> respective OSDs. In the general case one has to retrieve a set of adjacent
> shards from several OSDs for a single read request.

Usually we just need to get the object info from the primary and then
read whichever object has the data for the requested region. If the
region spans a stripe boundary we might need to get two, but often we
don't...
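
To put the same thing concretely (Python sketch; the 4 MB stripe size below is
just an example, not a claim about any particular pool's layout):

  # Which objects does a logical byte range touch?
  STRIPE = 4 * 1024 * 1024

  def objects_for(offset, length):
      first = offset // STRIPE
      last = (offset + length - 1) // STRIPE
      return list(range(first, last + 1))

  print(objects_for(1024, 4096))           # [0]    -> a single object read
  print(objects_for(STRIPE - 2048, 4096))  # [0, 1] -> crosses a stripe boundary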

> With compression the only difference is the data range that the compressed
> shard set occupies, i.e. we simply need to translate the requested data range
> to the range actually stored and retrieve that data from the OSDs. What's
> missing?
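
If I'm reading you right, you mean keeping a small per-object map from logical
extents to the compressed extents actually stored, something like the sketch
below (the names, granularity and numbers are all made up, not an existing
interface):

  # Each appended/flushed block is compressed on its own and recorded as
  # (logical_offset, logical_len, stored_offset, stored_len).
  extent_map = [
      (0,       8 << 20, 0,       3 << 20),
      (8 << 20, 8 << 20, 3 << 20, 5 << 20),
  ]

  def stored_ranges(offset, length):
      """Translate a requested logical range into stored ranges to fetch."""
      out = []
      end = offset + length
      for log_off, log_len, st_off, st_len in extent_map:
          if log_off < end and offset < log_off + log_len:
              # Fetch (and decompress) the whole stored block that overlaps
              # the request; finer-grained reads would need a finer map.
              out.append((st_off, st_len))
      return out

  print(stored_ranges(10 << 20, 4096))   # -> [(3145728, 5242880)]
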
>>
>> And apparently only the EC pool will support
>> compression, which is frustrating for all the replicated pool users
>> out there...
>
> In my opinion replicated-pool users who care about space savings should
> consider using an EC pool first. They automatically gain 50% space savings
> that way. Compression brings even more savings, but that's rather the second
> step along that path.

EC pools have important limitations that replicated pools don't, like
not supporting object classes or random overwrites. You can stick a
replicated cache pool in front, but that comes with a whole other can
of worms. Anybody with a large enough proportion of active data won't
find that solution suitable, but they might still want to reduce the
space required where they can, e.g. with local compression.

>> Is there some reason we don't just want to apply compression across an
>> OSD store? Perhaps doing it at the filesystem level is the wrong way (for
>> the reasons named above), but there are other mechanisms, like inline block
>> device compression, that I think are supposed to work pretty well.
>
> If I understand the idea of inline block device compression correctly, it has
> some drawbacks similar to the FS compression approach. To name a few:
> * Less flexibility - per-device compression only, with no way to have per-pool
> compression and no control over the compression process.

What would the use case be here? I can imagine not wanting to slow
down your cache pools with it or something (although realistically I
don't think that's a concern unless the sheer CPU usage is a problem
with frequent writes), but those would be on separate OSDs/volumes
anyway.
Plus, block device compression is also able to cover all the *other*
stuff that doesn't fit inside the object proper (xattrs and omap).

> * Potentially higher overhead in operation - there is no way to skip
> processing data that isn't compressible, e.g. the shards holding erasure
> codes.

My information theory intuition has never been very good, but I don't
think the coded chunks are any less compressible than the data they're
coding for, in general...

> * Potentially higher overhead for recovery on OSD death - one needs to
> decompress the data on the surviving OSDs and recompress it on the new OSD.
> That's not necessary if compression takes place prior to EC, though.

Hmm, that is an interesting point. I guess I'm just not sure about the
labor and validation tradeoffs involved in obtaining this (it really
seems like the only advantage to me).


...I should note that I'm under the impression that transparent
compression already exists at some level which can be stacked with
regular filesystems, but I'm not finding it now, so maybe I'm
misinformed and the tradeoffs are a little different than I thought.
But I still don't like the idea of doing it on a primary just for EC
pools – I think if we were going to take that approach it'd be easier
to compress somewhere before it reaches the EC/replicated split?
-Greg