On 02/25/2014 06:52 AM, Ilya Dryomov wrote:
> On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <[email protected]> wrote:
>> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>>> This is primarily for rbd's benefit and is supposed to combat
>>> fragmentation:
>>>
>>> "... knowing that rbd images have a 4m size, librbd can pass a hint
>>> that will let the osd do the xfs allocation size ioctl on new files so
>>> that they are allocated in 1m or 4m chunks. We've seen cases where
>>> users with rbd workloads have very high levels of fragmentation in xfs
>>> and this would mitigate that and probably have a pretty nice
>>> performance benefit."
>>>
>>> SETALLOCHINT is considered advisory, so our backwards compatibility
>>> mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
>>>
>>> Signed-off-by: Ilya Dryomov <[email protected]>
>>
>> I have a few small comments for you to consider but this
>> looks good to me.
>>
>> Reviewed-by: Alex Elder <[email protected]>
. . .
>> The other thing is that the expected size is limited by
>> rbd_image_header->obj_order, which is a single byte. I
>> think you should encode this the same way. Even if the
>> hint were for more than RBD, this level of granularity
>> may be sufficient.
>>
>>> + u64 expected_write_size;
>>
>> Probably the same thing here, a byte might be enough
>> to encode this by using log2(expected_write_size).
>>
>>> + __u8 expected_size_probability;
>
> I think in the interest of genericity expected_object_size should be an
> arbitrary, not limited to a power-of-2 value. Now, given the current
> 90M object size limit 64 bits might seem a lot, but extent offset and
> length are 64 bit values and to be future-proof I followed that here.
I have no objection to the 64-bit size but I still think
a byte representing log2(size) is sufficient. Power-of-2
granularity is most likely fine (and might even be what
such a hint gets converted to anyway) for file systems
or other backing store. But again, you can do that with
a 64 bit value as well.
> expected_size_probability is currently unused. It was supposed to be
> a 0-100 value, indicating whether we expect the object to be smaller
> than expected_object_size or not and how strong that expectation is.
>
> I'm not sure if you've seen it, but this (both userspace and kernel
> sides) were implemented according to the design laid out by Sage in the
> "rados io hints" thread on ceph-devel about a month ago. There wasn't
> any discussion that I'm aware of in the interim, that is until the
> recent "wip-hint" thread, which was triggered by my PR.
I'm aware of it---I saw the discussion go by and skimmed
through it but did not look at it very closely. I can't
keep up with all the traffic but I do try to pay attention
to code for review.
>> I'm not sure why these single-byte values use __u8 while the
>> multi-byte values use (e.g.) u32. The leading underscores
>> are supposed to be used for values exposed to user space (or
>> something like that). Anyway, not a problem with your change,
>> but something maybe you could check into.
>
> I'm not sure either. I vaguely assumed it's related to the fact that
> single-byte ceph_osd_req_op fields are directly assigned to the
> corresponding ceph_osd_op fields which are exported to userspace,
> whereas the multi-byte values go through the endianness conversion
> macros, which change the type to __le*.
Not a big deal regardless.
>>> + } hint;
>>> };
>>> };
. . .
>>> + /*
>>> + * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed
>>> + * not worth a feature bit. Set FAILOK per-op flag to make
>>> + * sure older osds don't trip over an unsupported opcode.
>>> + */
>>> + op->flags |= CEPH_OSD_OP_FLAG_FAILOK;
>>
>> I think this is reasonable. The only thing I wonder about is
>> whether there could be any other failure that could occur on
>> an OSD server that actually supports the op that we would care
>> about. It's still advisory though, so functionally it won't
>> matter.
>
> Currently there isn't any such failure. In fact, the only thing that
> this FAILOK hides is the EOPNOTSUPP from an OSD that doesn't support
> the new op. Everything else is done on the backend, and there all
> errors are explicitly ignored because this is an advisory op and it
> would bring down the OSD otherwise.
Sounds good. Thanks for the explanations.
-Alex
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html